Supervised classification of law area in the legal domain


AFSTUDEERPROJECT BSC KI (BSc Artificial Intelligence graduation project)

Supervised classification of law area in the legal domain

Author: Mees Fröberg
Supervisors: Evangelos Kanoulas, Tjerk de Greef

June 24, 2016

Abstract

Search algorithms have been implemented to make online legal data more accessible. They could be improved by attaching meta-data (e.g. law area) to documents, but this meta-data is often missing. Text classification algorithms have been used to retrieve such meta-data automatically, although their performance is not yet high enough for practical use; closer cooperation between legal and machine learning experts has therefore been suggested. This research compares the performance of flat and hierarchical classification algorithms and the effect of different kinds of features. The results indicate that the hierarchical classification algorithm performs best overall, although its training and testing times are significantly longer. Term frequency models also perform slightly better than other feature extraction methods such as LDA and word vectors, even though they take more time to train and test.

Contents

1 Introduction
2 Related Literature
3 Method
  3.1 Classification Algorithms
    3.1.1 Flat classification
    3.1.2 Hierarchical classification
  3.2 Features
    3.2.1 Term Frequency
    3.2.2 Term Frequency-Inverse Document Frequency
    3.2.3 Word Vector
    3.2.4 Latent Dirichlet Allocation
  3.3 Evaluation
4 Results
5 Conclusion
6 Appendix A

1 Introduction

The Internet contains an enormous and continuously growing number of legal documents. Search algorithms have been implemented to make this data more accessible, and they could be improved further by meta-data (e.g. law area) attached to the documents. Some documents carry such meta-data, but most lack it. Manually labeling these documents is time-consuming because of their enormous and increasing number, so text classification algorithms, which take considerably less time, are used to label them automatically. Previous research has shown that multilabel classification algorithms can be used to classify legal documents (Mencía & Fürnkranz, n.d.). Meta-data usually has a hierarchical structure (as law area does); Mencía and Fürnkranz, however, make a flat prediction and thus do not use this structure. This leads to the following question: how do hierarchical classification algorithms compare to flat classification algorithms in the legal domain in terms of performance?

Different features can be extracted from the documents to serve as input for the classification algorithms. These have a substantial influence, arguably the largest, on the performance of classification models. Therefore the second question that will be addressed is: what kinds of features should be used to enhance the performance of classification models? The overall question to be answered is: how do hierarchical algorithms compare to flat classification algorithms in the legal domain in terms of performance, and what kinds of features enhance this performance?

A large public dataset of labeled Dutch legal documents has been used to answer this question. This dataset is skewed: there is more information about certain categories than about others, which influences the performance of the classification algorithms.

The next section (Section 2) describes related research. The research method is explained in Section 3, and the results are presented and examined in Sections 4 and 5.

2 Related Literature

Research has shown that the performance of classification algorithms for legal documents is not high enough to be of practical use (Governatori, 2009). Governatori suggested that closer cooperation between legal and machine learning experts is needed to improve this performance, which demonstrates the need for research such as this. Previous research in text classification of legal documents indicated that multilabel (flat) classification algorithms can be used to classify legal documents (Mencía & Fürnkranz, n.d.). Since that work was applied to the legal domain, it is closely related to, and plausibly transfers to, classification into law area. The multilabel algorithm used by Mencía and Fürnkranz is based on a support vector machine, so the flat classification algorithm in this research will also be based on support vector machines. Fan et al. (2008) present a linear support vector machine (LIBLINEAR) that trains efficiently on large-scale problems. This classifier will be used, since the dataset is large and some features could present sparsity problems.

Other research showed that hierarchical classification methods can be used to classify documents into pre-defined topic hierarchies (Cai & Hofmann, 2004). Since most meta-data (including law area) has a hierarchical structure, such algorithms could be successful at classifying law area and could perform better in the legal domain than a flat classification algorithm. Bi and Kwok (2011) recently presented a novel hierarchical multilabel classification algorithm that works on both tree- and DAG-structured hierarchies. They claim that it does not suffer from insufficient or skewed training data during classifier training; this is the primary reason the algorithm will be used and tested here, since the available dataset is skewed. There are numerous features that can be extracted from text documents and used by classification algorithms. Research has demonstrated that Latent Dirichlet Allocation (LDA) is successful at finding topics of legal documents (Raghuveer, 2012); Hoffman et al. (2010) developed an efficient Variational Bayes algorithm for LDA, which will be used in this research. Other research has shown that word vectors can represent the meaning of a word with high accuracy (Mikolov et al., 2013). These features will be extracted from the available dataset, and their effect on the performance of classification models will be measured and compared with each other and with bag-of-words implementations.

3 Method

The overall approach to answering the research question is to adapt a flat and a hierarchical classification algorithm to this problem. Once implemented, they can be trained and tested with different features. After these experiments, the results are evaluated by comparing the performances of the different classification algorithms and features with each other.
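As an illustration of this pipeline (extract features, train a classifier, predict), a minimal sketch is shown below. The use of scikit-learn, the toy Dutch snippets, and the label names are assumptions for illustration only; the thesis states only that the algorithms were implemented in Python.

```python
# Hypothetical end-to-end sketch: extract features from documents, train a
# flat one-vs-all classifier, and predict the law area of a new document.
# The documents and law-area labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

docs = [
    "huurcontract ontbonden door de kantonrechter",            # civil law
    "verdachte veroordeeld wegens diefstal",                   # criminal law
    "echtscheiding en verdeling van de gezamenlijke boedel",   # civil law
    "beroep tegen een besluit van de gemeente",                # administrative law
]
labels = ["civil", "criminal", "civil", "administrative"]

vectorizer = TfidfVectorizer()          # feature extraction step
X = vectorizer.fit_transform(docs)      # documents -> sparse vectors

clf = OneVsRestClassifier(LinearSVC())  # one binary linear SVM per class
clf.fit(X, labels)

new_doc = vectorizer.transform(["veroordeeld wegens diefstal van een fiets"])
prediction = clf.predict(new_doc)       # one of the known law areas
```

The same skeleton applies to the other feature extractors and classifiers discussed below: only the vectorizer and the classifier object change.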
3.1 Classification Algorithms

The overall approach of every classification algorithm is the same. It requires training data {(x_i, y_i)}, i = 1, ..., n, where x_i is a vector representing a data point (a document) and y_i the class associated with that data point. Once the algorithm has used the training data to fit its parameters, it can classify an unseen data point. It is a multilabel classification problem if a single data point can belong to more than one class at the same time. In this research two multilabel classification algorithms are compared: a flat multilabel classification algorithm and a hierarchical one. The main difference between them is that a flat classification algorithm does not use the structure of the labels it classifies into, while the hierarchical one does; this should lead to a difference in performance. Although the algorithms differ, they must share some properties for this comparison to be possible: both must be able to assign multiple labels, since a legal document often covers multiple law areas, and the performance of both must be measured in the same way. How the performance is measured and the algorithms are compared is explained in the Evaluation section (3.3).
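Such multilabel training data is commonly represented as a binary label matrix with one column per class. A minimal sketch, assuming scikit-learn and hypothetical law-area names:

```python
# Multilabel targets: each document may carry more than one law area,
# e.g. a ruling touching both civil and tax law. Labels are illustrative.
from sklearn.preprocessing import MultiLabelBinarizer

y = [["civil"], ["criminal"], ["civil", "tax"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)   # binary indicator matrix, one column per class
print(mlb.classes_)        # ['civil' 'criminal' 'tax']
print(Y)
# [[1 0 0]
#  [0 1 0]
#  [1 0 1]]
```

Both the flat and the hierarchical algorithm ultimately predict rows of such a matrix, which is what makes their outputs directly comparable.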

3.1.1 Flat classification

The flat classification algorithm used in this research applies a one-vs-all strategy, which trains a binary classifier for every class: each class is fitted against all the other classes, so d classes yield d binary classifiers. Given a data point, these classifiers calculate a confidence score for every class, and the classification algorithm then selects the classes with high confidence scores, which makes classification into multiple classes possible. Any classification algorithm that can produce such a confidence score is applicable to the one-vs-all method. In this research linear support vector machines are used in combination with the one-vs-all strategy to make a multilabel prediction; Mencía and Fürnkranz (n.d.) successfully used support vector machines in their research. Fan et al. (2008) describe this classification algorithm. Its basic idea is to create one or more hyperplanes that separate the classes; the distance of a data point to such a hyperplane yields a confidence score, which the one-vs-all method then uses to classify the data point.

Figure 1: An example hierarchy; the black nodes are leaves. When classifying into the leaves, the classes higher up in the hierarchy can also be obtained.

Given this algorithm, data points can be classified into multiple classes, although this does not necessarily make it possible to classify into a pre-defined hierarchy. That can be achieved by classifying into the leaves of the hierarchy (see Figure 1), but then the label set of every data point must contain a leaf. Since this is not the case with law area, the flat multilabel classification algorithm used in this research classifies over all the classes in the hierarchy.

3.1.2 Hierarchical classification

Hierarchical classification algorithms use the predefined structure of the classes (the hierarchy). Bi and Kwok (2011) recently implemented such an algorithm. It requires a value L, the number of labels to assign to each data point; for this research L is set to 2, since most legal documents are labeled with two law areas. The algorithm is similar to the flat classification algorithm in that a linear support vector machine is trained for each class, but it uses PCA to reduce the number of classifiers that have to be trained. This mitigates the negative effects of skewed data, because all the classes

in the projected space can learn from the whole training set. Once these classifiers have predicted their confidence values, the resulting matrix is projected back to the original class space.

Figure 2: Two iterations of CSSA: (a) initialized supernodes, (b) after one iteration, (c) after two iterations. A node in the hierarchy is black if its ψ equals 1; the blue bubbles around nodes represent supernodes, and the value within a node is the confidence score of its class.

After these confidence scores are calculated, the Condensing Sort and Select Algorithm (CSSA) is used to classify the data point into the predefined hierarchy. If there are d different classes in the hierarchy, then ψ_1, ..., ψ_d with ψ_i ∈ {0, 1} indicates which labels are assigned to the data point. CSSA initializes ψ_0 = 1 (the root of the hierarchy) and treats every node as a supernode. A supernode is a list of nodes; it is assigned a supernode value (SNV), the average of the confidence values over all its constituent nodes (Bi & Kwok, 2011). CSSA sorts the supernodes by SNV and selects the one with the highest SNV. If the ψ value of the parent of the supernode equals 1, the ψ values of all nodes in the supernode are set to 1; otherwise the supernode is condensed with its parent supernode (see Figure 2). This process continues until the number of assigned nodes is greater than or equal to L.

3.2 Features

The classification algorithms described above require vectors that represent the data points. The available dataset consists of legal documents, so every data point is a document. There are multiple techniques for extracting such vectors from documents; the next subsections describe the features used in this research.
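The CSSA selection loop described above can be sketched as follows. This is a hypothetical, simplified re-implementation for a tree-shaped hierarchy only (the original algorithm also handles DAGs), not the authors' code; the `parent`/`score` representation and node numbering are illustrative assumptions.

```python
# Simplified sketch of CSSA on a tree. parent[i] is the parent of node i
# (None for the root, node 0), score[i] is the classifier confidence for
# node i, and L is the number of labels to assign below the root.
def cssa(parent, score, L):
    assigned = {0}  # psi_0 = 1: the root is always assigned
    # every non-root node starts as its own supernode (a list of nodes)
    supernodes = [[i] for i in range(1, len(score))]
    while len(assigned) < L + 1:
        # SNV of a supernode = average confidence over its constituent nodes
        supernodes.sort(key=lambda sn: sum(score[i] for i in sn) / len(sn),
                        reverse=True)
        best = supernodes.pop(0)          # supernode with the highest SNV
        top = best[0]                     # its node closest to the root
        if parent[top] in assigned:
            # parent already has psi = 1: assign all nodes in the supernode
            assigned.update(best)
        else:
            # otherwise condense it with the supernode containing its parent
            for sn in supernodes:
                if parent[top] in sn:
                    sn.extend(best)
                    break
    return assigned
```

For example, on a five-node tree with parents [None, 0, 0, 1, 1], scores [0.0, 0.9, 0.2, 0.8, 0.1] and L = 2, the sketch selects the root plus nodes 1 and 3: the two highest-scoring nodes that still form a connected subtree with the root.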
Once these features are extracted from the data, they can be compared (as explained further in Section 3.3).

3.2.1 Term Frequency

A basic approach to quantifying a document as a vector is counting the Term Frequency (TF) of the words (terms) in the document; this is also known as a Bag of Words (BOW) model. Let k be the number of distinct terms in a corpus. A BOW model computes a k-dimensional vector representing a given document: every dimension corresponds to a term, and its value equals the number of times that term occurs in the document. Terms that have not been seen by the BOW model are discarded, so the model should cover most (if not all) of the terms in the documents to prevent information loss. Note that a term does not have to be a single word; it can also be a sequence of words, and the length of this sequence has to be chosen when building the BOW model. If this

size is equal to one, the terms are called unigrams. This research only covers unigrams, although bigrams and trigrams could be used as well; a more extensive study of the effect of different n-grams on classification into law area is left for future work.

3.2.2 Term Frequency-Inverse Document Frequency

Some words in Dutch text carry little meaning (e.g. de, het, een), yet TF gives these terms very high values. Term Frequency-Inverse Document Frequency (TF-IDF) adjusts for this, giving meaningful terms a higher value than terms with little meaning: the term counts are divided by the document frequency of the term, i.e. the number of documents in which the term occurs. Dutch words like de and het occur in almost all documents, so their values decrease significantly. Consequently TF-IDF should give a more precise representation of a document than TF.

3.2.3 Word Vector

Besides term frequencies, there are other ways to map a document into a vector space. Mikolov et al. (2013) proposed two methods that efficiently estimate the meaning of words in vector space. One of these, used in this research, is the Continuous Bag of Words model (CBOW), which estimates the meaning of a word from its context.

Figure 3: Overview of the Continuous Bag of Words model.

CBOW contains three layers (see Figure 3). The input layer creates 1-of-V codings of the surrounding words; the projection layer projects them into the same position; and the output layer adds these projections together, resulting in the word vector. The projection layer can project the words into vectors of different sizes, which could influence the classification process, so different word vector sizes are tested and compared in this research. There are 32 leaves in

the law area hierarchy, so the smallest vector size that will be evaluated is 32; the other sizes are 50, 100 and 200.

3.2.4 Latent Dirichlet Allocation

The final feature analysed in this research is Latent Dirichlet Allocation (LDA). This method assumes that every document can be seen as a mixture of multiple topics. Hoffman et al. (2010) developed a Variational Bayes (VB) algorithm for LDA. Given K topics and a set of documents, an LDA model can be trained: for every distinct word in the corpus, K topic weights are estimated from the training set, and these are then used to calculate the topic probabilities of a given document. These probabilities form a vector that the classification algorithms can use. As with CBOW, the dimensionality of LDA depends on the chosen number of topics K, so different values of K will be evaluated: as with the word vectors, the smallest is 32, and models with K equal to 50, 100 and 200 will also be trained and tested.

3.3 Evaluation

Once all the models and algorithms are in place, they are ready for evaluation. The evaluation is divided into two parts: the evaluation of the classification algorithms and the evaluation of the features. Both use the same performance measurements:

- Accuracy
- Main accuracy
- Sub accuracy
- Precision (micro/macro)
- Recall (micro/macro)
- F1-score (micro/macro)
- Training time
- Testing time

The hierarchy of law areas contains two layers, the main layer and the sub layer (see Figure 4). Main accuracy is the accuracy on the law areas in the main layer, and sub accuracy on those in the sub layer. The difference between macro and micro metrics is that macro calculates the metric for each label separately and averages those values, while micro calculates it globally.
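The micro/macro distinction can be illustrated with a small, hypothetical skewed example, assuming scikit-learn's metrics:

```python
# Toy skewed example (invented label names): 8 documents of a frequent law
# area and 2 of a rare one. A classifier that always predicts the frequent
# class still obtains a high micro F1, while the macro F1 is dragged down
# by the rare class it never predicts.
from sklearn.metrics import f1_score

y_true = ["civil"] * 8 + ["tax"] * 2
y_pred = ["civil"] * 10             # the rare class is always missed

micro = f1_score(y_true, y_pred, average="micro")  # 0.8
macro = f1_score(y_true, y_pred, average="macro")  # (8/9 + 0) / 2 ≈ 0.44
```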
Due to the skewed data, the macro metrics will probably be significantly lower than the micro metrics, since the algorithm receives less information about certain law areas and will therefore probably perform worse on them. Together, these measurements provide insight into the performance of a classification algorithm. To obtain reliable performance estimates, k-fold cross validation is applied: the whole dataset is divided into 5 sets, and for every algorithm and feature combination five models are trained and tested, each time on a different train and test split. Once these models are trained, their performance is measured and averaged over the five models. This process is repeated for multiple training set sizes, which provides insight into the effect of training set size on performance: even though the largest training set will probably perform best, this indicates whether a very large training set is necessary to achieve good performance.
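The 5-fold scheme described above can be sketched as follows, assuming scikit-learn and synthetic data in place of the real documents:

```python
# Sketch of 5-fold cross validation: each model is trained and tested on a
# different split, and the reported score is the average over the folds.
# The data here is synthetic, standing in for the document vectors.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(50, 4)                  # 50 synthetic "documents"
y = (X[:, 0] > 0).astype(int)         # a linearly separable toy target

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    clf = LinearSVC().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

mean_accuracy = sum(scores) / len(scores)  # average over the five folds
```

In the experiments below the same scheme is repeated for every algorithm, feature and training set size combination.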

Figure 4: Law area hierarchy.

Once all these performance values are obtained, the differences between the flat and hierarchical classification algorithms are compared per feature. This provides insight into the difference between flat and hierarchical classification algorithms in the legal domain in terms of performance. The evaluation of the features is somewhat more complicated, since their dimensionalities differ. Ideally every feature would quantify each document as a vector of the same size; LDA and word vectors can be adjusted to do so, and applying principal component analysis to TF and TF-IDF could achieve this as well, but that would cause information loss and therefore suboptimal use of those features. Consequently the best LDA and word vector models for the flat and hierarchical classification algorithms are used in the evaluation of the features, and the features are compared per classification algorithm (results for the flat classification algorithm are compared with one another, and likewise for the hierarchical algorithm).

4 Results

Several experiments were conducted during this research. Every algorithm and feature extraction method was implemented in Python, and all measurements are presented in Appendix A. The largest training set used comprised 50% of the available training data; the other models were trained on 1%, 10%, 20%, 30% and 40%. As the training set size increases, the performance and the train/test times increase as well, although the gains are significantly smaller beyond a size of 40% (see Appendix A). For all models the macro metrics are significantly lower than the micro metrics, which is probably caused by the skewness of the data (see Table 6). The overall performance of the hierarchical classification algorithm is higher than that of the flat classification algorithm (see Appendix A): its accuracies, especially the sub accuracy, are significantly higher, while the micro and macro metrics are similar.

The performance of the different word vector and LDA models on the flat and hierarchical classification algorithms was compared using models trained on 50% of the training set; the differences are displayed in Figures 5 and 6. These graphs indicate that performance increases with dimensionality, although LDA200 performs only slightly better than LDA100 (see Table 6); the training time of LDA200 is the only value that is significantly higher than that of LDA100.

Figure 5: WORDVEC performance measurements per dimensionality for the flat and hierarchical algorithms. The yellow bars represent the accuracy values, the blue bars the micro F1-scores and the red bars the macro F1-scores.

Figure 6: LDA performance measurements per dimensionality for the flat and hierarchical algorithms. The yellow bars represent the accuracy values, the blue bars the micro F1-scores and the red bars the macro F1-scores.

5 Conclusion

The results of this research suggest that hierarchical classification algorithms perform better than flat classification algorithms in the legal domain, although they are more time-consuming. Among the features, Term Frequency-Inverse Document Frequency enhances these algorithms the most, although it also takes more time to train and test than the other features. The difference between the word vector model and TF-IDF is not substantial except for the training time; the word vector model takes less time to train. Training time, however, should not weigh heavily in judging a classification algorithm, since an algorithm only has to be trained once. Consequently this research indicates that, of the features addressed here, TF-IDF enhances the performance of classification algorithms the most. It is hard, though, to compare TF and TF-IDF fairly with the other features, since their dimensionalities differ: the vectors produced by TF-IDF are significantly larger than those of the LDA or word vector models, so these features have more room to express themselves.

This research provides some insight into the classification of legal documents, but there are still gaps to explore. In future work other features could be evaluated, new features (e.g. named entities) could be explored, or the features discussed here could be combined; a combination of different features might perform even better. Furthermore, other flat and hierarchical classification algorithms could be compared to strengthen the claims made in this research, if the same conclusions can be derived.

6 Appendix A

The following tables report, for each model (TF, TF-IDF, the WORDVEC models and the LDA models, under both the flat and the hierarchical algorithm), the accuracy, main accuracy, sub accuracy, micro and macro precision, recall and F1, and the train and test times in seconds.

Table 1: Results of models trained on 1% of the whole train set.

Table 2: Results of models trained on 10% of the whole train set.

Table 3: Results of models trained on 20% of the whole train set.

Table 4: Results of models trained on 30% of the whole train set.

Table 5: Results of models trained on 40% of the whole train set.

Table 6: Results of models trained on 50% of the whole train set.

References

Bi, W., & Kwok, J. T. (2011). Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Cai, L., & Hofmann, T. (2004). Hierarchical document categorization with support vector machines. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9.

Governatori, G. (2009). Exploiting properties of legislative texts to improve classification accuracy. In Legal Knowledge and Information Systems: JURIX 2009, The Twenty-Second Annual Conference (Vol. 205, p. 136).

Hoffman, M., Bach, F. R., & Blei, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems.

Mencía, E. L., & Fürnkranz, J. (n.d.). Efficient multilabel classification algorithms for large-scale problems in the legal domain.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint.

Raghuveer, K. (2012). Legal documents clustering using latent Dirichlet allocation. IAES Int. J. Artif. Intell., 2(1).


International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System

International Journal of Advance Engineering and Research Development. A Facebook Profile Based TV Shows and Movies Recommendation System Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 3, March -2017 A Facebook Profile Based TV Shows and Movies Recommendation

More information

PTE : Predictive Text Embedding through Large-scale Heterogeneous Text Networks

PTE : Predictive Text Embedding through Large-scale Heterogeneous Text Networks PTE : Predictive Text Embedding through Large-scale Heterogeneous Text Networks Pramod Srinivasan CS591txt - Text Mining Seminar University of Illinois, Urbana-Champaign April 8, 2016 Pramod Srinivasan

More information

Machine Learning Final Project

Machine Learning Final Project Machine Learning Final Project Team: hahaha R01942054 林家蓉 R01942068 賴威昇 January 15, 2014 1 Introduction In this project, we are asked to solve a classification problem of Chinese characters. The training

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

HIERARCHICAL MULTILABEL CLASSIFICATION

HIERARCHICAL MULTILABEL CLASSIFICATION HIERARCHICAL MULTILABEL CLASSIFICATION Pooja D. Bankar 1 S.S.Sane 2 1 M.E. Student 2 HOD & Vice Principal 1,2 Department of Computer Engineering 1,2 K. K. W. I. E. E. R., Nashik Savitribai Phule Pune University,

More information

Automatic Labeling of Issues on Github A Machine learning Approach

Automatic Labeling of Issues on Github A Machine learning Approach Automatic Labeling of Issues on Github A Machine learning Approach Arun Kalyanasundaram December 15, 2014 ABSTRACT Companies spend hundreds of billions in software maintenance every year. Managing and

More information

Final Report: Keyword extraction from text

Final Report: Keyword extraction from text Final Report: Keyword extraction from text Gowtham Rangarajan R and Sainyam Galhotra December 2015 Abstract We have devised a model for tagging stack overflow questions with keywords which help classify

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

Clustering using Topic Models

Clustering using Topic Models Clustering using Topic Models Compiled by Sujatha Das, Cornelia Caragea Credits for slides: Blei, Allan, Arms, Manning, Rai, Lund, Noble, Page. Clustering Partition unlabeled examples into disjoint subsets

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect BEOP.CTO.TP4 Owner: OCTO Revision: 0001 Approved by: JAT Effective: 08/30/2018 Buchanan & Edwards Proprietary: Printed copies of

More information

Influence of Word Normalization on Text Classification

Influence of Word Normalization on Text Classification Influence of Word Normalization on Text Classification Michal Toman a, Roman Tesar a and Karel Jezek a a University of West Bohemia, Faculty of Applied Sciences, Plzen, Czech Republic In this paper we

More information

CSE 158. Web Mining and Recommender Systems. Midterm recap

CSE 158. Web Mining and Recommender Systems. Midterm recap CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158

More information

Learning the Structures of Online Asynchronous Conversations

Learning the Structures of Online Asynchronous Conversations Learning the Structures of Online Asynchronous Conversations Jun Chen, Chaokun Wang, Heran Lin, Weiping Wang, Zhipeng Cai, Jianmin Wang. Tsinghua University Chinese Academy of Science Georgia State University

More information

Computer Vision. Exercise Session 10 Image Categorization

Computer Vision. Exercise Session 10 Image Categorization Computer Vision Exercise Session 10 Image Categorization Object Categorization Task Description Given a small number of training images of a category, recognize a-priori unknown instances of that category

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Skiing Seminar Information Retrieval 2010/2011 Introduction to Information Retrieval Prof. Ulrich Müller-Funk, MScIS Andreas Baumgart and Kay Hildebrand Agenda 1 Boolean

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Machine Learning Potsdam, 26 April 2012 Saeedeh Momtazi Information Systems Group Introduction 2 Machine Learning Field of study that gives computers the ability to learn without

More information

CS294-1 Final Project. Algorithms Comparison

CS294-1 Final Project. Algorithms Comparison CS294-1 Final Project Algorithms Comparison Deep Learning Neural Network AdaBoost Random Forest Prepared By: Shuang Bi (24094630) Wenchang Zhang (24094623) 2013-05-15 1 INTRODUCTION In this project, we

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

A Hybrid Neural Model for Type Classification of Entity Mentions

A Hybrid Neural Model for Type Classification of Entity Mentions A Hybrid Neural Model for Type Classification of Entity Mentions Motivation Types group entities to categories Entity types are important for various NLP tasks Our task: predict an entity mention s type

More information

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017 CPSC 340: Machine Learning and Data Mining Probabilistic Classification Fall 2017 Admin Assignment 0 is due tonight: you should be almost done. 1 late day to hand it in Monday, 2 late days for Wednesday.

More information

FastText. Jon Koss, Abhishek Jindal

FastText. Jon Koss, Abhishek Jindal FastText Jon Koss, Abhishek Jindal FastText FastText is on par with state-of-the-art deep learning classifiers in terms of accuracy But it is way faster: FastText can train on more than one billion words

More information

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple

More information

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering

CS435 Introduction to Big Data Spring 2018 Colorado State University. 3/21/2018 Week 10-B Sangmi Lee Pallickara. FAQs. Collaborative filtering W10.B.0.0 CS435 Introduction to Big Data W10.B.1 FAQs Term project 5:00PM March 29, 2018 PA2 Recitation: Friday PART 1. LARGE SCALE DATA AALYTICS 4. RECOMMEDATIO SYSTEMS 5. EVALUATIO AD VALIDATIO TECHIQUES

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Visualization and text mining of patent and non-patent data

Visualization and text mining of patent and non-patent data of patent and non-patent data Anton Heijs Information Solutions Delft, The Netherlands http://www.treparel.com/ ICIC conference, Nice, France, 2008 Outline Introduction Applications on patent and non-patent

More information

Vector Semantics. Dense Vectors

Vector Semantics. Dense Vectors Vector Semantics Dense Vectors Sparse versus dense vectors PPMI vectors are long (length V = 20,000 to 50,000) sparse (most elements are zero) Alternative: learn vectors which are short (length 200-1000)

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

Chapter 3: Supervised Learning

Chapter 3: Supervised Learning Chapter 3: Supervised Learning Road Map Basic concepts Evaluation of classifiers Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Summary 2 An example

More information

MINING OPERATIONAL DATA FOR IMPROVING GSM NETWORK PERFORMANCE

MINING OPERATIONAL DATA FOR IMPROVING GSM NETWORK PERFORMANCE MINING OPERATIONAL DATA FOR IMPROVING GSM NETWORK PERFORMANCE Antonio Leong, Simon Fong Department of Electrical and Electronic Engineering University of Macau, Macau Edison Lai Radio Planning Networks

More information

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING

CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

Movie Recommender System - Hybrid Filtering Approach

Movie Recommender System - Hybrid Filtering Approach Chapter 7 Movie Recommender System - Hybrid Filtering Approach Recommender System can be built using approaches like: (i) Collaborative Filtering (ii) Content Based Filtering and (iii) Hybrid Filtering.

More information

Lecture 25: Review I

Lecture 25: Review I Lecture 25: Review I Reading: Up to chapter 5 in ISLR. STATS 202: Data mining and analysis Jonathan Taylor 1 / 18 Unsupervised learning In unsupervised learning, all the variables are on equal standing,

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4

More information

RSDC 09: Tag Recommendation Using Keywords and Association Rules

RSDC 09: Tag Recommendation Using Keywords and Association Rules RSDC 09: Tag Recommendation Using Keywords and Association Rules Jian Wang, Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University, Bethlehem, PA 18015 USA

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

Concept-Based Document Similarity Based on Suffix Tree Document

Concept-Based Document Similarity Based on Suffix Tree Document Concept-Based Document Similarity Based on Suffix Tree Document *P.Perumal Sri Ramakrishna Engineering College Associate Professor Department of CSE, Coimbatore perumalsrec@gmail.com R. Nedunchezhian Sri

More information

Deep Learning With Noise

Deep Learning With Noise Deep Learning With Noise Yixin Luo Computer Science Department Carnegie Mellon University yixinluo@cs.cmu.edu Fan Yang Department of Mathematical Sciences Carnegie Mellon University fanyang1@andrew.cmu.edu

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 2017 International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 4, Jul Aug 17 RESEARCH ARTICLE OPEN ACCESS Classifying Brain Dataset Using Classification Based Association Rules

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Multimodal topic model for texts and images utilizing their embeddings

Multimodal topic model for texts and images utilizing their embeddings Multimodal topic model for texts and images utilizing their embeddings Nikolay Smelik, smelik@rain.ifmo.ru Andrey Filchenkov, afilchenkov@corp.ifmo.ru Computer Technologies Lab IDP-16. Barcelona, Spain,

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Information-Theoretic Feature Selection Algorithms for Text Classification

Information-Theoretic Feature Selection Algorithms for Text Classification Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 5 Information-Theoretic Feature Selection Algorithms for Text Classification Jana Novovičová Institute

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

1 Document Classification [60 points]

1 Document Classification [60 points] CIS519: Applied Machine Learning Spring 2018 Homework 4 Handed Out: April 3 rd, 2018 Due: April 14 th, 2018, 11:59 PM 1 Document Classification [60 points] In this problem, you will implement several text

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Powered Outer Probabilistic Clustering

Powered Outer Probabilistic Clustering Proceedings of the World Congress on Engineering and Computer Science 217 Vol I WCECS 217, October 2-27, 217, San Francisco, USA Powered Outer Probabilistic Clustering Peter Taraba Abstract Clustering

More information

Efficient Voting Prediction for Pairwise Multilabel Classification

Efficient Voting Prediction for Pairwise Multilabel Classification Efficient Voting Prediction for Pairwise Multilabel Classification Eneldo Loza Mencía, Sang-Hyeun Park and Johannes Fürnkranz TU-Darmstadt - Knowledge Engineering Group Hochschulstr. 10 - Darmstadt - Germany

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

DI TRANSFORM. The regressive analyses. identify relationships

DI TRANSFORM. The regressive analyses. identify relationships July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,

More information

An Improvement of Centroid-Based Classification Algorithm for Text Classification

An Improvement of Centroid-Based Classification Algorithm for Text Classification An Improvement of Centroid-Based Classification Algorithm for Text Classification Zehra Cataltepe, Eser Aygun Istanbul Technical Un. Computer Engineering Dept. Ayazaga, Sariyer, Istanbul, Turkey cataltepe@itu.edu.tr,

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

TISA Methodology Threat Intelligence Scoring and Analysis

TISA Methodology Threat Intelligence Scoring and Analysis TISA Methodology Threat Intelligence Scoring and Analysis Contents Introduction 2 Defining the Problem 2 The Use of Machine Learning for Intelligence Analysis 3 TISA Text Analysis and Feature Extraction

More information

Learning to Recognize Faces in Realistic Conditions

Learning to Recognize Faces in Realistic Conditions 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

The use of frequent itemsets extracted from textual documents for the classification task

The use of frequent itemsets extracted from textual documents for the classification task The use of frequent itemsets extracted from textual documents for the classification task Rafael G. Rossi and Solange O. Rezende Mathematical and Computer Sciences Institute - ICMC University of São Paulo

More information

Plagiarism Detection Using FP-Growth Algorithm

Plagiarism Detection Using FP-Growth Algorithm Northeastern University NLP Project Report Plagiarism Detection Using FP-Growth Algorithm Varun Nandu (nandu.v@husky.neu.edu) Suraj Nair (nair.sur@husky.neu.edu) Supervised by Dr. Lu Wang December 10,

More information

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task Xiaolong Wang, Xiangying Jiang, Abhishek Kolagunda, Hagit Shatkay and Chandra Kambhamettu Department of Computer and Information

More information

An Empirical Study on Lazy Multilabel Classification Algorithms

An Empirical Study on Lazy Multilabel Classification Algorithms An Empirical Study on Lazy Multilabel Classification Algorithms Eleftherios Spyromitros, Grigorios Tsoumakas and Ioannis Vlahavas Machine Learning & Knowledge Discovery Group Department of Informatics

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Birkbeck (University of London)

Birkbeck (University of London) Birkbeck (University of London) MSc Examination for Internal Students Department of Computer Science and Information Systems Information Retrieval and Organisation (COIY64H7) Credit Value: 5 Date of Examination:

More information

Using Machine Learning for Classification of Cancer Cells

Using Machine Learning for Classification of Cancer Cells Using Machine Learning for Classification of Cancer Cells Camille Biscarrat University of California, Berkeley I Introduction Cell screening is a commonly used technique in the development of new drugs.

More information

Optimizing the Hyperparameter of Feature Extraction and Machine Learning Classification Algorithms

Optimizing the Hyperparameter of Feature Extraction and Machine Learning Classification Algorithms Optimizing the Hyperparameter of Feature Extraction and Machine Learning Classification Algorithms Sani Muhammad Isa 1, Rizaldi Suwandi 2, Yosefina Pricilia Andrean 3 Computer Science Department, BINUS

More information

Allstate Insurance Claims Severity: A Machine Learning Approach

Allstate Insurance Claims Severity: A Machine Learning Approach Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has

More information