Supervised classification of law area in the legal domain

Graduation project, BSc Artificial Intelligence (Afstudeerproject BSc KI)
Author: Mees FRÖBERG (10559949)
Supervisors: Evangelos KANOULAS, Tjerk DE GREEF
June 24, 2016

Abstract

Search algorithms have been implemented to make online legal data more accessible. These could be improved by the use of meta-data (e.g. law area), but this meta-data is often missing. Text classification algorithms have been used to retrieve it automatically, but their performance is not yet high enough to be of practical use, and closer cooperation between legal and machine learning experts has therefore been suggested. In this research the performance of flat and hierarchical classification algorithms and the effects of different kinds of features are compared. The results indicate that the hierarchical classification algorithm performs best overall, although its training and testing times are significantly longer. Term frequency models also perform slightly better than other feature extraction methods such as LDA and word vectors, even though they take more time to train and test.

Contents

1 Introduction
2 Related Literature
3 Method
  3.1 Classification Algorithms
    3.1.1 Flat classification
    3.1.2 Hierarchical classification
  3.2 Features
    3.2.1 Term Frequency
    3.2.2 Term Frequency-Inverse Document Frequency
    3.2.3 Word Vector
    3.2.4 Latent Dirichlet Allocation
  3.3 Evaluation
4 Results
5 Conclusion
6 Appendix A

1 Introduction

The Internet contains an enormous and continuously growing number of legal documents. Search algorithms have been implemented to make this data more accessible. These could be improved by the use of meta-data (e.g. law area) attached to the legal documents. Some documents contain meta-data, but most lack it. Manually labeling these documents is time-consuming because of their enormous and increasing number. Text classification algorithms, which take considerably less time to label documents, are therefore used to classify legal documents automatically.

Previous research has shown that multilabel classification algorithms can be used to classify legal documents (Mencía & Fürnkranz, n.d.). Meta-data usually has a hierarchical structure (as law area does). E.L. Mencía and J. Fürnkranz, however, make a flat prediction and hence do not use the hierarchical structure of the meta-data. This leads to the following question: How do hierarchical classification algorithms compare to flat classification algorithms with respect to the legal domain in terms of performance?

Different features can be extracted from the documents to serve as input for the classification algorithms. These have a substantial influence on the performance of classification models (arguably the most influence). Therefore the second question that will be addressed is: What kinds of features should be used to enhance the performance of classification models?

The overall question that will be answered is: How do hierarchical algorithms compare to flat classification algorithms with respect to the legal domain in terms of performance and what kinds of features enhance this performance?

A public dataset of approximately 160,000 labeled Dutch legal documents has been used to answer this question. This dataset is skewed, which means that there are considerably more documents for some categories than for others. Consequently this will influence the performance of the classification algorithms.

The next section (Section 2) describes related research that has been used to answer this question. The research method is explained in Section 3, and the results are shown and examined in Sections 4 and 5.

2 Related Literature

Research has shown that the performance of classification algorithms for legal documents is not high enough to be of practical use (Governatori, 2009). Governatori suggested that closer cooperation between legal and machine learning experts is needed to improve this performance, which demonstrates the need for this research.

Previous research in text classification of legal documents indicated that multilabel classification algorithms (flat classification) can be used to classify legal documents (Mencía & Fürnkranz, n.d.). That work addresses the legal domain and a task related to the classification of law area, so its approach could also apply to classification into law area. The multilabel algorithm used by E.L. Mencía and J. Fürnkranz is based on a support vector machine. Therefore the flat classification algorithm used in this research is also based on support vector machines. Fan et al. present a linear support vector machine that is efficient to train on large-scale problems (Fan et al., 2008). This classifier is used, since the dataset is large and some features could present sparsity problems.

Other research suggested that hierarchical classification methods can be used to classify documents into pre-defined topic hierarchies (Cai & Hofmann, 2004). Since most meta-data (including law area) has a hierarchical structure, such algorithms could be successful in classifying law area, and could therefore perform better in the legal domain than a flat classification algorithm. W. Bi and J.T. Kwok presented a hierarchical multilabel classification algorithm that can be used on both tree- and DAG-structured hierarchies (Bi & Kwok, 2011). They claim that it does not suffer from the problem of insufficient or skewed training data in classifier training. Since the available dataset is skewed, this is the primary reason that this algorithm is used and tested.

There are numerous features that can be extracted from text documents and used by classification algorithms. Research has demonstrated that Latent Dirichlet Allocation (LDA) has been successful in finding topics of legal documents (Raghuveer, 2012). Hoffman et al. developed an efficient Variational Bayes algorithm for LDA (Hoffman et al., 2010), which is used in this research. Other research has shown that word vectors can represent the meaning of a word with high accuracy (Mikolov et al., 2013). These features are extracted from the available dataset, and their effect on the performance of classification models is measured and compared with each other and with bag of words implementations.

3 Method

The overall approach to answering the research question is to adapt a flat classification algorithm and a hierarchical classification algorithm to this problem. Once implemented, they are trained and tested with different features. After these experiments the results are evaluated by comparing the performance of the classification algorithms and features with each other.
3.1 Classification Algorithms

The overall approach of every classification algorithm is the same. It requires training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a vector representing a data point (document) and $y_i$ is the class associated with that data point. Once the algorithm has used the training data to adapt its parameters, it can be used to classify an unseen data point into a class. It is a multilabel classification problem if a data point can be assigned more than one class at the same time.

In this research two multilabel classification algorithms are compared: a flat multilabel classification algorithm and a hierarchical multilabel classification algorithm. The main difference between them is that a flat classification algorithm does not use the structure of the labels that it classifies into, while the hierarchical one does. This should result in a difference in performance.

Even though the algorithms are different, they need some similarities to make this comparison possible. They should be able to classify into multiple labels, because a legal document can contain information about multiple law areas, which is often the case in practice. Furthermore, the performance of both algorithms should be measured in the same way. How the performance is measured and how the algorithms are compared is explained in the Evaluation section (3.3).

3.1.1 Flat classification

The flat classification algorithm used in this research follows a one-vs-all strategy. This strategy trains a classifier for every class: each classifier is fitted to separate its class from all the other classes. Thus if there are d different classes, d binary classifiers are trained. Given a data point, these classifiers calculate a confidence score for every class. Once all the confidence scores are calculated, the classification algorithm selects the classes with high confidence scores, which allows it to classify into multiple classes. Any classification algorithm that can calculate such a confidence score is applicable to the one-vs-all method.

For this research linear support vector machines are used in combination with the one-vs-all strategy to make a multilabel prediction. E.L. Mencía and J. Fürnkranz successfully used support vector machines in their research (Mencía & Fürnkranz, n.d.). Fan et al. describe this classification algorithm in their paper (Fan et al., 2008). The basic idea of this classifier is that it creates a hyperplane (or multiple hyperplanes) that separates the data points of different classes from each other. The distance of a data point to this hyperplane can be used to calculate a confidence score, which the one-vs-all method then uses to classify the data point.

Figure 1: An example hierarchy; the black nodes are leaves of the hierarchy. When classifying into the leaves the classes higher up in the hierarchy can also be obtained.

With this algorithm data points can be classified into multiple classes, although this does not necessarily imply that it is possible to classify into a pre-defined hierarchy. This can be achieved by classifying into the leaves of the hierarchy (see Figure 1), which requires the labels of every data point to contain a leaf.
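The selection step of the one-vs-all strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this research: the class names, weight vectors, biases and the zero threshold are hypothetical, and the signed score w·x + b stands in for the confidence score derived from the distance to the hyperplane.

```python
def decision_score(weights, bias, x):
    """Signed score w.x + b of one binary linear classifier."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def one_vs_all_predict(classifiers, x, threshold=0.0):
    """Return every class whose binary classifier scores above the threshold.

    classifiers maps a class name to a (weights, bias) pair, one pair per class,
    so d classes correspond to d binary classifiers.
    """
    scores = {label: decision_score(w, b, x) for label, (w, b) in classifiers.items()}
    return sorted(label for label, score in scores.items() if score > threshold)


# Hypothetical, hand-picked parameters for three classes (illustration only).
classifiers = {
    "civil law": ([1.0, -0.5], -0.2),
    "criminal law": ([-1.0, 1.0], 0.0),
    "tax law": ([0.5, 0.5], -0.1),
}

document = [1.0, 0.4]  # a two-dimensional feature vector
predicted = one_vs_all_predict(classifiers, document)
```

Because every class is scored independently, a document can receive several labels at once, which is what makes the strategy usable for multilabel prediction.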
The labels in the law area dataset do not always contain a leaf, so the flat multilabel classification algorithm used in this research classifies over all the classes in the hierarchy.

3.1.2 Hierarchical classification

Hierarchical classification algorithms use the predefined structure of the classes (the hierarchy). W. Bi and J.T. Kwok implemented such an algorithm (Bi & Kwok, 2011). Their algorithm requires a value L that represents the number of labels to be assigned to a data point. For this research L is set to 2, since most legal documents are labeled with two law areas. The algorithm is similar to the flat classification algorithm in that a linear support vector machine is trained for each class; however, it uses PCA to reduce the number of classifiers that have to be trained. This reduces the negative effects of skewed data, because all the classes in the projected space can learn from the whole training set. Once these classifiers have predicted their confidence values, the resulting matrix is projected back to its original class space.

Figure 2: Two iterations of CSSA: (a) initialized supernodes, (b) after one iteration, (c) after two iterations. A node in the hierarchy is black if its ψ is equal to 1. The blue bubbles around nodes in the hierarchy represent supernodes. The value within a node is the confidence score of the class.

After these confidence scores are calculated, a Condense Sort and Select Algorithm (CSSA) is used to classify the data point into the predefined hierarchy. If there are d different classes in the hierarchy, $\{\psi_i\}_{i=1}^{d}$, where $\psi_i \in \{0, 1\}$, represents the list of labels assigned to the data point. CSSA initializes $\psi_0 = 1$ (the root of the hierarchy) and makes every node its own supernode. A supernode is a list of nodes. It is assigned a supernode value (SNV), which is the average of the confidence values over all its constituent nodes (Bi & Kwok, 2011). CSSA sorts all the supernodes by SNV and selects the supernode with the highest SNV. If the ψ value of the parent of the supernode is equal to 1, the ψ values of all nodes in the supernode are set to 1; otherwise the supernode is condensed with its parent supernode (see Figure 2). This process continues until the number of assigned nodes is greater than or equal to L.

3.2 Features

The previously described classification algorithms require vectors that represent the data points. The available dataset comprises legal documents, so every data point is a document. There are multiple techniques to extract these vectors from documents. The next subsections describe the different features that are used in this research. Once those features are extracted from the data they can be compared (further explained in Section 3.3).
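Before turning to the individual features, the condense-and-select loop of CSSA from Section 3.1.2 can be made concrete with a small sketch. It follows the textual description given there rather than the full algorithm of Bi & Kwok (2011): it handles tree hierarchies only, the root is treated as already selected and is not counted towards L (an assumption), a whole supernode is always assigned at once, and ties are broken by insertion order. The toy hierarchy and confidence scores are invented for illustration.

```python
def cssa(parent, confidence, L):
    """Select labels in a tree hierarchy with a simplified CSSA loop.

    parent maps each node to its parent (the root maps to None);
    confidence maps each non-root node to its confidence score.
    """
    root = next(node for node, par in parent.items() if par is None)
    selected = {root}  # psi = 1 for the root from the start
    # Every non-root node starts as its own supernode, keyed by its top node.
    members = {node: [node] for node in parent if node != root}
    owner = {node: node for node in members}  # node -> top node of its supernode

    def snv(top):
        nodes = members[top]
        return sum(confidence[n] for n in nodes) / len(nodes)

    assigned = 0  # assumption: the root does not count towards L
    while assigned < L and members:
        top = max(members, key=snv)  # supernode with the highest SNV
        if parent[top] in selected:
            # The parent already has psi = 1: assign the whole supernode.
            selected.update(members[top])
            assigned += len(members[top])
            del members[top]
        else:
            # Otherwise condense the supernode with its parent's supernode.
            parent_top = owner[parent[top]]
            members[parent_top].extend(members[top])
            for node in members[top]:
                owner[node] = parent_top
            del members[top]
    return selected - {root}


# Toy hierarchy: two main classes (A, B) with sub classes (A1, A2, B1).
parent = {"root": None, "A": "root", "B": "root", "A1": "A", "A2": "A", "B1": "B"}
confidence = {"A": 0.3, "A1": 0.9, "A2": 0.1, "B": 0.5, "B1": 0.2}
labels = cssa(parent, confidence, L=2)
```

In this toy run the confident sub class A1 is first condensed with its parent A, and the merged supernode then outranks B, so A and A1 are assigned together. This shows how a strong score lower in the hierarchy can pull in its ancestor, so that the selected labels always form a path from the root.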
3.2.1 Term Frequency

A basic approach to turning a document into a vector is to retrieve the Term Frequency (TF) of the words (terms) in the document. This is also known as a Bag of Words (BOW) model. Let k be the number of distinct terms in a corpus. A BOW model calculates a k-dimensional vector that represents a given document. Every dimension in this vector represents a term, and the value of that dimension is equal to the number of times the term occurs in the document. Terms that have not been seen by the BOW model are discarded. Consequently the BOW model should cover most (if not all) of the terms in the document to prevent information loss.

Note, however, that a term does not necessarily have to be a word; it could also be a sequence of words. When creating a BOW model the size of this sequence has to be defined. If this size is equal to one the terms are called unigrams. This research only covers unigrams, even though bigrams and trigrams could be used as well. In more extensive research the effects of different n-grams on document classification into law area could be explored.

3.2.2 Term Frequency-Inverse Document Frequency

Some words in Dutch text do not carry much meaning (e.g. de, het, een), yet TF gives these terms a very high value. Therefore an adjustment is made to give meaningful terms a higher value than terms with little meaning, which results in Term Frequency-Inverse Document Frequency (TF-IDF). This is achieved by extracting the term counts and dividing them by the document frequency of that term. The document frequency of a term is the number of documents in which the term occurs. Dutch words like de and het occur in almost all documents, so the value of these terms decreases significantly. Consequently TF-IDF should give a more precise representation of a document than TF.

3.2.3 Word Vector

Besides term frequency, there are other approaches to representing a document in a vector space. Mikolov et al. proposed two methods that efficiently estimate the meaning of words in vector space (Mikolov et al., 2013). One of those methods is the Continuous Bag of Words model (CBOW), which is used in this research. It uses the context of a word to estimate its meaning.

Figure 3: Overview of the Continuous Bag of Words model

CBOW contains three layers (see Figure 3). In the input layer 1-of-V codings of the surrounding words are created. The projection layer projects them into the same position. The output layer then uses this combined projection to predict the current word, and the projection weights learned in this way form the word vectors. The projection layer can project the words into vectors of different sizes, which could influence the classification process. Therefore different word vector sizes are tested and compared in this research.
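The three CBOW layers can be illustrated with a single forward pass. This is a simplified sketch following the architecture in Mikolov et al. (2013), not the model trained in this research: training (updating the weights) is omitted, a plain softmax replaces the more efficient output layer used in practice, and the vocabulary size, vector size and all weight values are made up.

```python
import math


def cbow_forward(context_ids, w_in, w_out):
    """One CBOW forward pass: context word ids -> (projection, word probabilities)."""
    dim = len(w_in[0])
    # Multiplying a 1-of-V coding by w_in selects a single row, so rows are looked up directly.
    rows = [w_in[i] for i in context_ids]
    # Projection layer: all context words share one position, so their rows are averaged.
    hidden = [sum(row[j] for row in rows) / len(rows) for j in range(dim)]
    # Output layer: one score per vocabulary word, normalised with a softmax.
    scores = [sum(h * w for h, w in zip(hidden, out_vec)) for out_vec in w_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


# Made-up weights for a vocabulary of 4 words and word vectors of size 2.
w_in = [[0.2, 0.4], [0.6, 0.0], [0.0, 0.8], [0.4, 0.4]]    # one row per word: the word vectors
w_out = [[1.0, 0.0], [0.0, 1.5], [1.0, 1.0], [-1.0, 0.5]]  # one output vector per word

hidden, probabilities = cbow_forward([0, 2], w_in, w_out)
predicted_word = probabilities.index(max(probabilities))
```

The number of columns of `w_in` is the word vector size, which is the quantity varied in the experiments.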
There are 32 leaves in the law area hierarchy; consequently the lowest word vector size that is evaluated is 32. The other sizes are 50, 100 and 200.

3.2.4 Latent Dirichlet Allocation

The final feature analysed in this research is Latent Dirichlet Allocation (LDA). This method assumes that every document can be seen as a mixture of multiple topics. Hoffman et al. developed a Variational Bayes (VB) algorithm for LDA (Hoffman et al., 2010). Given K different topics and a set of documents, an LDA model can be trained. For every distinct word in the corpus, K topic weights are estimated using the training set. These are then used to calculate a probability for each topic given a document. The probabilities are put in a vector that can be used by classification algorithms. Similar to CBOW, the dimensionality of LDA depends on the chosen number of topics (K). Different values of K are evaluated. As with word vectors, the lowest K that is trained is 32; LDA models with a K of 50, 100 and 200 are trained and tested as well.

3.3 Evaluation

Once all the models and algorithms are in place they can be evaluated. This evaluation is divided into two parts: the evaluation of the classification algorithms and the evaluation of the features. Both evaluations use the same performance measurements:

- Accuracy
- Main accuracy
- Sub accuracy
- Precision (micro/macro)
- Recall (micro/macro)
- F1-score (micro/macro)
- Training time
- Testing time

The hierarchy of law areas contains two layers, the main layer and the sub layer (see Figure 4). Main accuracy represents the accuracy on the law areas in the main layer and sub accuracy the accuracy on the law areas in the sub layer. The difference between macro and micro metrics is that macro calculates the metrics for each label separately and takes the average of those values, while micro calculates them globally over all labels.
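The difference between the two averaging schemes can be shown with a small worked example. This is a minimal sketch: the two labels and their true positive, false positive and false negative counts are invented, and macro F1 is taken as the average of the per-label F1-scores, in line with the description above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro(counts):
    """counts maps each label to a (tp, fp, fn) tuple; returns (micro, macro) metric triples."""
    per_label = [precision_recall_f1(*c) for c in counts.values()]
    # Macro: compute the metrics per label, then average them.
    macro = tuple(sum(m[i] for m in per_label) / len(per_label) for i in range(3))
    # Micro: pool the counts of all labels, then compute the metrics once.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = precision_recall_f1(*totals)
    return micro, macro


# Invented counts: one well-represented label and one rare, poorly predicted label.
counts = {"frequent area": (90, 10, 10), "rare area": (1, 4, 9)}
micro, macro = micro_macro(counts)
```

In this example the rare label contributes little to the pooled counts but counts for half of the macro average, so the macro F1 (about 0.52) ends up well below the micro F1 (about 0.85).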
Due to the skewed data the macro metrics will probably be significantly lower than the micro metrics, since the algorithm receives less information about certain law areas. Consequently the performance on these law areas will probably be lower than on the other law areas. Together these measurements provide insight into the performance of the classification algorithm.

To prevent the measured performance from depending on one particular train/test split, k-fold cross validation is applied. The whole dataset is divided into 5 sets. For every algorithm and feature combination five models are trained and tested, each time on a different train and test set. Once these models are trained their performance is measured and the average over all five models is taken. This process is repeated for multiple train sizes, which provides insight into the effect of the train size on the performance of the algorithm. Even though the largest training set will probably perform best, this indicates whether a very large training set is necessary to achieve good performance.
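The cross-validation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this research: documents are referred to by index, no shuffling is applied, and `evaluate` is a hypothetical callback that trains one model on the train indices and returns its score on the test indices.

```python
def k_fold_splits(n_documents, k=5):
    """Yield (train_indices, test_indices) pairs; every fold serves as the test set once."""
    indices = list(range(n_documents))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_documents // k + (1 if i < n_documents % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_documents, evaluate, k=5):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_documents, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(10, k=5))
# Dummy callback for demonstration: the "score" is just the fraction of documents held out.
average = cross_validate(10, lambda train, test: len(test) / 10)
```

To study smaller train sizes, the `train` list of each fold could be truncated to the desired fraction before a model is fitted, while the test fold stays the same.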

Figure 4: Law area hierarchy

Once all these performance values are obtained, the flat and hierarchical classification algorithms are compared per feature. This should provide insight into the difference between flat and hierarchical classification algorithms with respect to the legal domain in terms of performance.

The evaluation of the features is more complicated, since the dimensionality of the features differs. Ideally the features would turn every document into vectors of the same size. LDA and word vectors can adjust their size, and the same could be achieved for TF and TF-IDF by applying principal component analysis, although this would cause information loss and therefore suboptimal use of those features. Consequently the best LDA and word vector models for the flat and hierarchical classification algorithms are used for the evaluation of the features. The features are compared for each classification algorithm separately: the results from the flat classification algorithm are compared with one another, and likewise for the hierarchical classification algorithm.

4 Results

Several experiments were conducted during this research. Every algorithm and feature extraction method was implemented in Python. All the measurements that were taken are presented in Appendix A. The largest train set that was used contained approximately 22,000 documents, which is 50% of the available train set. The other models were trained on train set sizes of 1%, 10%, 20%, 30% and 40%. As the size increases, the performance and train/test time increase as well, although this increase is significantly smaller beyond a train set size of 40% (see Appendix A). For all models the macro metrics are significantly lower than the micro metrics. This is probably caused by the skewness of the data (see Table 6).

The overall performance of the hierarchical classification algorithm is higher than that of the flat classification algorithm (see Appendix A). The accuracies, especially the sub accuracy, of the hierarchical classification algorithm are significantly higher, although the micro and macro metrics are similar. With TF-IDF features trained on 50% of the train set, for instance, the sub accuracy is 0.970 for the hierarchical algorithm against 0.951 for the flat algorithm (see Table 6).

The performance of the different word vector and LDA models on the flat and hierarchical classification algorithms was compared using models trained on 50% of the train set. Their differences are displayed in Figures 5 and 6. These graphs indicate that performance increases with dimensionality, although LDA200 performs only slightly better than LDA100 (see Table 6). The training time of LDA200 is the only value that is significantly higher than that of LDA100.

Figure 5: WORDVEC performance measurements (panels: Flat and Hierarchical; x-axis: Dimensionality, with values 32, 50, 100 and 200). The yellow bars represent the accuracy values, the blue bars the micro F1-scores and the red bars the macro F1-scores.

Figure 6: LDA performance measurements (panels: Flat and Hierarchical; x-axis: Dimensionality, with values 32, 50, 100 and 200). The yellow bars represent the accuracy values, the blue bars the micro F1-scores and the red bars the macro F1-scores.

5 Conclusion

The results from this research suggest that hierarchical classification algorithms perform better than flat classification algorithms in the legal domain, although they are more time-consuming. Similarly, Term Frequency-Inverse Document Frequency enhances these algorithms the most but takes more time to train and test than the other features. The difference between the word vector model and TF-IDF is not substantial except for the train time: the word vector model takes less time to train. Train time should not weigh heavily in this comparison, however, since an algorithm only has to be trained once. Consequently this research indicates that, of the features addressed, TF-IDF enhances the performance of classification algorithms the most. It is nevertheless hard to compare TF and TF-IDF with the other features, since the dimensionality differs. The vectors produced by TF-IDF are significantly larger than those of LDA or word vector models, so these features have more room to express themselves.

This research provides some insight into the classification of legal documents. However, there are still gaps that could be explored. In future work other features could be evaluated, new features (e.g. Named Entities) could be explored, or the features discussed in this research could be combined; a combination of different features may perform even better. Furthermore, different flat and hierarchical classification algorithms could be compared to strengthen the claims made in this research, provided that the same conclusions can be drawn from their results.

6 Appendix A

The following tables display all the results (Tables 1 to 6).

Model       Acc.      Main acc. Sub acc.  Micro P   Micro R   Micro F1  Macro P   Macro R   Macro F1  Train (s) Test (s)
Flat
TF          0.780     0.954     0.783     0.928     0.882     0.904     0.300     0.261     0.267     0.454     0.373
TF-IDF      0.834     0.956     0.834     0.967     0.895     0.929     0.338     0.242     0.266     0.765     0.671
WORDVEC32   0.780     0.983     0.782     0.953     0.896     0.923     0.349     0.285     0.295     0.106     0.111
WORDVEC50   0.826     0.989     0.827     0.963     0.917     0.939     0.431     0.323     0.338     0.113     0.116
WORDVEC100  0.862     0.987     0.863     0.975     0.929     0.951     0.398     0.308     0.330     0.144     0.165
WORDVEC200  0.881     0.989     0.881     0.982     0.938     0.959     0.443     0.338     0.361     0.241     0.273
LDA32       0.886     0.988     0.886     0.988     0.937     0.962     0.393     0.312     0.334     0.135     0.160
LDA50       0.877     0.986     0.877     0.984     0.931     0.957     0.410     0.290     0.313     0.116     0.163
LDA100      0.859     0.974     0.859     0.980     0.917     0.947     0.379     0.280     0.303     0.134     0.195
LDA200      0.839     0.973     0.839     0.983     0.906     0.942     0.384     0.271     0.297     0.246     0.249
Hierarchical
TF          0.867     0.957     0.867     0.912     0.912     0.912     0.337     0.284     0.290     7.084     3.705
TF-IDF      0.905     0.957     0.905     0.931     0.931     0.931     0.385     0.306     0.320     6.659     3.728
WORDVEC32   0.882     0.967     0.882     0.924     0.924     0.924     0.345     0.330     0.315     0.465     2.544
WORDVEC50   0.900     0.977     0.900     0.938     0.939     0.938     0.369     0.354     0.347     0.506     3.535
WORDVEC100  0.919     0.980     0.919     0.950     0.950     0.950     0.409     0.385     0.384     0.710     3.380
WORDVEC200  0.929     0.980     0.929     0.954     0.954     0.954     0.402     0.396     0.392     1.191     3.712
LDA32       0.936     0.982     0.936     0.959     0.959     0.959     0.406     0.388     0.386     0.193     2.522
LDA50       0.925     0.980     0.925     0.953     0.953     0.953     0.389     0.372     0.367     0.400     3.589
LDA100      0.919     0.973     0.919     0.946     0.946     0.946     0.412     0.361     0.364     0.706     3.892
LDA200      0.921     0.977     0.921     0.949     0.949     0.949     0.412     0.356     0.363     1.369     3.913

Table 1: Results of models trained on 1% of the whole train set

Model       Acc.      Main acc. Sub acc.  Micro P   Micro R   Micro F1  Macro P   Macro R   Macro F1  Train (s) Test (s)
Flat
TF          0.894     0.991     0.895     0.968     0.954     0.961     0.468     0.429     0.435     4.955     0.512
TF-IDF      0.918     0.995     0.918     0.994     0.957     0.975     0.514     0.391     0.421     6.348     0.965
WORDVEC32   0.820     0.987     0.822     0.959     0.919     0.938     0.469     0.357     0.378     0.587     0.104
WORDVEC50   0.863     0.994     0.863     0.972     0.938     0.955     0.503     0.409     0.427     0.743     0.121
WORDVEC100  0.914     0.995     0.915     0.983     0.960     0.972     0.568     0.475     0.493     0.965     0.177
WORDVEC200  0.926     0.997     0.927     0.986     0.965     0.976     0.615     0.480     0.508     2.098     0.321
LDA32       0.919     0.995     0.919     0.989     0.958     0.973     0.439     0.381     0.397     0.580     0.188
LDA50       0.924     0.996     0.924     0.988     0.961     0.974     0.494     0.411     0.439     0.625     0.176
LDA100      0.921     0.993     0.921     0.990     0.958     0.973     0.511     0.409     0.435     1.082     0.225
LDA200      0.918     0.995     0.918     0.991     0.957     0.973     0.532     0.401     0.435     2.767     0.530
Hierarchical
TF          0.938     0.988     0.938     0.963     0.963     0.963     0.433     0.452     0.434     68.850    3.738
TF-IDF      0.955     0.988     0.955     0.971     0.971     0.971     0.479     0.470     0.470     44.496    3.846
WORDVEC32   0.902     0.977     0.902     0.940     0.940     0.940     0.412     0.406     0.394     4.372     3.217
WORDVEC50   0.921     0.986     0.921     0.953     0.954     0.954     0.422     0.436     0.420     6.716     3.370
WORDVEC100  0.948     0.991     0.948     0.969     0.969     0.969     0.430     0.460     0.435     6.382     3.028
WORDVEC200  0.953     0.993     0.953     0.973     0.973     0.973     0.442     0.478     0.450     11.176    3.672
LDA32       0.950     0.990     0.950     0.970     0.970     0.970     0.436     0.459     0.439     2.224     3.749
LDA50       0.955     0.992     0.955     0.974     0.974     0.974     0.472     0.492     0.469     3.419     3.671
LDA100      0.954     0.992     0.954     0.973     0.973     0.973     0.466     0.482     0.467     6.995     3.923
LDA200      0.952     0.994     0.952     0.973     0.973     0.973     0.474     0.484     0.468     11.177    3.900

Table 2: Results of models trained on 10% of the whole train set

Model       Acc.      Main acc. Sub acc.  Micro P   Micro R   Micro F1  Macro P   Macro R   Macro F1  Train (s) Test (s)
Flat
TF          0.918     0.993     0.919     0.975     0.965     0.970     0.512     0.495     0.494     11.229    0.557
TF-IDF      0.935     0.997     0.935     0.995     0.967     0.980     0.612     0.436     0.472     10.345    0.849
WORDVEC32   0.819     0.988     0.821     0.962     0.919     0.940     0.493     0.379     0.401     1.561     0.121
WORDVEC50   0.867     0.994     0.868     0.973     0.941     0.957     0.511     0.415     0.437     2.132     0.142
WORDVEC100  0.916     0.997     0.916     0.984     0.962     0.973     0.597     0.494     0.520     2.264     0.168
WORDVEC200  0.936     0.998     0.937     0.987     0.971     0.979     0.611     0.526     0.541     4.872     0.360
LDA32       0.923     0.996     0.924     0.988     0.961     0.975     0.447     0.390     0.406     1.217     0.214
LDA50       0.928     0.996     0.928     0.988     0.963     0.975     0.521     0.434     0.460     1.394     0.186
LDA100      0.930     0.994     0.930     0.989     0.963     0.976     0.534     0.434     0.465     2.182     0.235
LDA200      0.926     0.996     0.926     0.991     0.962     0.976     0.556     0.435     0.468     3.203     0.309
Hierarchical
TF          0.952     0.991     0.952     0.972     0.972     0.972     0.451     0.481     0.457     148.869   3.765
TF-IDF      0.965     0.993     0.965     0.979     0.979     0.979     0.488     0.505     0.491     67.103    3.557
WORDVEC32   0.906     0.979     0.906     0.943     0.943     0.943     0.412     0.412     0.401     11.517    2.911
WORDVEC50   0.927     0.989     0.927     0.958     0.958     0.958     0.429     0.450     0.430     15.973    3.171
WORDVEC100  0.951     0.994     0.951     0.973     0.973     0.973     0.469     0.507     0.481     18.328    3.527
WORDVEC200  0.959     0.996     0.959     0.978     0.978     0.978     0.465     0.508     0.477     32.261    3.697
LDA32       0.954     0.993     0.954     0.974     0.974     0.974     0.447     0.472     0.448     5.698     3.834
LDA50       0.954     0.994     0.954     0.974     0.974     0.974     0.473     0.497     0.473     8.773     3.738
LDA100      0.960     0.994     0.960     0.977     0.977     0.977     0.480     0.501     0.484     12.746    3.768
LDA200      0.956     0.995     0.956     0.975     0.975     0.975     0.471     0.505     0.477     24.096    5.103

Table 3: Results of models trained on 20% of the whole train set

Model       Acc.      Main acc. Sub acc.  Micro P   Micro R   Micro F1  Macro P   Macro R   Macro F1  Train (s) Test (s)
Flat
TF          0.926     0.993     0.927     0.978     0.969     0.973     0.519     0.501     0.501     19.192    0.601
TF-IDF      0.943     0.997     0.944     0.995     0.971     0.983     0.615     0.461     0.496     13.526    0.793
WORDVEC32   0.824     0.988     0.826     0.963     0.920     0.941     0.495     0.381     0.398     2.708     0.111
WORDVEC50   0.872     0.995     0.873     0.974     0.944     0.959     0.547     0.447     0.468     3.763     0.141
WORDVEC100  0.918     0.996     0.919     0.984     0.964     0.974     0.616     0.517     0.534     3.542     0.169
WORDVEC200  0.942     0.998     0.942     0.988     0.974     0.981     0.618     0.514     0.543     7.926     0.343
LDA32       0.926     0.996     0.926     0.988     0.963     0.975     0.457     0.399     0.413     1.864     0.208
LDA50       0.931     0.997     0.931     0.988     0.965     0.976     0.514     0.439     0.464     2.098     0.171
LDA100      0.935     0.995     0.935     0.990     0.966     0.978     0.538     0.443     0.471     3.221     0.236
LDA200      0.932     0.997     0.932     0.991     0.965     0.978     0.582     0.452     0.486     6.446     0.478
Hierarchical
TF          0.956     0.994     0.956     0.975     0.975     0.975     0.459     0.495     0.469     205.129   3.459
TF-IDF      0.966     0.995     0.966     0.981     0.981     0.981     0.493     0.506     0.492     92.846    3.225
WORDVEC32   0.906     0.978     0.906     0.942     0.942     0.942     0.420     0.421     0.405     17.181    2.451
WORDVEC50   0.931     0.990     0.931     0.960     0.960     0.960     0.443     0.472     0.447     25.030    2.997
WORDVEC100  0.952     0.995     0.952     0.973     0.973     0.973     0.463     0.496     0.471     26.004    3.059
WORDVEC200  0.960     0.997     0.960     0.978     0.978     0.978     0.463     0.511     0.476     55.989    3.670
LDA32       0.955     0.993     0.955     0.974     0.974     0.974     0.451     0.484     0.457     10.044    3.738
LDA50       0.956     0.995     0.956     0.976     0.976     0.976     0.474     0.507     0.481     14.189    3.922
LDA100      0.960     0.994     0.960     0.977     0.977     0.977     0.474     0.502     0.482     20.971    3.570
LDA200      0.958     0.995     0.958     0.976     0.976     0.976     0.474     0.509     0.481     43.170    5.397

Table 4: Results of models trained on 30% of the whole train set

Model       Acc.      Main acc. Sub acc.  Micro P   Micro R   Micro F1  Macro P   Macro R   Macro F1  Train (s) Test (s)
Flat
TF          0.931     0.994     0.933     0.979     0.972     0.975     0.517     0.523     0.509     25.066    0.586
TF-IDF      0.948     0.998     0.948     0.995     0.973     0.984     0.636     0.489     0.523     17.188    0.754
WORDVEC32   0.824     0.988     0.826     0.963     0.920     0.941     0.500     0.392     0.411     4.071     0.121
WORDVEC50   0.872     0.995     0.872     0.974     0.944     0.959     0.520     0.445     0.461     5.127     0.152
WORDVEC100  0.922     0.997     0.923     0.984     0.966     0.975     0.607     0.516     0.539     5.426     0.180
WORDVEC200  0.943     0.998     0.944     0.988     0.975     0.981     0.624     0.537     0.558     11.111    0.348
LDA32       0.927     0.996     0.927     0.989     0.964     0.976     0.467     0.407     0.418     2.061     0.180
LDA50       0.934     0.997     0.935     0.988     0.967     0.977     0.517     0.454     0.476     2.896     0.196
LDA100      0.937     0.996     0.938     0.990     0.968     0.979     0.566     0.463     0.491     4.165     0.231
LDA200      0.934     0.997     0.934     0.991     0.967     0.979     0.588     0.461     0.493     8.546     0.396
Hierarchical
TF          0.958     0.994     0.958     0.976     0.976     0.976     0.459     0.505     0.470     269.048   3.442
TF-IDF      0.969     0.997     0.969     0.983     0.983     0.983     0.484     0.519     0.496     107.602   3.002
WORDVEC32   0.908     0.979     0.908     0.944     0.944     0.944     0.418     0.421     0.408     27.996    2.474
WORDVEC50   0.930     0.990     0.930     0.960     0.960     0.960     0.438     0.456     0.437     38.102    2.908
WORDVEC100  0.953     0.995     0.953     0.974     0.974     0.974     0.463     0.495     0.473     37.329    3.001
WORDVEC200  0.959     0.997     0.959     0.978     0.978     0.978     0.463     0.514     0.477     80.124    3.825
LDA32       0.956     0.994     0.956     0.975     0.975     0.975     0.449     0.480     0.455     13.451    3.686
LDA50       0.955     0.995     0.955     0.975     0.975     0.975     0.462     0.503     0.471     17.820    3.827
LDA100      0.960     0.995     0.960     0.978     0.978     0.978     0.473     0.504     0.481     25.724    3.590
LDA200      0.959     0.996     0.959     0.978     0.978     0.978     0.473     0.515     0.482     44.016    3.822

Table 5: Results of models trained on 40% of the whole train set

Model       Acc.      Main acc. Sub acc.  Micro P   Micro R   Micro F1  Macro P   Macro R   Macro F1  Train (s) Test (s)
Flat
TF          0.935     0.995     0.936     0.981     0.973     0.977     0.555     0.536     0.535     30.384    0.595
TF-IDF      0.951     0.998     0.951     0.995     0.975     0.985     0.654     0.502     0.537     22.475    0.827
WORDVEC32   0.825     0.988     0.827     0.964     0.921     0.942     0.517     0.395     0.415     5.941     0.121
WORDVEC50   0.874     0.995     0.875     0.974     0.945     0.959     0.552     0.459     0.477     6.933     0.160
WORDVEC100  0.923     0.997     0.924     0.985     0.966     0.975     0.642     0.535     0.560     6.846     0.193
WORDVEC200  0.946     0.998     0.947     0.988     0.976     0.982     0.671     0.567     0.589     14.348    0.384
LDA32       0.928     0.996     0.928     0.989     0.964     0.976     0.478     0.405     0.419     2.658     0.175
LDA50       0.934     0.997     0.935     0.988     0.968     0.978     0.518     0.454     0.477     3.531     0.185
LDA100      0.940     0.996     0.940     0.990     0.969     0.979     0.561     0.464     0.489     8.172     0.393
LDA200      0.938     0.998     0.938     0.991     0.969     0.980     0.620     0.478     0.511     9.023     0.406
Hierarchical
TF          0.958     0.995     0.958     0.977     0.977     0.977     0.453     0.499     0.464     1020.509  4.107
TF-IDF      0.970     0.997     0.970     0.983     0.983     0.983     0.481     0.516     0.493     152.613   3.244
WORDVEC32   0.907     0.979     0.907     0.943     0.943     0.943     0.412     0.411     0.399     40.441    2.452
WORDVEC50   0.929     0.990     0.929     0.959     0.959     0.959     0.441     0.469     0.446     52.206    2.937
WORDVEC100  0.952     0.995     0.952     0.974     0.974     0.974     0.458     0.507     0.469     46.781    3.028
WORDVEC200  0.963     0.997     0.963     0.980     0.980     0.980     0.468     0.521     0.484     104.081   3.755
LDA32       0.955     0.994     0.955     0.975     0.975     0.975     0.446     0.478     0.452     16.991    3.690
LDA50       0.957     0.996     0.957     0.976     0.976     0.976     0.469     0.511     0.479     23.656    3.873
LDA100      0.961     0.995     0.961     0.978     0.978     0.978     0.473     0.504     0.481     37.145    3.784
LDA200      0.961     0.996     0.961     0.979     0.979     0.979     0.476     0.518     0.484     132.038   3.887

Table 6: Results of models trained on 50% of the whole train set

References

Bi, W., & Kwok, J. T. (2011). Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 17-24).

Cai, L., & Hofmann, T. (2004). Hierarchical document categorization with support vector machines. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (pp. 78-87).

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9, 1871-1874.

Governatori, G. (2009). Exploiting properties of legislative texts to improve classification accuracy. In Legal Knowledge and Information Systems: JURIX 2009, the Twenty-Second Annual Conference (Vol. 205, p. 136).

Hoffman, M., Bach, F. R., & Blei, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (pp. 856-864).

Mencía, E. L., & Fürnkranz, J. (n.d.). Efficient multilabel classification algorithms for large-scale problems in the legal domain.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Raghuveer, K. (2012). Legal documents clustering using latent Dirichlet allocation. IAES Int. J. Artif. Intell, 2(1), 34-37.