LSHTC: A Benchmark for Large-Scale Classification

Size: px

Start display at page:

Download "LSHTC: A Benchmark for Large-Scale Classification"

Christopher Reynolds
6 years ago
Views:

1 LSHTC: A Benchmark for Large-Scale Classification I. Partalas, A. Kosmopoulos,, G. Paliouras, E. Gaussier, I. Androutsopoulos, T. Artières, P. Gallinari, Massih-Reza Amini, Nicolas Baskiotis Lab. d Informatique de Grenoble & Grenoble University, France National Center for Scientific Research Demokritos, Greece Athens University of Economics and Business, Greece Lab. d informatique de Paris 6, France ICML 2014, Beijing 25th June, 2014 ICML The LSHTC Challenge series 1 / 19

2 The LSHTC Challenge Introduction A series of competitions on large-scale hierarchical classification First edition started in editions were organized till 2014 Participation of around 200 teams in total Goal: provide a standard benchmark for large-scale classification ICML The LSHTC Challenge series 2 / 19

3 Motivation Introduction Many systems use a structured organization of the data Taxonomy #doc. #catg. #editors DMOZ 5 M 1 M 100 k Wikipedia 32 M 600 k 129 k PubMed/MeSH 20 M 27 k - IPC 32 M 70 k - Root Arts Sports Movies Video Tennis Soccer Players Fun ICML The LSHTC Challenge series 3 / 19

4 Introduction The Problems in Large Scale Classification (1/2) Large number of parameters #cat #stems #docs cat/doc max path DMOZ 27, , ,992 1,02 5 Wiki Small 36, , ,148 1,86 10 Wiki Large 325,056 1,617,899 2,817, For DMOZ: 27, ,158 = 16,508,680,039 parameters should be learned 123Gb of memory ICML The LSHTC Challenge series 4 / 19

5 Introduction The Problems in Large Scale Classification (2/2) A lot of rare classes Distribution of documents in DMOZ nb of categories with n i >n α = category size n ICML The LSHTC Challenge series 5 / 19

6 Data format The Competitions Text from Web pages Pre-processing Stemming/lemmatization Stop-word removal Extraction of term-frequencies LibSVM format : : : : : ,45, : : : : : ICML The LSHTC Challenge series 6 / 19

7 LSHTC1 ( ) Data source: ODP Web directory Tracks: Basic, Cheap, Expensive, Full Single-label data Hierarchy type: tree Max number of categories: 12,000 Training/Test docs: 93,805/34,880 ICML The LSHTC Challenge series 7 / 19

8 LSHTC2 ( ) Data sources: ODP Web directory and Wikipedia Multi-label data Hierarchy type: tree and DAG Tracks (classes/documents/features) DMOZ: 27K/394K/594K Wikipedia small: 32K/456K/346K Wikipedia large: 325K/2.3M/1.6M (hierarchy is a graph) ICML The LSHTC Challenge series 8 / 19

9 Statistics on the Datasets #cat #stems #docs cat/doc max path DMOZ 27, , ,992 1,02 5 Wiki Small 36, , ,148 1,86 10 Wiki Large 325,056 1,617,899 2,817, ICML The LSHTC Challenge series 9 / 19

10 LSHTC3 ( ) Data sources: ODP Web directory and Wikipedia Large Scale Hierarchical Classification: Wikipedia small and large from LSHTC2 Multi-task Learning: DMOZ and Wikipedia datasets Refinement Learning: Expand hierarchy by splitting the nodes (based on DMOZ) ICML The LSHTC Challenge series 10 / 19

11 Track 1: Large Scale Hierarchical Classification 2 versions of Wikipedia dataset Task 1: medium-size (36,500 classes) Original text data Pre-processed data Task 2: large Wikipedia (325,000 classes) Multi-label and hierarchy is DAG ICML The LSHTC Challenge series 11 / 19

12 Track 2: Multi-task learning Wikipedia and DMOZ datasets Common feature space 12,000 classes for each dataset Single-label, hierarchy: DAG for Wikipedia and tree for DMOZ Goal: Leverage shared information in order to improve performance on each task Test sets are provided for both datasets. DMOZ Wikipedia shared info ICML The LSHTC Challenge series 12 / 19

13 Track 3: Refinement Learning Task 1: semi-supervised A reduced (12,000 classes) and an expanded (14,000 classes) hierarchy is available Two documents are given for each expanded class Goal: to reassign test documents to the new classes Root Books Music Comics Poetry Rock Jazz Funky Fusion ICML The LSHTC Challenge series 13 / 19

14 Track 3: Refinement Learning Task 1: semi-supervised A reduced (12,000 classes) and an expanded (14,000 classes) hierarchy is available Two documents are given for each expanded class Goal: to reassign test documents to the new classes Root Books Music Comics Poetry Rock Jazz Funky Root Books Music Fusion Task 2: unsupervised Only the reduced hierarchy is given Goal: expand the hierarchy Rock Jazz ICML The LSHTC Challenge series 13 / 19

15 Track 3: Refinement Learning Task 1: semi-supervised A reduced (12,000 classes) and an expanded (14,000 classes) hierarchy is available Two documents are given for each expanded class Goal: to reassign test documents to the new classes Root Books Music Comics Poetry Rock Jazz Funky Root Books Music Fusion Task 2: unsupervised Only the reduced hierarchy is given Goal: expand the hierarchy Rock Jazz ICML The LSHTC Challenge series 13 / 19

16 LSHTC The Competitions LSHTC4 organized within Kaggle One track: Very Large Scale Supervised Learning Based on the Wikipedia large dataset (325K classes) ICML The LSHTC Challenge series 14 / 19

17 More Info The Competitions Challenge site: You can download data and use the oracles for testing your systems In the near future all datasets will be released with ground truth answers Goal Establish standard benchmark datasets for large-scale classification ICML The LSHTC Challenge series 15 / 19

18 BioASQ ( A challenge on large-scale semantic indexing and question answering European FP7 project: Special Support Action Two tracks: 1 Large-Scale Semantic Indexing of Biomedical Documents 2 Question Answering for Biomedicine ICML The LSHTC Challenge series 16 / 19

19 Large-Scale Semantic Indexing (1/3) Classify articles from PubMed (biomedical journals) Test is performed in batches 5,000 articles per test set ICML The LSHTC Challenge series 17 / 19

20 Large-Scale Semantic Indexing (2/3) Field name Description JSON type pmid PubMed identifier string year Year of publication string journal Journal of publication string abstracttext Full text of the abstract string title Title of the article string meshmajor Mesh labels of the article array of strings Table : Description of the properties of the data for Task1a. ICML The LSHTC Challenge series 18 / 19

21 Large-Scale Semantic Indexing (3/3) Task 1a Task 2a Task 2a (sample) Articles 10,876,004 12,628,968 4,458,300 Unique labels 26,563 26,831 26,631 Labels per article Size in GB Table : Basic statistics about the training data for Task 1a and Task 2a. ICML The LSHTC Challenge series 19 / 19

Classification in a large number of categories

Classification in a large number of categories T. Artières, Joint work with the MLIA team at LIP6, the AMA team at LIG and Demokritos Lab. d informatique de Paris 6, France Lab. d Informatique de Grenoble