Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web

Similar documents
Feature LDA: A Supervised Topic Model for Automatic Detection of Web API Documentations from the Web

A Measurement Design for the Comparison of Expert Usability Evaluation and Mobile App User Reviews

Part I: Data Mining Foundations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Harnessing the Crowds for Automating the Identification of Web APIs

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Using Semantic Similarity in Crawling-based Web Application Testing. (National Taiwan Univ.)

LDA for Big Data - Outline

CS281 Section 9: Graph Models and Practical MCMC

A Deep Relevance Matching Model for Ad-hoc Retrieval

Applications of admixture models

Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

Spatial Latent Dirichlet Allocation

Pre-Requisites: CS2510. NU Core Designations: AD

Multi-label classification using rule-based classifier systems

Tag-based Social Interest Discovery

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Recommendation of APIs for Mashup creation

Data Mining Course Overview

Scalable Object Classification using Range Images

jldadmm: A Java package for the LDA and DMM topic models

Metadata Topic Harmonization and Semantic Search for Linked-Data-Driven Geoportals -- A Case Study Using ArcGIS Online

Lecture 25: Review I

Chapter 27 Introduction to Information Retrieval and Web Search

Classication of Corporate and Public Text

Search Engines. Information Retrieval in Practice

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model

Hebei University of Technology A Text-Mining-based Patent Analysis in Product Innovative Process

Collective Intelligence in Action

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

An Unsupervised Approach for Discovering Relevant Tutorial Fragments for APIs

Application of rough ensemble classifier to web services categorization and focused crawling

PTE : Predictive Text Embedding through Large-scale Heterogeneous Text Networks

Scalable Bayes Clustering for Outlier Detection Under Informative Sampling

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot

Text Document Clustering Using DPM with Concept and Feature Analysis

A Navigation-log based Web Mining Application to Profile the Interests of Users Accessing the Web of Bidasoa Turismo

Classification. 1 o Semestre 2007/2008

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Probabilistic Dialogue Models with Prior Domain Knowledge

Studying the Impact of Text Summarization on Contextual Advertising

LEARNING TO GENERATE CHAIRS WITH CONVOLUTIONAL NEURAL NETWORKS

Ontology based Model and Procedure Creation for Topic Analysis in Chinese Language

Warped Mixture Models

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

Estimating Labels from Label Proportions

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

Sampling Table Configulations for the Hierarchical Poisson-Dirichlet Process

Improving Recognition through Object Sub-categorization

INTRODUCTION TO DATA MINING. Daniel Rodríguez, University of Alcalá

Parallelism for LDA Yang Ruan, Changsi An

Hierarchical Document Clustering

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

Randomized Algorithms for Fast Bayesian Hierarchical Clustering

Introduction to Information Retrieval

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

Clustering Relational Data using the Infinite Relational Model

COMS 4771 Clustering. Nakul Verma

An Approach To Web Content Mining

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Supervised Learning for Image Segmentation

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Semantic Estimation for Texts in Software Engineering

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR

Knowledge Discovery and Data Mining 1 (VO) ( )

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

3D RECONSTRUCTION OF BRAIN TISSUE

Link Prediction for Social Network

Using PageRank in Feature Selection

Collective classification in network data

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

A REALM-BASED QUESTION ANSWERING SYSTEM USING PROBABILISTIC MODELING

Web Mining TEAM 8. Professor Anita Wasilewska CSE 634 Data Mining

Using PageRank in Feature Selection

Classifying Bug Reports to Bugs and Other Requests Using Topic Modeling

Science 2.0 VU Processing Science 2.0 Data, Content Mining

A Neuro Probabilistic Language Model Bengio et. al. 2003

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Multi-Label Classification with Conditional Tree-structured Bayesian Networks

PlantSimLab An Innovative Web Application Tool for Plant Biologists

Context Change and Versatile Models in Machine Learning

D B M G Data Base and Data Mining Group of Politecnico di Torino

Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference

Using Machine Learning to Optimize Storage Systems

QMiner is a data analytics platform for processing large-scale real-time streams containing structured and unstructured data.

Visualization and text mining of patent and non-patent data

How SPICE Language Modeling Works

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Text Analytics (Text Mining)

Automatic Labeling of Issues on Github A Machine learning Approach

Predicting Messaging Response Time in a Long Distance Relationship

Automatically Constructing a Directory of Molecular Biology Databases

LSHTC: A Benchmark for Large-Scale Classification

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

BUPT at TREC 2009: Entity Track

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Transcription:

Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web Chenghua Lin, Yulan He, Carlos Pedrinaci, and John Domingue Knowledge Media Institute, The Open University Milton Keynes, United Kingdom 11 th International Semantic Web Conference. Boston, USA, 2012

Outline Introduction The Feature LDA model Data Experimental Results Conclusion

Introduction What are Web APIs? New Web Services based on a simple stack of technologies, URL+HTTP+XML/JSON Known as RESTful service when conforming the REST principles Why Web APIs? Light technology stack VS. classical Web services (WSDL, SOAP, WS-*) Enable easy access and aggregation of collection of resources Widely used and reused

Finding a Web API Dedicated registries, e.g. ProgrammableWeb Contain out of date or incorrect information, e.g. invalid pages or incorrect links to APIs documentation pages Only a limited number of Web APIs listed, left out a large number of third party Web APIs General search engine, e.g. Google Not optimized for Web API discovery Mix up with pages that are not (so) relevant, e.g. blogs and advertisement about Web APIs.

Motivating Example Suppose we use Google to search sentiment analysis APIs

Our Goal Goal: To build a customized search engine for detecting third party Web APIs on the Web scale Assume every Web API provides public documentation page(s) These pages provide the most relevant information for developers Approached as a binary classification problem, i.e. distinguishing API documentation VS. normal pages

Easy Task? Issues No simple way to effectively and uniquely identify Web APIs described in plain and unstructured HTML highly heterogeneous in format and contents, i.e. NO Gold standard People hardly follow even there is!!! More than 99% of pages on the Web are NOT relevant to Web API Need a high precision classifier yet maintaining good accuracy

Our Approach The Feature LDA model

Latent Dirichlet Allocation (LDA) LDA: the simplest form of topic models gene A fully unsupervised Bayesian model dna Assumes that documents exhibit multiple topics cell (also known as theme or gist ) sequence genetics Each topic is a distribution over words which have mapping tight semantic relation with one another human molecular

Graphical Model Dirichlet parameter Topic assignment Dirichlet parameter Per-document Topic proportions Observed word Topic distributions Intuition: Each document exhibit multiple topics Each topic is a distribution over words Each word is drawn from one of those topics

????? A Colorful Example Generating a document with a LDA model Generate a document with a bulk of words What we know: the model parameters and distributions, i.e. (α, β, Θ, Φ) What we don t know: the WORDS of each document, which needs to be generated Per-word topic assignment Per-doc topic proportion Topics: gene 0.04 dna 0.02 cell 0.01 life 0.02 evolve 0.01 organism 0.01 brain 0.04 neuron 0.02 nerve 0.01 d a t a 0. 04 number 0.04 computer 0.04

In a real-world setting Inverse the entire process, to estimate the LDA model parameters given observation, i.e. a collection of documents What we know: the WORDS of each document, which we can observe What we don t know: the model parameters and distributions, i.e. (α, β, Θ, Φ) Topics:????

Applications Classification LDA Posterior inference Information retrieval Data Recommendation Data exploration

Feature LDA Feature LDA model: a generic probabilistic framework for text classification. A supervised four-layer hierarchical Bayesian model Accommodate supervisions from both labelled instance and labelled features for training Able to extract meaningful class specific topics Labelled features In baseball vs. hockey text classification pitcher baseball, puck hockey learned automatically from training data using any feature selection method, e.g. Info Gain

Dirichlet parameter Per-doc class label specific topic proportions Feature LDA Graphical Model Topic assignment Observed word Labelled features Dirichlet parameter Document class labels Per-doc class proportions Class label assignment Class label specific topic distributions

Generative Process For each document d Draw π d ~Dir(γ ε d ) For each class label k, draw θ d,k ~Dir(α k ) For each word w in d Draw a class label l i ~ Mult(π d ) Draw a topic z i ~ Mult(Θ d,li ) Draw a word w i ~ Mult(Φ li,zi )

Inference Collapse Gibbs sampling for model posterior estimation Approximating model parameters

Data and Setup The API dataset: 1,547 Web pages crawled from the API Home URLs of ProgrammableWeb (manually labelled, training/testing split: 80%-20%) 622 pages are API documentations 925 pages are normal Web pages Preprocessing Extract content from HTMLs by discarding tags and java scripts that are not relevant to classification remove wildcards, non-alphanumeric characters and stop-words, followed by Porter stemming. Setup Class label k = 2 Topic number T=1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20 29,000 labelled features (Info Gain)

Experimental Results We report fealda classification results on the API dataset with different model settings: Training with labelled instances Training with both labelled instances and labelled features fealda performance feature selection on labeled features fealda vs. baselines (NB, SVM, MaxEnt) and other supervised topic models (labellda, plda) Topic extraction

Training with labelled Instances fealda = 80.5% T = 2 MaxEnt = 79.3%

Labelled instances + labelled features fealda = 81.8% T = 3 MaxEnt = 79.3%

Feature Selection fealda = 82.7% 3.4% > MaxEnt fealda classification accuracy vs. different feature class probability threshold τ.

Overall comparison fealda vs. the state-of-the-art models outperforms three strong supervised baselines better than labeledlda and plda for more than 3% in accuracy gives very high precision: essential for reducing false positives when mining from the Web

Topic Extraction Topics with true API label Terms are fairly technical, e.g. json, statu, paramet, element, valu, request and string, etc. Topics with false API label Terms are less technical and more diverse, e.g. contact, support, custom, flight, school

Conclusions Discovering Web APIs is becoming increasingly important and existing support is not optimal Treat Web API discovery as a classification problem Presented a supervised topic model called fealda offers a generic framework for text classification Capable to encode supervision from both labelled instance and labelled features Offers very high precision which is crucial for reducing false positive when mining from the Web Able to extract class label specific topics

Questions? Email: chenghua.lin@open.ac.uk 26