Bagging & System Combination for POS Tagging. Dan Jinguji Joshua T. Minor Ping Yu

Bagging
- Bagging can gain substantially in accuracy
- The vital element is the instability of the learning algorithm
- Bagging slightly degrades the performance of stable algorithms

Bagging Results
- For all three learning algorithms, the performance of a single bag is slightly worse than the baseline
- With 10 bags, bagging affects the three baseline learning algorithms differently
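Not from the original slides: a minimal sketch of the bagging procedure, using a hypothetical most-frequent-tag unigram tagger as a stand-in for the trigram/MaxEnt/TBL learners actually used. Each bag is a bootstrap resample of the training sentences, and the bagged system tags by simple voting.

```python
import random
from collections import Counter

def train_unigram_tagger(tagged_sents):
    """Most-frequent-tag baseline: a stand-in for the real learners (trigram, MaxEnt, TBL)."""
    counts = {}
    for sent in tagged_sents:
        for word, tag in sent:
            counts.setdefault(word, Counter())[tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def bag(tagged_sents, n_bags, seed=0):
    """Train one tagger per bootstrap resample (sampling sentences with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_bags):
        sample = [rng.choice(tagged_sents) for _ in tagged_sents]
        models.append(train_unigram_tagger(sample))
    return models

def tag_by_vote(models, sent, default="NN"):
    """Simple voting over the bag: each model casts one vote per word."""
    out = []
    for word in sent:
        votes = Counter(m.get(word, default) for m in models)
        out.append(votes.most_common(1)[0][0])
    return out
```

Because the stand-in tagger is very stable, bagging it would not gain much; the slides' point is that the gain comes from instability in the base learner.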

Bagging Error Reduction
Bagging effectiveness over the baseline systems (% error reduction):

              10K      5K      1K
  TBL       6.898   5.547   6.030
  MaxEnt    6.132   8.675  14.467
  Trigram  -2.780  -3.974  -2.225

Bagging and Stability
- Bagging had the greatest effect on MaxEnt
- Bagging actually had a negative effect on trigrams
- Therefore we could say MaxEnt is the least stable of the ML algorithms tested

System Combination
Two basic methods (work on any number of inputs):
- Random Tag: choose one of the input tags at random
- Simple Voting: count each input tag, output the highest-count tag; ties go to the last tag seen
Weighted Voting (three base systems only):
- Train the ML systems on 80% of the training data
- Use the resulting accuracies as confidence scores for voting
- Basically like regular voting with a default to the best system (TBL)
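A sketch (mine, not the authors' code) of the two voting rules just described: simple voting with the tie-to-last rule, and weighted voting where each system's vote carries its held-out accuracy as a confidence score.

```python
from collections import Counter

def simple_vote(tags):
    """Majority vote over the systems' tags; ties go to the last tag seen."""
    counts = Counter(tags)
    best = max(counts.values())
    # Scan from the end so a tie resolves to the last tag seen.
    for tag in reversed(tags):
        if counts[tag] == best:
            return tag

def weighted_vote(tags, weights):
    """Each system's vote is weighted by its held-out accuracy."""
    scores = {}
    for tag, w in zip(tags, weights):
        scores[tag] = scores.get(tag, 0.0) + w
    return max(scores, key=scores.get)
```

When all three systems disagree, the highest single weight wins, which is the slide's "default to the best system" behavior.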

Combination Effectiveness
Combination-system effectiveness over the single-system average (% error reduction):

              10K      5K      1K
  Weighted  21.670  22.969  29.547
  Voting    20.744  22.055  26.946
  Random    -1.338  -0.555  -0.299

Hybrid Bagging
- Bags from all three systems combined into one pool, using regular voting
- Ideally would have used weighted voting
- Best performance of all:

                                    1K       5K      10K
  Accuracy                      92.6568  95.2401  95.8752
  Error reduction over voting     12.63     7.45     6.67
  Error reduction over average    36.17    27.86    26.03
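The error-reduction figures throughout these slides are relative reductions in error rate (100 minus accuracy); a one-line helper (mine, not the authors') makes the computation explicit.

```python
def error_reduction(baseline_acc, system_acc):
    """Percent reduction in error rate (100 - accuracy) relative to a baseline."""
    return 100.0 * (system_acc - baseline_acc) / (100.0 - baseline_acc)
```

For example, a system at 95% accuracy over a 90% baseline halves the error, a 50% error reduction, even though accuracy rose by only five points.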

Overall Improvement Comparison
% error reduction, by technique and training-set size:

                                   10K     5K     1K
  Random                         -1.34  -0.55  -0.30
  Voting                         20.74  22.06  26.95
  Weighted                       21.67  22.97  29.55
  Bagged Trigram                 -2.78  -3.97  -2.22
  Bagged MaxEnt                   6.13   8.68  14.47
  Bagged TBL                      6.90   5.55   6.03
  Combined Bagged over Voting     6.67   7.45  12.63
  Combined Bagged over Average   26.03  27.86  36.17

Overall Results
[Chart: accuracy (82-98%) for the baselines (Trigram, MaxEnt, TBL, Average), the bagged systems, and the combination systems (Random, Voting, Weighted, Combined Bagged), at training-set sizes 1K, 5K, 10K, and 40K]

Baseline Data
Baseline accuracy (%) by training-set size:

              1K      5K     10K     40K
  Trigram  85.680  92.116  93.438  95.359
  MaxEnt   88.304  93.548  94.629  96.342
  TBL      91.639  94.566  95.225  96.349

Data Trends
- Accuracy increases as the size of the training data increases
- The increase is not a linear function of training-set size
- Perhaps the change is proportional to the relative change in the size of the training data

Accuracy vs. Log (Training Size)

Error vs. Log (Training Size)

Log (Error) vs. Log (Training Size)

Does This Make Sense? We consider the proportional increase in the size of the training data: log(training data size). As that increases (for example, as the training data doubles), we see a proportional decrease in the log of the percentage error: log(100 - accuracy).

Does This Trend Continue? Some additional data points, at additional training-set sizes: 100, 250, 500, 2K, 4K, 7K, and 20K sentences.

Log (Error) vs. Log (Training Size)

Some Analysis
Linear least-squares regression of log(error) on log(training size):

           Y-intercept    Slope
  Trigram      2.1348   0.32732
  MaxEnt       2.1647   0.35931
  TBL          1.7467   0.26656

Correlation Coefficients
  Trigram: 0.995885
  MaxEnt:  0.995944
  TBL:     0.992777
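To illustrate the fit, here is a pure-Python least-squares regression of log10(error) on log10(training size), using only the four trigram numbers from the Baseline Data slide. With just these points the slope and intercept come out near, but not identical to, the values above, since the authors' fit also used the additional data points.

```python
import math

# Trigram baseline: (training sentences, accuracy %) from the Baseline Data slide
points = [(1000, 85.680), (5000, 92.116), (10000, 93.438), (40000, 95.359)]
xs = [math.log10(n) for n, _ in points]
ys = [math.log10(100.0 - acc) for _, acc in points]  # log of % error

n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

slope = sxy / sxx               # negative: error falls as training data grows
intercept = my - slope * mx
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient (near -1 here)
```

A slope magnitude around 0.3 means each doubling of the training data multiplies the error rate by roughly 2**(-0.3), about 0.81, i.e. close to a 20% relative error reduction per doubling.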

Potential Interpretations
- Y-intercepts: Trigram 2.1348, MaxEnt 2.1647, TBL 1.7467
  - Potentially a measure of robustness
- Slope values: Trigram 0.32732, MaxEnt 0.35931, TBL 0.26656
  - Potentially a measure of trainability: the responsiveness of the ML algorithm to data size
  - A "coefficient of training efficiency"?

Caveats
- It may be that we are in a sweet spot for these algorithms.
- However, this relationship does seem to hold over a broad range of practical training-data sizes: 100 sentences to 40K sentences.

Conclusions
- Bagging is effective for some algorithms and not others.
- System combination is a moderately effective way to maximize accuracy, especially if the ML algorithms involved model the data in different ways.
- Bagging and then combining systems is a good way to maximize accuracy, but it has major runtime drawbacks, such as computation time and system complexity.
- This experiment gave us some interesting quantitative comparison measures of three common ML algorithms, though it is hard to say whether these measures generalize.