ClearTK Tutorial. Steven Bethard. University of Colorado Boulder. Mon 11 Jun 2012.

ClearTK Tutorial
Steven Bethard
University of Colorado Boulder
Mon 11 Jun 2012

What is ClearTK?

- Framework for machine learning in UIMA components
  - Feature extraction from CAS
  - Common classifier interface across: OpenNLP Maxent, Mallet, GRMM, libsvm, SVMlight
  - Training and loading classifiers from JARs
- UIMA wrappers for non-UIMA components
  - Berkeley parser
  - ClearParser
  - MaltParser
  - Stanford CoreNLP
- In-house machine learning components, e.g. for TimeML

ClearTK Machine Learning Pipeline

[Diagram: a train pipeline (Feature Extraction, DataWriter, Training Data, Trainer) and a predict pipeline (Feature Extraction, Classifier)]

Simple Feature Extraction

All code examples use ClearTK trunk, r3912, May 10, 2012.

    public void process(JCas jCas) {
      // extract an annotation from the CAS
      Token token = ...;

      // create some features from it
      List<Feature> features = new ArrayList<Feature>();

      // create a feature from the text in the CAS
      int length = token.getCoveredText().length();
      features.add(new Feature("length", length));

      // create a feature from an annotation feature in the CAS
      String pos = token.getPartOfSpeech();
      features.add(new Feature("pos", pos));
    }
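The slide's snippet depends on UIMA and ClearTK types and leaves the token lookup elided, so it cannot run on its own. The following is a minimal, self-contained sketch of the same idea using only the Java standard library. SimpleFeature and SimpleFeatureExtraction are hypothetical stand-ins invented for this illustration; they are not ClearTK classes, and the token text and part of speech are passed in as plain strings rather than read from a CAS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a name/value feature (not the ClearTK Feature class).
final class SimpleFeature {
    final String name;
    final Object value;

    SimpleFeature(String name, Object value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public String toString() {
        return name + "=" + value;
    }
}

final class SimpleFeatureExtraction {
    // Builds the two features from the slide: covered-text length and part of speech.
    static List<SimpleFeature> extract(String coveredText, String partOfSpeech) {
        List<SimpleFeature> features = new ArrayList<SimpleFeature>();
        features.add(new SimpleFeature("length", coveredText.length()));
        features.add(new SimpleFeature("pos", partOfSpeech));
        return features;
    }

    public static void main(String[] args) {
        // prints [length=7, pos=NNP]
        System.out.println(extract("Boulder", "NNP"));
    }
}
```

Each feature is just a name paired with a value, which is all the later pipeline stages need to know about it.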

Feature Extractor Library

    public void process(JCas jCas) {
      // Unicode patterns, e.g. Az0 -> LuLlNd
      extractor = new CharacterCategoryPatternExtractor();

      // or features by navigating the CAS type system
      extractor = new TypePathExtractor(Token.class, "dep/head/pos");

      // or features from the surrounding context
      extractor = new ContextExtractor(Token.class,
          new CoveredTextExtractor(),
          new Preceding(2),
          new Ngram(new Following(3)));

      // apply the extractor to an annotation
      List<Feature> features = extractor.extract(annotation);
    }
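To make the "Az0 -> LuLlNd" example concrete, here is a small sketch of a character category pattern built with java.lang.Character.getType. It only illustrates the mapping and is not the implementation behind CharacterCategoryPatternExtractor. The class name CharacterCategoryPattern and the "Xx" fallback for unlisted categories are assumptions made for this sketch.

```java
final class CharacterCategoryPattern {
    // Maps one code point to a two-letter Unicode general category abbreviation.
    // Only four categories are named here; anything else becomes "Xx",
    // which is an arbitrary placeholder chosen for this sketch.
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
                return "Lu";
            case Character.LOWERCASE_LETTER:
                return "Ll";
            case Character.DECIMAL_DIGIT_NUMBER:
                return "Nd";
            case Character.DASH_PUNCTUATION:
                return "Pd";
            default:
                return "Xx";
        }
    }

    // Concatenates the category of every code point in the text.
    static String pattern(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            sb.append(category(codePoint));
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints LuLlNd
        System.out.println(pattern("Az0"));
    }
}
```

Iterating by code point rather than by char keeps the pattern correct for characters outside the Basic Multilingual Plane.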

Classifier Annotators (CleartkAnnotator)

    public void process(JCas jCas) {
      // extract features
      List<Feature> features = ...;

      // during training, create instances from CAS
      if (this.isTraining()) {
        String outcome = ...; // e.g. token.getPos()
        this.dataWriter.write(new Instance(outcome, features));
      }

      // during prediction, create CAS annotations
      else {
        String outcome = this.classifier.classify(features);
        // e.g. token.setPos(outcome)
      }
    }
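The train/predict switch in the slide can be tried out without UIMA using a toy component. The sketch below is an illustration under stated assumptions: BaselineTagger is a hypothetical class, its "model" is just the most frequent outcome seen during training (a most-frequent-value baseline), and features are plain strings. None of this is ClearTK API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an annotator that either writes training
// instances or classifies, depending on its mode.
final class BaselineTagger {
    private final String model; // null means training mode
    private final List<String> instances = new ArrayList<String>();

    BaselineTagger(String model) {
        this.model = model;
    }

    boolean isTraining() {
        return model == null;
    }

    // Mirrors the slide's process(): record an instance when training,
    // otherwise return the predicted outcome.
    String process(List<String> features, String goldOutcome) {
        if (isTraining()) {
            instances.add(goldOutcome + "\t" + features);
            return goldOutcome;
        }
        return model; // this baseline ignores the features
    }

    // Stand-in "trainer": returns the most frequent outcome among the
    // recorded instances (the first one to reach the top count wins ties).
    String train() {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String best = null;
        for (String instance : instances) {
            String outcome = instance.substring(0, instance.indexOf('\t'));
            int count = counts.merge(outcome, 1, Integer::sum);
            if (best == null || count > counts.get(best)) {
                best = outcome;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        BaselineTagger trainer = new BaselineTagger(null);
        trainer.process(Arrays.asList("length=3"), "DT");
        trainer.process(Arrays.asList("length=3"), "NN");
        trainer.process(Arrays.asList("length=1"), "DT");
        BaselineTagger tagger = new BaselineTagger(trainer.train());
        // prints DT
        System.out.println(tagger.process(Arrays.asList("length=5"), null));
    }
}
```

The point of the pattern is that feature extraction is written once and shared by both modes; only the last step differs.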

Train Pipeline with CleartkAnnotator

    // create UIMA descriptor for train pipeline
    AnalysisEngineFactory.createPrimitiveDescription(
        MyCleartkAnnotator.class,
        // specify type of classifier to write data for
        DefaultDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME,
        MultiClassLIBSVMDataWriter.class.getName(),
        // specify output directory for training data
        DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
        dir.getPath());

    // run UIMA train pipeline
    ...

    // train classifier and package into a jar file
    JarClassifierBuilder.trainAndPackage(dir);
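The slides do not show what the data writer puts in the output directory. As a rough illustration of what "training data" for an SVM library can look like, the sketch below encodes instances in LIBSVM's sparse text format, where each line is a numeric label followed by ascending index:value pairs. SparseInstanceEncoder is a hypothetical class, and treating every string feature as a binary indicator with value 1 is an assumption of this sketch; the encoding ClearTK actually uses may differ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical encoder that turns (outcome, features) pairs into
// LIBSVM-style lines: "<label> <index>:<value> ...".
final class SparseInstanceEncoder {
    private final Map<String, Integer> featureIds = new LinkedHashMap<String, Integer>();
    private final Map<String, Integer> outcomeIds = new LinkedHashMap<String, Integer>();

    // Returns the existing id for the key, or assigns the next 1-based id.
    private static int idFor(Map<String, Integer> ids, String key) {
        Integer id = ids.get(key);
        if (id == null) {
            id = ids.size() + 1;
            ids.put(key, id);
        }
        return id;
    }

    String encode(String outcome, List<String> features) {
        // TreeMap keeps feature indices in ascending order, as the format expects.
        TreeMap<Integer, Integer> sorted = new TreeMap<Integer, Integer>();
        for (String feature : features) {
            sorted.put(idFor(featureIds, feature), 1);
        }
        StringBuilder line = new StringBuilder();
        line.append(idFor(outcomeIds, outcome));
        for (Map.Entry<Integer, Integer> entry : sorted.entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        SparseInstanceEncoder encoder = new SparseInstanceEncoder();
        // prints 1 1:1 2:1
        System.out.println(encoder.encode("DT", Arrays.asList("pos=DT", "length=3")));
        // prints 2 2:1 3:1
        System.out.println(encoder.encode("NN", Arrays.asList("length=3", "pos=NN")));
    }
}
```

The same id maps have to be kept alongside the trained model so that features and outcomes can be translated consistently at prediction time.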

Predict Pipeline with CleartkAnnotator

    // create UIMA descriptor for predict pipeline
    AnalysisEngineFactory.createPrimitiveDescription(
        MyCleartkAnnotator.class,
        // specify where to load the classifier model from
        GenericJarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH,
        new File(dir, "model.jar"));

    // run UIMA predict pipeline
    ...
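Packaging a model in a JAR and loading it back can be sketched with java.util.jar alone. This is only an illustration of the idea. The ModelJar class, the entry name "model.txt" and the one-line text model are all hypothetical, and the layout of the JAR that ClearTK writes is not described in the slides.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Hypothetical helper that stores a text model as a single JAR entry.
final class ModelJar {
    static void write(File jar, String entryName, String model) throws IOException {
        try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
            out.putNextEntry(new JarEntry(entryName));
            out.write(model.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
    }

    static String read(File jar, String entryName) throws IOException {
        try (JarFile jarFile = new JarFile(jar)) {
            JarEntry entry = jarFile.getJarEntry(entryName);
            if (entry == null) {
                throw new IOException("missing entry: " + entryName);
            }
            try (InputStream in = jarFile.getInputStream(entry)) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File jar = File.createTempFile("model", ".jar");
        jar.deleteOnExit();
        write(jar, "model.txt", "DT");
        // prints DT
        System.out.println(read(jar, "model.txt"));
    }
}
```

Keeping the model in one file means the predict pipeline needs only a single path parameter to locate it.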

Summary

- ClearTK machine learning framework:
  - Feature extraction from CAS
  - Common classifier interface to many libraries
  - Training and loading classifiers from JARs
- Many more features not covered in this talk, including:
  - Sequence tagging (e.g. CRFs, Viterbi over k-best)
  - Chunking (e.g. BIO tagging)
  - Evaluation (e.g. cross-validation with UIMA pipelines)
  - Regression and ranking (via SVM-light)
  - Baselines (e.g. most frequent value, mean value)
