Good Cell, Bad Cell: Classification of Segmented Images for Suitable Quantification and Analysis

Derek Macklin, Haisam Islam, Jonathan Lu
December 4, 2012

Abstract. While open-source tools exist to automatically segment and track cells in time-lapse microscopy experiments, the resulting output must be inspected to detect and remove spurious artifacts. In addition, data not exhibiting proper experimental controls must be discarded before performing downstream analysis. Currently, the data set is manually examined to accomplish these tasks. Here we present machine learning approaches for classifying the suitability of data traces for downstream processing and analysis. We integrate the most successful approaches into the experimental workflow.

I. INTRODUCTION

Time-lapse live-cell microscopy is a powerful technique for studying the temporal dynamics of signaling pathways in single cells. Using fluorescently-labeled proteins, researchers can perform experiments to observe the interactions of signaling components and understand the underlying mechanisms of action. Analyzing data from these experiments requires image processing to quantify the spatio-temporal dynamics of the labeled proteins [1]. Recently, sophisticated open-source software packages have been released to help automate biological image processing tasks such as cell segmentation [2] and tracking [3]. While these packages can be combined to create an image processing pipeline, the data traces that are produced must be inspected to determine whether they are appropriate for inclusion in downstream analysis. Figure 1 shows an overview of the workflow.

Data traces produced by the image processing pipeline are excluded from further scientific analysis for two reasons. One reason is that the segmentation algorithm occasionally fails to uniquely identify and separate adjacent cells, thereby introducing large artifacts in some of the generated traces. Another reason is that some traces may not exhibit features associated with proper experimental controls. For example, some cells may not contain any fluorescently-labeled protein when the labeling process is not 100% efficient. Presently, cells and their corresponding traces are visually inspected to determine whether they should be analyzed further. In this paper we present our efforts to apply supervised learning techniques to automate this classification of good and bad cells.

II. METHODS

A. Data

Three researchers from the Covert Lab at Stanford University have provided data sets containing time courses of multiple features per cell, along with the associated good or bad labels. Each of the three data sets reports features and labels for approximately 200-300 cells. The features for each cell are produced by CellProfiler [2], which reports approximately 40 properties, including MeanIntensity, StdIntensity, MeanIntensityEdge, Area, Perimeter, and Eccentricity. These properties are reported for each cell in each frame, for all frames, thus generating two-dimensional time-series traces representing each cell. When a cell is not in the field of view, its features are all recorded as NaN for that frame. We have chosen not to aggregate the data from the three researchers because each researcher uses different experimental conditions and thus has different criteria for classifying cells as good or bad. We seek to develop and demonstrate an algorithm which works for each researcher when trained on the properties produced by CellProfiler in conjunction with the manually-determined labels.
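Concretely, one way such data might be laid out in MATLAB is sketched below. This is our illustration rather than the authors' code; numCells, numFrames, and numProps are made-up names and sizes.

    % Sketch of a possible per-cell layout: one [numProps x numFrames] matrix
    % per cell, with NaN columns for frames where the cell is out of view.
    numCells = 250;  numFrames = 60;  numProps = 40;   % illustrative sizes
    traces = cell(numCells, 1);
    for c = 1:numCells
        traces{c} = nan(numProps, numFrames);          % all-NaN to start
        % columns are filled in for frames where CellProfiler reports the cell
    end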
B. Features: Part I

The provided data sets are not immediately amenable to machine learning algorithms. Each cell is represented by two dimensions of data: (1) the properties reported by CellProfiler, and (2) time. In addition, cells remain in the field of view for differing amounts of time. Thus, many commonly-used techniques that require input vectors to lie in the same n-dimensional space are not well-suited unless missing values are carefully imputed for the frames in which a cell is outside the field of view.

Our initial approach is to choose features that are summary statistics of the temporal dynamics of each property reported by CellProfiler; this circumvents the need to impute missing values. We take the minimum, maximum, mean, and standard deviation of each property over all frames in which a cell is in the field of view, thereby representing each cell with one dimension of data. We recognize that this transformation discards many of the temporal features that may be critical for this particular classification problem. Initially, though, we have decided to investigate different classification algorithms with this overview feature set.

Prior to training the different algorithms, we examine the pairwise correlations of all the features and remove highly-correlated features. We also remove features which have very low correlation with the observed labels.
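A minimal MATLAB sketch of this summary-statistic step (not the authors' code; it assumes the traces layout sketched above and the nan* helpers from the Statistics Toolbox):

    % Collapse each cell's [numProps x numFrames] trace into 4*numProps summary
    % statistics, ignoring NaN (out-of-view) frames.
    X = zeros(numCells, 4 * numProps);
    for c = 1:numCells
        T = traces{c};
        X(c, :) = [nanmin(T, [], 2)', nanmax(T, [], 2)', ...
                   nanmean(T, 2)', nanstd(T, 0, 2)'];
    end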

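The correlation-based pruning could then look like the following sketch; the 0.95 and 0.05 thresholds are illustrative assumptions, not values from the paper.

    % Drop one member of each highly correlated feature pair, then drop
    % features nearly uncorrelated with the labels y (a 0/1 vector).
    R = corrcoef(X);
    keep = true(1, size(X, 2));
    for i = 1:size(X, 2)
        for j = i+1:size(X, 2)
            if keep(i) && keep(j) && abs(R(i, j)) > 0.95
                keep(j) = false;                 % redundant with feature i
            end
        end
    end
    ry = corr(X, y);                             % feature-label correlations
    keep = keep & (abs(ry') > 0.05);
    X = X(:, keep);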
Fig. 1: A schematized (and satirized) overview of the experimental workflow, from data collection to publication. This project seeks to minimize the manual classification of data that must occur prior to the scientific analysis required for publication.

C. Features: Part II

We augment our initial feature vector (described above) with more nuanced temporal information to improve classifier performance. For cells that remain within the field of view for at least 20 frames, we compute the Fast Fourier Transform (FFT) on a sliding window of 20 time points and average the successive windows, thereby generating a vector in R^20 that represents the frequency content of differing-length traces. We append this to our initial feature vector. For cells that are recorded for fewer than 20 frames, we append a zero vector.
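A sketch of this windowed-FFT feature, where T is a cell's property-by-frame matrix, propIdx a chosen property row, and x the cell's growing feature vector (all our names); the stride-1 overlap is our reading of "sliding":

    % Frequency-content feature for one cell: average the magnitude spectra of
    % sliding 20-frame windows; short traces get a zero vector instead.
    W = 20;
    ts = T(propIdx, ~isnan(T(propIdx, :)));      % one property's in-view trace
    if numel(ts) >= W
        nWin = numel(ts) - W + 1;
        F = zeros(nWin, W);
        for k = 1:nWin
            F(k, :) = abs(fft(ts(k:k+W-1)));     % magnitude spectrum of window k
        end
        fftFeat = mean(F, 1);                    % average successive windows
    else
        fftFeat = zeros(1, W);
    end
    x = [x, fftFeat];                            % append to the feature vector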
D. Features: Part III

To further improve performance, we add custom features that capture the characteristics each researcher takes into consideration when classifying his or her data set. For example, in Data Sets 2 and 3, researchers look for mean intensity changes in the first 10-20 frames. These intensity changes need not be large in absolute magnitude but must be noticeable relative to an initial baseline. Thus we normalize the MeanIntensity time-series property to its starting and maximum values and append features describing the derivative of this normalized trace in the first frames. For Data Set 3 we also append a feature indicating how long a cell remains within the field of view. For Data Set 1 we append features describing the temporal dynamics of Area, since researchers look at the dynamics of cell area when performing manual classification for this data set.

E. Logistic regression

We perform logistic regression on each data set using the glmfit and glmval functions in MATLAB (The MathWorks, Natick, MA, USA). We have made a slight modification to glmfit to allow it to iterate beyond the default iteration limit.

F. Naive Bayes

We implement a Naive Bayes classifier with a multinomial event model. Since our features are continuous-valued, we discretize them into a finite number of bins so that we can model p(x_i | y) for each feature x_i as a multinomial distribution. To perform the discretization, we take the maximum and minimum values of each feature and divide that range into a number of bins. To select the number of bins, we perform cross-validation over a range of bin counts and select the number which maximizes the area under the receiver operating characteristic curve.

G. Support vector machine

We train a support vector machine (SVM) using the LIBSVM library [4]. We select a radial basis function kernel and choose appropriate parameters by performing a grid search in log-space to maximize the area under the receiver operating characteristic curve obtained by cross-validation.
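The equal-width binning behind the Naive Bayes model of Section II-F might look like this sketch, where nBins stands in for the cross-validated bin count:

    % Discretize each continuous feature into nBins equal-width bins spanning
    % its observed [min, max] range; Xd holds the bin index of each value.
    nBins = 10;                                  % illustrative; chosen by CV on AUC
    Xd = zeros(size(X));
    for j = 1:size(X, 2)
        lo = min(X(:, j));  hi = max(X(:, j));
        edges = linspace(lo, hi, nBins + 1);
        edges(end) = edges(end) + eps(hi);       % make the top edge inclusive
        [~, Xd(:, j)] = histc(X(:, j), edges);
    end
    % p(x_j = b | y) is then a multinomial estimated from per-class bin counts
    % (with smoothing), as in a multinomial-event-model Naive Bayes.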

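The grid search of Section II-G maps onto LIBSVM's MATLAB interface roughly as below; the grid ranges, the hold-out split, and the use of probability outputs for scoring are our assumptions.

    % RBF-kernel grid search over (C, gamma) in log-space, scored by ROC AUC.
    bestAuc = -inf;
    for logC = -5:2:15
        for logG = -15:2:3
            opts  = sprintf('-t 2 -c %g -g %g -b 1 -q', 2^logC, 2^logG);
            model = svmtrain(ytrain, Xtrain, opts);
            [~, ~, prob] = svmpredict(ytest, Xtest, model, '-b 1');
            pos = find(model.Label == 1);        % column of the positive class
            [~, ~, ~, auc] = perfcurve(ytest, prob(:, pos), 1);
            if auc > bestAuc
                bestAuc = auc;  bestOpts = opts;
            end
        end
    end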
Fig. 2: Initial features (cf. Section II-B) of each data set projected onto the first two principal components u1 and u2. (a) Data Set 1; (b) Data Set 2; (c) Data Set 3.

H. Evaluation of learning algorithms

For each learning algorithm we perform hold-out cross-validation, training on one split of the data and testing on the held-out remainder. Because the labels are skewed, with 10-20% of cells marked good depending on the researcher, we report area under the curve (AUC) rather than accuracy. In general we display and report AUC for precision-recall curves (AUC_PR). We also report AUC for receiver operating characteristic curves (AUC_ROC), which compare true-positive and false-positive rates. The relationship between precision-recall (PR) and receiver operating characteristic (ROC) curves is discussed in [5], where the authors suggest that PR curves can be more informative for skewed data sets. To generate PR and ROC curves and compute their respective AUCs, we generally use MATLAB's perfcurve function. For SVMs we use slight modifications of LIBSVM's plotroc function (which calls perfcurve).

I. Principal component analysis

To visualize our data in two dimensions, we use the svd function in MATLAB after standard preprocessing to normalize the mean and variance.

J. Pipeline integration

We integrate the most successful classifiers into the image processing pipeline to augment the manual classification process (the motivation for this project). To achieve this, we save the models and any associated normalization constants so that new data can be appropriately transformed prior to classification.

III. RESULTS

A. Classification results: Part I

AUC_ROC and AUC_PR values for the three classification algorithms, trained and tested on our first set of selected features (cf. Section II-B), are reported in Table I.

TABLE I: Classifier performance using the initial feature set.

                            Data Set 1         Data Set 2         Data Set 3
                            AUC_ROC  AUC_PR    AUC_ROC  AUC_PR    AUC_ROC  AUC_PR
    Logistic Regression     0.8      0.48      0.94     0.72      0.6      0.1
    Naive Bayes             0.92     0.82      0.93     0.7       0.96     0.8
    Support Vector Machine  0.87     0.63      0.9      0.79      0.93     0.4

We see that logistic regression does not perform well for Data Sets 1 and 3, while its performance is much more comparable to the Naive Bayes and SVM classifiers for Data Set 2. To investigate this, we examine Figure 2, which shows the initial features for all three data sets projected onto the first two principal components. The data are not linearly separable in this low-dimensional projection for any of the data sets, but for Data Set 2 there is a cluster of good cells somewhat removed from the bad cells. This may enable logistic regression to better separate Data Set 2. However, the best-performing classifier for Data Set 2 is the SVM. To our surprise, Naive Bayes, when appropriately discretized for each data set, performs better than the SVM for Data Sets 1 and 3.
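The two-dimensional projections of Figure 2 can be produced as in the following sketch of the procedure in Section II-I (standardize, then svd); this is our illustration, not the authors' code.

    % Standardize features, then project onto the first two right singular
    % vectors (the leading principal components u1, u2 of Fig. 2).
    mu = mean(X);  sigma = std(X);
    Z = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
    [~, ~, V] = svd(Z, 'econ');
    proj = Z * V(:, 1:2);
    gscatter(proj(:, 1), proj(:, 2), y);         % color points by good/bad label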

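Likewise, the AUC computations of Section II-H map directly onto perfcurve; a sketch, with scores standing for any classifier's real-valued outputs on the held-out set:

    % ROC AUC (default criteria) and PR AUC (recall on x, precision on y).
    [fpr, tpr, ~, aucROC] = perfcurve(ytest, scores, 1);
    [rec, prec, ~, aucPR] = perfcurve(ytest, scores, 1, ...
                                      'XCrit', 'reca', 'YCrit', 'prec');
    plot(rec, prec);  xlabel('Recall');  ylabel('Precision');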
B. Classification results: Part II

After characterizing the performance of the classification algorithms on the initial feature set, we attempt to make improvements by adding more discriminative features (discussed in Sections II-C and II-D), first by incorporating information describing the temporal dynamics. In particular, we focused on Data Set 2, since the researcher who provided it is currently the most invested in the automation of the classification task. Figure 3b shows the precision-recall (PR) curve of the SVM trained on the feature set described in Section II-C. As a reference, Figure 3a shows the PR curve when the SVM is trained on the initial feature set described in Section II-B. We see that the new feature set, which includes a representation of the temporal dynamics, improves SVM performance. We note that the new feature set also improves SVM performance for Data Sets 1 and 3, but the performance is still less than that of the Naive Bayes classifier trained on the initial feature set (data not shown).

Fig. 3: Precision-recall curves for the SVM classifiers trained on Data Set 2: (a) initial features (cf. Section II-B), AUC = 0.79; (b) with FFTs (cf. Section II-C), AUC = 0.8; (c) with data-specific features (cf. Section II-D), AUC = 0.9. Performance, as measured by the area under the curve, increases as the feature set is improved and refined.

C. Classification results: Part III

We attempt to further improve classifier performance by incorporating features that reflect the criteria used by the researchers when manually labeling their data sets, as described in Section II-D. Figure 3c shows the PR curve of the SVM trained on the full feature set. We see that the new features have once again improved performance in classifying Data Set 2. In addition, the full feature set improves SVM performance for Data Set 3 over that of Naive Bayes (trained on any of the feature sets); Figure 4 shows the associated PR curve. The new features did not improve SVM performance on Data Set 1 over that of a Naive Bayes classifier trained on the initial feature set (cf. Section II-B); the PR curve for the Naive Bayes classifier trained on this data set is shown in Figure 5.

Fig. 4: Precision-recall curve for the SVM classifier trained on Data Set 3 using the full feature set (cf. Section II-D); AUC = 0.73.

Fig. 5: Precision-recall curve for the Naive Bayes classifier trained on Data Set 1 using the initial feature set (cf. Section II-B); AUC = 0.82.

D. Pipeline Integration

The best-performing classifiers are integrated into the experimental workflow at the end of the image processing pipeline. The classifiers make an initial attempt at classifying cells and allow the researchers to decide the final labels. The researchers use these initial labels as they rapidly search through their data sets to identify and prioritize which results are worth manually classifying. This was not possible when the overwhelming number of bad time-series traces were not differentiated from, and therefore masked, the good time-series traces.
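Operationally, the hand-off described in Sections II-J and III-D amounts to persisting the model together with its normalization constants and scoring new traces as they arrive. A sketch under those assumptions; extractFeatures is a hypothetical stand-in for the feature construction of Sections II-B through II-D, and the file name is illustrative.

    % Training side: save the classifier and the constants used to normalize it.
    mu = mean(Xtrain);  sigma = std(Xtrain);
    save('classifier.mat', 'model', 'mu', 'sigma');

    % Pipeline side: transform new traces identically, then score them.
    S = load('classifier.mat');
    F = extractFeatures(newTraces);                   % hypothetical helper
    Xnew = bsxfun(@rdivide, bsxfun(@minus, F, S.mu), S.sigma);
    dummy = zeros(size(Xnew, 1), 1);                  % labels unknown at this point
    labels = svmpredict(dummy, Xnew, S.model);        % initial good/bad labels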

IV. CONCLUSION

In this work we present our effort to augment and partially automate some of the manual classification tasks involved in processing data from time-lapse microscopy experiments. We have integrated the best-performing classifiers into the experimental workflow, enabling researchers to quickly observe trends in the data so that they can prioritize their analysis tasks. We hope to make further improvements in the future.

ACKNOWLEDGEMENTS

We thank Jake Hughey, Keara Lane, and Sergi Regot from the Covert Lab for providing their labeled data sets and for offering their feedback as we iterated on designs. We thank Professor Andrew Ng and the teaching staff for their guidance this quarter.

REFERENCES

[1] T. K. Lee and M. W. Covert, "High-throughput, single-cell NF-κB dynamics," Current Opinion in Genetics and Development, vol. 20, no. 6, pp. 677-683, 2010.
[2] L. Kamentsky, T. R. Jones, A. Fraser, M.-A. Bray, D. J. Logan, K. L. Madden, V. Ljosa, C. Rueden, K. W. Eliceiri, and A. E. Carpenter, "Improved structure, function, and compatibility for CellProfiler: modular high-throughput image analysis software," Bioinformatics, 2011.
[3] D. Blair and E. Dufresne, "The MATLAB particle tracking code repository." [Online]. Available: http://physics.georgetown.edu/matlab/
[4] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[5] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in Proceedings of the 23rd International Conference on Machine Learning (ICML '06). New York, NY, USA: ACM, 2006, pp. 233-240. [Online]. Available: http://doi.acm.org/10.1145/1143844.1143874