Semen fertility prediction based on lifestyle factors

Florence Koskas, Axel Guyon, Yoann Buratti
CS 229 - 12/12/2014

1 Introduction

Although many people still think of infertility as a "woman's problem," in about 40% of infertile couples the man is the sole cause or a contributing cause of the inability to conceive. Indeed, subfertility affects one in 20 men [1]. Male infertility has many causes: not only hormonal imbalances and/or physical problems, but also psychological and/or behavioral problems. The aim of this project is to evaluate the importance of some lifestyle and environmental factors in predicting infertility.

2 Methodology

2.1 Study Population

We use the data collected and shared [2] in 2013 by the Department of Biotechnology of the University of Alicante: 100 young, healthy volunteers, recruited among students aged between 18 and 36 years old, provided a semen sample for analysis as well as their socio-demographic data, environmental factors, health status, and life habits. Students with previously known reproductive alterations were excluded from the statistical analysis.

2.2 Features and labels available from questionnaires

We have 9 features available for our 100 training examples:

1. Season in which the analysis was performed: winter (-1), spring (-0.33), summer (0.33), fall (1)
2. Age at the time of analysis: between 18 and 36 years old, scaled to [0, 1]
3. Childhood diseases (e.g. chicken pox, measles, mumps, polio): yes (0), no (1)
4. Accident or serious trauma: yes (0), no (1)
5. Surgical intervention: yes (0), no (1)
6. High fevers in the last year: less than three months ago (-1), more than three months ago (0), no (1)
7. Frequency of alcohol consumption: several times a day (0), every day (0.2), several times a week (0.4), once a week (0.6), hardly ever (0.8) or never (1)
8. Smoking habit: never (-1), occasional (0) or daily (1)
9. Number of hours spent sitting per day: between 1 and 16, scaled to [0, 1]

Our output is the semen analysis diagnosis: the label is N when the semen is normal and O when it is altered. We map the labels as follows for our algorithms:

- Semen is normal: N -> 0
- Semen is altered: O -> 1

[1] The National Center for Biotechnology Information, U.S. National Library of Medicine
[2] https://archive.ics.uci.edu/ml/datasets/fertility
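
Below is a minimal data-loading sketch in R, the language we use for the decision tree later on. The file name fertility_Diagnosis.txt, the absence of a header row and the column names are assumptions on our part; the columns follow the order of the feature list above.

    # Minimal data-loading sketch (R). Assumes fertility_Diagnosis.txt has been
    # downloaded locally from the UCI repository referenced above; the file is
    # assumed to be comma-separated with no header row, and the column names are
    # our own choice, following the order of the feature list in section 2.2.
    cols <- c("season", "age", "child_disease", "trauma", "surgery",
              "fever", "alcohol", "smoking", "sitting", "diagnosis")
    fertility <- read.csv("fertility_Diagnosis.txt", header = FALSE, col.names = cols)

    # Map the diagnosis to a numeric label: normal (N) -> 0, altered (O) -> 1
    fertility$label <- ifelse(fertility$diagnosis == "O", 1, 0)
    fertility$diagnosis <- NULL

    str(fertility)          # 100 observations, 9 already-scaled features + label
    table(fertility$label)  # the data set is heavily skewed towards normal (0)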

2.3 Roadmap

For this project, our main goals are:

- Predict a young man's fertility based on his answers to 9 personal history and lifestyle questions
- Evaluate which of these 9 features have the greatest impact on the prediction's outcome

To achieve these goals, we used different machine learning algorithms and drew the following insights:

- Logistic regression and kernelized SVM: the data is not linearly separable, even in the high-dimensional feature space generated by a Gaussian or sigmoid kernel.
- Decision tree algorithm: this algorithm gives us better results, with a relatively simple decision tree.

3 Logistic regression and Kernelized SVM

3.1 Logistic regression

We implement a logistic regression fitted with Newton's method. The results show that the data is not linearly separable. Although this approach is unsuccessful, we can derive some insights from the parameter vector theta found by the regression: comparing the absolute values of its coefficients gives a sense of which features have the highest impact on the outcome. Our observations confirm our intuitions; for example, we find that age is negatively correlated with fertility, which seems quite reasonable. We obtain:

    Feature                  θ_Newton
    Intercept term           1.8748
    Season                   0.2292
    Age                      1.8151
    Child disease            0.1530
    Accident or trauma       0.05941
    Surgical intervention    0.0695
    High fever               0.2426
    Alcohol consumption      1.2262
    Smoking habit            0.0699
    Sitting habit            0.8409

The three most heavily weighted features therefore seem to be (in order): age, frequency of alcohol consumption and sitting habit. The training error is equal to 12%, which shows that we cannot separate the data (there are 12 positive examples out of 100, so a classifier that labels every example as negative already achieves this error). We also plotted the different features against each other and did not notice any particular correlation between the features.

3.2 Kernelized SVM

Applying logistic regression taught us that the data is not linearly separable in the current feature space. Hence, we choose to run an SVM algorithm with l1 regularization. The objective is to find a separating hyperplane in a higher-dimensional feature space generated by applying a kernel to the data. We use a polynomial, a Gaussian and a sigmoid kernel. We find that the data is not separable in the high-dimensional feature space for any of the kernels used. However, the data seems to show structure that is not captured by the kernelized SVM. To potentially overcome this instance of underfitting and separate the data, we believe we would need additional relevant features characterizing our training set.
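
As a rough illustration of the section 3 experiments, the sketch below fits a logistic regression with R's glm (whose iteratively reweighted least squares procedure coincides with Newton's method for the logistic model) and kernelized SVMs with the e1071 package. Note that e1071::svm fits the usual soft-margin (l2-regularized) SVM rather than the l1-regularized formulation we describe, so this is only an approximation of our setup; it reuses the fertility data frame from the loading sketch in section 2.2.

    # Sketch only: glm-based logistic regression and e1071 SVMs as stand-ins for
    # the models described in section 3.
    library(e1071)  # provides svm()

    # Logistic regression via glm (IRLS, i.e. Newton's method for this model)
    logit <- glm(label ~ ., data = fertility, family = binomial)
    coefs <- coef(logit)
    sort(abs(coefs[-1]), decreasing = TRUE)   # rank features, intercept dropped

    # Training error of the logistic regression
    train_err <- mean((predict(logit, type = "response") > 0.5) != fertility$label)

    # Kernelized SVMs; "radial" is the Gaussian kernel. e1071 uses the standard
    # l2 soft margin, not the l1-regularized SVM mentioned in the text.
    fert_svm <- fertility
    fert_svm$label <- factor(fert_svm$label)
    for (k in c("polynomial", "radial", "sigmoid")) {
      fit <- svm(label ~ ., data = fert_svm, kernel = k)
      acc <- mean(predict(fit, fert_svm) == fert_svm$label)
      cat(k, "kernel - training accuracy:", acc, "\n")
    }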

Below is the result of the SVM algorithm using a Gaussian kernel. We use PCA to represent the data: the X and Y axes are the two directions that maximize the variance of the projections of the data onto a 2-dimensional space, and these two principal components are linear combinations of the features.

Figure 1: Red and green data points respectively represent positive and negative outcome examples. The kernelized SVM predicts all training examples to be negative (green background).

4 Decision Tree algorithm

4.1 Main idea

A decision tree is a non-linear model that splits the training examples into different bags of data, called leaves, so that we can make a more accurate prediction in each leaf. To build the tree and cluster the data into these leaves, we select the feature with the heaviest weight in a linear regression, split the data above and below a certain threshold for this feature, and recurse on each newly created subset. When the tree depth exceeds a specific, carefully chosen threshold, we stop, and the most recently created subsets are our leaves. The proportion of negative or positive data points in each leaf is an estimate of the probability of being negative or positive, respectively, for any new test example that falls into that leaf.

Using the optimal tree depth is key to the performance of the algorithm: growing a deeper tree leads to very few data points per leaf and results in overfitting, while stopping too soon leads to underfitting. How do we find the optimal tree depth, i.e. the Complexity Parameter? If we plot the test error as a function of the tree depth, E_test = f(d), we can find the depth D that achieves the minimum of the test error. A usual choice of Complexity Parameter is one standard deviation below this minimum:

    CP = D - σ_test error

Figure 2: Decision tree algorithm and Complexity Parameter derivation
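
The sketch below shows how a complexity-parameter search of this kind can be done with rpart, the R package we use in section 4.2. rpart's cp parameter controls cost-complexity pruning rather than the depth directly, and the one-standard-error rule is applied here to the cross-validated error from its cptable, so this is an approximation of the procedure described above rather than a literal transcription.

    # Sketch of complexity-parameter selection with rpart.
    library(rpart)

    fert <- fertility
    fert$label <- factor(fert$label, levels = c(0, 1), labels = c("normal", "altered"))

    # Grow a deliberately deep tree first, then prune it back
    full_tree <- rpart(label ~ ., data = fert, method = "class",
                       control = rpart.control(cp = 0.001, minsplit = 5))

    cptab <- full_tree$cptable
    best  <- which.min(cptab[, "xerror"])
    # Simplest tree whose cross-validated error is within one standard error of
    # the minimum (cptable rows are ordered from fewest to most splits)
    threshold <- cptab[best, "xerror"] + cptab[best, "xstd"]
    chosen_cp <- cptab[min(which(cptab[, "xerror"] <= threshold)), "CP"]

    pruned_tree <- prune(full_tree, cp = chosen_cp)
    plotcp(full_tree)   # cross-validated error vs. cp, analogous to Figure 2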

4.2 Fertility decision tree

We randomly divide our data set into two subsets, a training set (50%) and a test set (50%), and use the R package rpart. Figure 3 shows the tree we obtain when training the decision tree algorithm described above on our training set.

Figure 3: Decision tree algorithm result

We obtain 6 bags of data. Each leaf displays the probability of the outcome being positive, i.e. of the man being infertile. For example, according to our data and the model used, a man older than 29.8 years, who experienced an accident or trauma in the past, and who sits more than 6.2 hours per day has a 54% chance of being infertile. While the significance of these results can be further debated and analyzed, the analysis at least suggests that fertility is not an intrinsic, permanent characteristic: life events such as a car accident can potentially alter male fertility.

We repeat the tree generation process a hundred times and always obtain the same decision features, even though the training set is randomly drawn from 50% of our total data set. Interestingly, compared to logistic regression, the decision tree algorithm gives us a different hierarchy of features: age, traumas, surgery and time spent sitting. However, just as with logistic regression, this should be taken with caution given the overall results of the models. We can note, for example, that our tree tells us that men younger than 27.3 years old are more likely to be infertile than men between 27.3 and 29.8, which is definitely a surprising result that could be questioned.

We then test our model on the test set and compute the confusion matrix of the test predictions. A confusion matrix is particularly relevant for analyzing the results since the number of positive examples is fairly small compared to the total data set size (12 out of 100). We can see that our tree algorithm only predicts half of the positive examples correctly, while more than 97% of the negative examples are predicted correctly. We believe our algorithm does poorly at predicting infertility mainly because the initial database is extremely small and biased. The next step we are considering to improve this algorithm is to implement bootstrapping on the data set.
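
A sketch of this evaluation is shown below, again with rpart and the data frame from section 2.2. Because of the random 50/50 split, the exact tree of Figure 3 and our confusion matrix will not be reproduced identically; the seed is an arbitrary choice.

    # Sketch of the 50/50 split, tree training and test-set confusion matrix.
    library(rpart)

    set.seed(229)   # arbitrary seed, only for reproducibility of the split
    fert <- fertility
    fert$label <- factor(fert$label, levels = c(0, 1), labels = c("normal", "altered"))

    train_idx <- sample(nrow(fert), size = nrow(fert) / 2)
    train <- fert[train_idx, ]
    test  <- fert[-train_idx, ]

    tree <- rpart(label ~ ., data = train, method = "class")

    # Confusion matrix of the predictions on the held-out test set
    pred <- predict(tree, newdata = test, type = "class")
    table(predicted = pred, actual = test$label)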

5 Further discussion & future work

Looking at the results of both the SVM and the decision tree, we can compare the two models side by side. The main challenge of this project was the very small number of data points in the set (100), as well as the limited number of features (9), and we wanted to see how far we could go under such constraints. The bias of the data set towards normal fertility (as in real life, actually) was a hindrance, which could be addressed with bootstrapping (we could greatly improve our results by collecting more data, but this has its own challenges).

Bootstrapping consists of artificially obtaining a larger data set from the original one. Given our data set of 100 examples, 40 bags of 34 data points are created (i.e. one third of the size of the original data set). Each bag of 34 data points includes:

- data points (50%) randomly drawn from the set of normal samples in the original data set
- data points (50%) randomly drawn from the set of altered samples in the original data set

A decision tree is then trained on each of these 40 bags of data. On the test set, each tree gives us a prediction, and these predictions are averaged to obtain the final prediction. A sketch of this procedure is given at the end of this section.

Alternatively, as we mentioned in the SVM section, we expect that additional features would enable us to separate the data and would thus greatly improve the value of the data set. In fact, the University of Alicante group that published the data also had access to more features for their own analysis [3]. From discussions with the class teaching team (Dave Deriso, project TA) and visitors at the poster session, it appears that random forests or boosting our tree would be other interesting tree-based variants to try. Finally, the same paper uses CNNs (Convolutional Neural Networks) as a prediction tool, and using deep learning could help improve the accuracy of the prediction.

[3] Predicting seminal quality with artificial intelligence methods, David Gil, Jose Luis Girela, Joaquin De Juan, M. Jose Gomez-Torres, Magnus Johnsson (2012)
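
Below is a sketch of the bagging procedure described above, under the same assumptions as the earlier sketches. To keep the test set held out, the balanced bags are drawn (with replacement) from the training half rather than from the full original data set, which is a slight departure from the scheme in the text; the bag and forest sizes follow the text.

    # Sketch of the balanced-bag bagging scheme: 40 bags of 34 examples, half
    # normal and half altered, one tree per bag, averaged probabilities on test.
    library(rpart)

    set.seed(229)
    fert <- fertility
    fert$label <- factor(fert$label, levels = c(0, 1), labels = c("normal", "altered"))
    train_idx <- sample(nrow(fert), size = nrow(fert) / 2)
    train <- fert[train_idx, ]
    test  <- fert[-train_idx, ]

    normal_idx  <- which(train$label == "normal")
    altered_idx <- which(train$label == "altered")

    n_bags   <- 40
    bag_size <- 34
    probs <- matrix(NA, nrow = nrow(test), ncol = n_bags)

    for (b in seq_len(n_bags)) {
      # Balanced bag: half normal, half altered, drawn with replacement
      bag <- train[c(sample(normal_idx,  bag_size / 2, replace = TRUE),
                     sample(altered_idx, bag_size / 2, replace = TRUE)), ]
      fit <- rpart(label ~ ., data = bag, method = "class",
                   control = rpart.control(minsplit = 5))
      probs[, b] <- predict(fit, newdata = test, type = "prob")[, "altered"]
    }

    # Average the per-tree probabilities and threshold at 0.5
    final_pred <- ifelse(rowMeans(probs) > 0.5, "altered", "normal")
    table(predicted = final_pred, actual = test$label)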