Feature-based Comparison of Flow Cytometry Data


by Alice Yue
B.Sc., University of Victoria, 2015

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the School of Computing Science, Faculty of Applied Science

© Alice Yue 2017
SIMON FRASER UNIVERSITY
Summer 2017

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, education, satire, parody, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

Approval

Name: Alice Yue
Degree: Master of Science (Computing Science)
Title: Feature-based Comparison of Flow Cytometry Data
Examining Committee:
Chair: Joseph Peters, Professor
Cedric Chauve, Senior Supervisor, Professor, Department of Mathematics, Simon Fraser University
Ryan Brinkman, Co-supervisor, Professor, Department of Medical Genetics, University of British Columbia
Leonid Chindelevitch, Supervisor, Assistant Professor, School of Computing Science, Simon Fraser University
Kay C. Wiese, Internal Examiner, Associate Professor, School of Computing Science, Simon Fraser University
Date Defended: 8 August 2017

Abstract

Flow cytometry (FCM) bioinformatics is a sub-field of bioinformatics aimed at developing effective and efficient computational tools to store, organize, and analyze high-throughput, high-dimensional FCM data. Flow cytometers are capable of analyzing thousands of cells per second for up to 40 features. These features primarily signal the presence of different proteins on cells in the bloodstream. Flow cytometers therefore contribute large amounts of data to the big biological data paradigm. The data that a flow cytometer outputs from a biological sample is called an FCS file.

The International Mouse Phenotyping Consortium (IMPC) is a collaboration between 23 international institutions and funding organizations. Its aim is to decipher the function of 20,000 mouse genes. The IMPC does so by breeding mice with a certain gene knocked out (KO), cancelling the function of that gene. FCM is then used to measure the immunological changes correlated with this knockout.

Many tools exist to classify FCS files. However, there is a lack of tools for unsupervised clustering of FCS files. One goal of the IMPC is to compare and contrast KO genes, which makes the IMPC a prime motivation for this problem. As such, this thesis outlines a data processing pipeline used to isolate features for each FCS file. We then test the different types of extracted features on a benchmark data set from the FlowCAP-II challenge, containing data from healthy persons and patients with acute myeloid leukemia (AML), and evaluate how well these features separate FCS files of different origin (i.e. healthy vs AML).

Keywords: Bioinformatics; Flow Cytometry; Feature Design

Dedication

To my mother.

Acknowledgements

I would like to express my gratitude to my senior supervisor and co-supervisor, Dr. Cedric Chauve and Dr. Ryan Brinkman. Thank you for the opportunity to learn and solve meaningful problems together, for taking the time for many thought-provoking discussions, and for your most generous support and patience throughout my master's education. I am honoured to work with you, as my supervisors and as life mentors.

I would also like to express my appreciation to my defence committee for your invaluable feedback. Thank you to my supervisory committee member, Dr. Leonid Chindelevitch, for always providing inspiring food for thought during the bioinformatics course and program meetings; to Dr. Kay Wiese for your encouragement and for agreeing to examine my super long MSc thesis; and to Dr. Joseph Peters for being the defence chair and for your patience towards my un-finish-able list of algorithm questions.

Last but not least, thank you to my family, friends, and lab-mates for your companionship and support.

Table of Contents

Approval
Abstract
Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction and Background
  Flow Cytometry (FCM)
  Mass Cytometry
  Flow Cytometry (FCM) Bioinformatics
  Applications
  Motivation and Contributions

2 Flow Cytometry (FCM) Bioinformatics
  FCM Data Processing Pipeline
  Pre-Processing
  Cell Population Identification
  Applications
  FCS file Classification
  Biomarker Cell Population Identification
  Remarks

3 Data and Methods
  Data
  IMPC
  FlowCAP-II AML
  3.2 Pre-Processing
  Compensation
  Transformation
  Quality Control
  Cell Population Identification
  Automated Gating
  Cell Population Enumeration
  Cell Hierarchy
  Cell Count Normalization
  Feature Design
  Absolute Features
  Phenodeviant Features
  Trimmed Features
  Distance Matrix Calculation
  Distance Evaluation Metrics
  Clustering and Classification Methods
  Scoring Metrics

4 Results
  Panels
  External Validation of Classification Results
  Internal and External Validation of Clustering Results
  Internal Validation of Distance Matrices
  AML vs Healthy
  External Validation of Classification Results
  Internal and External Validation for Clustering Results
  Internal Validation of Distance Matrices

5 Discussion
  Features
  Node vs Edge Features
  Phenodeviant vs Absolute Features
  Trimmed vs Un-trimmed Features
  Louvain Clustering Score Exceptions
  Layers
  Node vs Edge Features
  Trimmed vs Un-trimmed Features
  Distance Metrics

6 Conclusion and Perspectives
  6.1 Thesis Overview
  Results and Discussion Summary
  Significance
  Future Prospects
  FCM Data Analysis Pipeline Improvements
  Fine-tuning Phenodeviant Features
  FCS file Cluster Comparison

Bibliography

Appendix A Code

Appendix B Results (Figures)
  B.1 Panels
  B.1.1 Internal and External Validation of Clustering Results
  B.1.2 Internal Validation of Distance Matrices
  B.1.3 Figures by Feature
  B.1.4 Figures by Layer
  B.1.5 Figures by Distance Metric

Appendix C Results (Tables)
  C.1 Panels
  C.1.1 External Validation of Classification Results
  C.1.2 Internal and External Validation of Clustering Results
  C.1.3 Internal Validation of Distance Matrices
  C.2 AML vs Healthy (Average of All Panels)
  C.2.1 External Validation of Classification Results
  C.2.2 Internal and External Validation of Clustering Results
  C.2.3 Internal Validation of Distance Matrices

List of Tables

Table 2.1 Examples of Cell Population Identification Tools categorized into three types: Supervised Classification, Unsupervised Clustering, and Automated Gating
Table 3.1 Co-occurrence True/False Positive/Negative Definition for any two FCS files
Table C.1 External Validation of Classification Results for the FlowCAP-II data set on the Panels variable: K-NN scores
Table C.2 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: Spectral Clustering scores
Table C.3 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: Louvain Clustering scores
Table C.4 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: Agglomerative Clustering scores
Table C.5 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: K-medoid Clustering scores
Table C.6 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores
Table C.7 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Trimmed Absolute Features)
Table C.8 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Trimmed phenodeviant Features)
Table C.9 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Un-trimmed Absolute Features)
Table C.10 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Un-trimmed phenodeviant Features)
Table C.11 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Trimmed Absolute Features)
Table C.12 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Trimmed phenodeviant Features)
Table C.13 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Un-trimmed Absolute Features)
Table C.14 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)
Table C.15 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Trimmed Absolute Features)
Table C.16 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Trimmed phenodeviant Features)
Table C.17 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Un-trimmed Absolute Features)
Table C.18 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)
Table C.19 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Trimmed Absolute Features)
Table C.20 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Trimmed phenodeviant Features)
Table C.21 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Un-trimmed Absolute Features)
Table C.22 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)
Table C.23 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Trimmed Absolute Features)
Table C.24 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Trimmed phenodeviant Features)
Table C.25 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Un-trimmed Absolute Features)
Table C.26 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)
Table C.27 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Trimmed Absolute Features)
Table C.28 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Trimmed phenodeviant Features)
Table C.29 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Un-trimmed Absolute Features)
Table C.30 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Un-trimmed phenodeviant Features)

List of Figures

Figure 1.1 FCM Machinery and How FCM Analyzes Biological Samples (top half of schematic inspired by [53])
Figure 1.2 Applications of FCM Bioinformatics
Figure 1.3 Applications of FCM Bioinformatics on the IMPC
Figure 2.1 Data Processing Pipeline
Figure 3.1 Data Processing Pipeline
Figure 3.2 Sample FlowCAP-II and IMPC FCS files and their Cell Populations Visualized using t-SNE and FlowSOM (FlowCAP-II FCS file cell populations identified using gating strategies [26, 95, 96])
Figure 3.3 FCM Data Processing Pipeline: Pre-processing; Cells of a FCS file are plotted on markers CD5 and CD11b, with the colours representing the density
Figure 3.4 FCM Data Processing Pipeline: Quality Control
Figure 3.5 FCM Data Processing Pipeline: the gating strategy
Figure 3.6 FCM Data Processing Pipeline: Automated gating and its ability to reduce variance caused by residual and unknown variables [1]
Figure 3.7 Cell Hierarchy: A Representation of the FCS file
Figure 3.8 FCM Data Processing Pipeline: Cell Count Normalization (Example 1)
Figure 3.9 FCM Data Processing Pipeline: Cell Count Normalization (Example 2)
Figure 3.10 FCM Data Processing Pipeline: Feature Design (Examples for Absolute Features)
Figure 3.11 FCM Data Processing Pipeline: Feature Design (Examples for phenodeviant Features)
Figure 3.12 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores averaged across All Panels plotted for Layer 7, Manhattan Distance Metric, & All Features (see legend) (Proportion of longest distance edges deleted vs F Measure (see Section )); see Section 4: How result plots are organized for more details on the plot
Figure 4.1 External Validation of Classification Results for the FlowCAP-II data set on the Panels variable: Distance Matrix scores plotted for All Distance Metrics (Features vs F Measure)
Figure 4.2 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores plotted for All Distance Metrics (Layers vs F Measure)
Figure 4.3 External Validation of Classification Results for the FlowCAP-II data set on the Panels variable: Distance Matrix scores plotted for All Distance Metrics (Distance Metrics vs F Measure)
Figure 4.4 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores plotted for All Panels & Distance Metrics (Features vs F Measure)
Figure 4.5 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores plotted for All Panels & Distance Metrics (Layers vs F Measure)
Figure 4.6 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores plotted for All Panels & Distance Metrics (Distance Metrics vs F Measure)
Figure 4.7 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs F Measure)
Figure 4.8 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Median Silhouette Index)
Figure 4.9 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Pearson Gamma)
Figure 4.10 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs F Measure)
Figure 4.11 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Median Silhouette Index)
Figure 4.12 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Pearson Gamma)
Figure 4.13 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs F Measure)
Figure 4.14 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Median Silhouette Index)
Figure 4.15 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Pearson Gamma)
Figure 4.16 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs NCA)
Figure 4.17 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Median Silhouette Index)
Figure 4.18 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Pearson Gamma)
Figure 4.19 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs NCA)
Figure 4.20 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Median Silhouette Index)
Figure 4.21 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Pearson Gamma)
Figure 4.22 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable for All Panels (Distance Matrix vs NCA)
Figure 4.23 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Median Silhouette Index)
Figure 4.24 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Pearson Gamma)
Figure B.1 External Validation of Clustering Results for the Panels variable: K-NN scores for all Distance Metrics (Features vs F Measure)
Figure B.2 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Features vs Median Silhouette Index)
Figure B.3 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Features vs Pearson Gamma)
Figure B.4 External Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Layers vs F Measure)
Figure B.5 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Layers vs Median Silhouette Index)
Figure B.6 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Layers vs Pearson Gamma)
Figure B.7 External Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Distance Metrics vs F Measure)
Figure B.8 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Distance Metrics vs Median Silhouette Index)
Figure B.9 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Distance Metrics vs Pearson Gamma)
Figure B.10 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Features vs NCA)
Figure B.11 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Features vs Median Silhouette Index)
Figure B.12 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Features vs Pearson Gamma)
Figure B.13 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Layers vs NCA)
Figure B.14 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Layers vs Median Silhouette Index)
Figure B.15 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Layers vs Pearson Gamma)
Figure B.16 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Distance Metrics vs NCA)
Figure B.17 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Distance Metrics vs Median Silhouette Index)
Figure B.18 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Distance Metrics vs Pearson Gamma)

Chapter 1

Introduction and Background

The life science community can now generate large amounts of biological data at drastically decreasing costs [68]. This has created an abundance of biological data and a shortage of tools to interpret it, which led to the rise of bioinformatics. Bioinformatics is an interdisciplinary field of research that aims to develop computational tools to mine useful information out of biological data. The biological data this thesis focuses on is flow cytometry (FCM) data.

1.1 Flow Cytometry (FCM)

A cytometer is a high-throughput apparatus capable of simultaneously measuring more than 40 features per single cell, for thousands of cells per second [4, 19, 20]. With their ability to produce big biological data, cytometers have become a routine part of research on and diagnosis of diseases of the immune system, such as leukaemia and lymphoma [106]. As such, we will discuss cytometry in the context of the immune system (i.e. the biological samples processed by FCM here contain different types of immune cells from organs such as the spleen, bone marrow, and blood [54]).

The cytometer was first invented in 1953, paving the way for cell sorting [41, 31]. Cell sorting is a process where one identifies what cell type, or cell population, each cell belongs to based on the proteins it contains, assuming each cell population can be identified by a unique combination of proteins. In other words, a cell population in an FCS file is defined as a group of cells containing the same group or subgroup of proteins.

Flow cytometers detect these proteins via fluorescence. To do so, a biological sample is first mixed with several different types of markers. Each marker attaches to a certain target protein on the cells. Each marker has an associated fluorochrome, which emits fluorescence (light of a certain wavelength, or colour) under stimulation. Hence, the markers, or coloured lights, present or absent on a cell indicate the types of proteins the cell contains. For example, suppose the fluorochrome of marker A emits a yellow fluorescence and marker A only attaches to protein 1. If a solution of marker A is mixed with a sample of cells, then only the cells with protein 1 would emit a yellow fluorescence.

Figure 1.1: FCM machinery and how FCM analyzes biological samples (top half of schematic inspired by [53])

This processed cell sample, suspended in fluid, is then analyzed by the flow cytometer. Once in the machine, the cells are first aligned into a single-file stream by sheath fluid in a component of the machine called the flow cell. After the cells are focused, they pass through a laser one at a time. In reaction to the laser, the markers on the cell emit fluorescence, which is detected by an array of photomultiplier tubes (PMTs), each measuring light within a certain range of wavelengths. Finally, whether or not a marker is present is determined by the fluorescence intensity (FI), the brightness of a detected fluorescence. If marker A's FI value is high on a cell, then that cell has marker A, or is A+ (marker A positive); if marker A's FI value is low on a cell, then that cell does not have marker A, or is A- (marker A negative). Therefore, an alternative definition of a cell population is a group of cells that have similar FI values for a group of markers. At the same time, flow cytometers also detect a cell's physical characteristics, namely its size (forward scatter, FS) and granularity (side scatter, SS).

In summary, for each sample, the machine outputs a file in the Flow Cytometry Standard (FCS) format [100]. This FCS file includes an R × L matrix, where R is the number of cells, L is the number of markers, and each value in the matrix is an FI value. The matrix also includes two or more columns for the physical characteristics of the cells (FS, SS), but for simplicity we will presume the features of a cell to be marker FIs. Usually, a biologist would analyze a single biological sample using different sets of markers, or different panels. This would result in multiple such matrices per sample, one per panel.
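As a minimal sketch of the matrix view described above, the following Python snippet builds a toy R × L matrix of FI values and labels each cell as marker positive or negative by comparing its FI to a cut-off. The marker names, FI values, and cut-offs are hypothetical, and the fixed threshold is a simplification chosen for illustration; it is not how this thesis or any particular tool determines positivity.

```python
# Hypothetical toy FCS matrix: R = 6 cells (rows) by L = 2 markers (columns).
# Each value stands in for a fluorescence intensity (FI); real values would
# come from an FCS file after pre-processing.
markers = ["A", "B"]
fi = [
    [5.1, 0.3],  # high A, low B  -> A+B-
    [4.8, 4.9],  # high A, high B -> A+B+
    [0.2, 0.4],  # low A, low B   -> A-B-
    [0.1, 5.2],  # low A, high B  -> A-B+
    [5.0, 0.2],  # high A, low B  -> A+B-
    [0.3, 0.1],  # low A, low B   -> A-B-
]

# Assumed per-marker cut-offs separating "low" from "high" FI.
thresholds = {"A": 2.5, "B": 2.5}


def is_positive(cell, marker):
    """Return True if the cell's FI for `marker` exceeds that marker's cut-off."""
    return cell[markers.index(marker)] > thresholds[marker]


def count_population(cells, phenotype):
    """Count cells matching a phenotype, e.g. {"A": True, "B": False} for A+B-.

    Markers not mentioned in `phenotype` are ignored, so {"A": True} counts
    every A+ cell regardless of marker B.
    """
    return sum(
        all(is_positive(cell, m) == want for m, want in phenotype.items())
        for cell in cells
    )


a_pos = count_population(fi, {"A": True})
a_pos_b_neg = count_population(fi, {"A": True, "B": False})
print("A+ cells:", a_pos)
print("A+B- cells:", a_pos_b_neg)
```

Under these assumptions, a cell population is simply the set of rows that share the same positive/negative pattern over some subset of markers.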
This thesis will simply refer to a single matrix as an FCS file.

Mass Cytometry

Mass cytometry, a variation of FCM, was first commercialized in 2009 by DVS Sciences. While FCM detects a spectrum of light emitted by the (fluorochromes on the) markers of a cell, mass cytometers use mass spectrometry, a technique that detects the mass-to-charge ratio of ionized chemical species on a cell. These chemical species serve the same function as the markers (i.e. a specific mass-to-charge ratio represents a specific chemical species, which in turn reflects whether or not a certain protein is present on the cell). The mass cytometer outputs its data in the FCS format, so the data can be analyzed using the same computational tools that analyze the outputs of FCM. Theoretically, mass spectrometry is more precise in that a single type of chemical species can only be detected as a single mass-to-charge ratio value, in contrast to the range of wavelengths over which a single (fluorochrome on a) marker can be detected. In addition, it can analyze over 40 such chemical species features per cell [102]. As for its drawbacks, mass cytometry can only analyze hundreds of cells per second in practice, as opposed to a theoretical one thousand cells per second [69].

1.2 Flow Cytometry (FCM) Bioinformatics

Given the array of FCM and mass cytometry improvements, FCM bioinformatics came into the research scene as a new sub-field of bioinformatics in the early 2000s [89, 74]. FCM bioinformatics is focused on computationally storing, organizing, and analyzing high-dimensional FCS files. Recent advances in FCS file standards [100], dissemination routes [99], analytical platforms [98, 23], and benchmark data sets [3, 5, 43] have driven the development of an increasing number of automated tools to complement, and possibly replace, manual FCS file analysis. Another large driver of automated tool development is the variability that arises through manual analysis [15]. Manual analysis is the current norm in the FCM community, leaving room for human error, subjectivity, and bias [89, 51, 74, 65]. The FlowCAP initiative reports that analysis done manually can have up to 94% more variability than analysis done computationally [37]. Manual analysis is also inefficient, given the large amount of time and labour needed to go through each FCS file. Together, these factors are driving a trend in the FCM community to move from manual to automated analysis of FCS files.

Applications

FCM bioinformatics serves to extract information for two application settings: clinical and research [85, 74]. Examples of these can be seen in Figure.

Clinical Diagnosis Applications

In clinical environments, a typical use case for FCM data is to differentiate between healthy persons and diseased persons at different stages. In other words, the FCS files need to be classified into different classes, such as diseased or healthy, using a classifier. This provides key information for diagnosis. Examples of tools used to do so are provided in Section.

Clinical Research Applications

In research settings, one goal would be to create those classification models.
A standard scenario is one in which the FCS files have already been labelled with classes. For the purposes of this thesis, we will call these classes control and experiment(s). When we do know the corresponding effects (e.g. control FCS files are from healthy persons and experiment FCS files are from diseased persons), the goals are to:

1. Use the FCS files as input to supervised classification model(s) to create a classifier that can differentiate between FCS files of different classes, and
2. Identify the main features, or bio-markers, that differentiate between those classes. For example, once the cells in each FCS file have been sorted into their appropriate cell populations, if a certain cell population has significantly more cells in the FCS file from a sick patient than in one from a healthy individual, then that cell population can act as a bio-marker indicating whether or not a person is sick.

Figure 1.2: Applications of FCM Bioinformatics

Examples of tools used to identify these bio-markers are in Section.

Exploratory Research Applications

When we do not know what effects the experiment has on the immune cells, we need to conduct biologically meaningful unsupervised clustering of FCS files to understand whether the experiment(s) has:

1. No effect (the files are the same as control FCS files),
2. Significant effect(s), or
3. Significant effect(s) that are similar to or different from those of other experiments, along with the bio-markers that correlate with these effects.

The effect referred to in this thesis is a change in the immune system, usually in response to an experiment. To generalize the terminology for all data sets, an experiment is defined as a stimulus or a change in a health condition (e.g. a knocked out gene). To the best of our knowledge, there has not been a formal exploration of this problem; thus, this problem serves as the motivation for this thesis.

1.3 Motivation and Contributions

More formally, the motivation and problem of this thesis are as follows. Given a set of FCS files, extract features based on properties of the immune cell populations within those files. These features should then produce meaningful distances between the FCS files for effective unsupervised clustering.

The motivating scenario for this problem comes from a research initiative, the International Mouse Phenotyping Consortium (IMPC). The IMPC [16] is a collaborative effort to decipher the function behind a target list of 20,000 genes.
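The problem statement above (features, then distances, then unsupervised clustering) can be illustrated with a small, self-contained sketch. The file names, feature values, choice of Manhattan distance, and single-linkage merging rule are all assumptions made for illustration; they are not presented as the pipeline evaluated in this thesis.

```python
from itertools import combinations

# Hypothetical feature vectors, one per FCS file, e.g. the proportion of
# cells falling into each of three cell populations.
features = {
    "control_1": [0.50, 0.30, 0.20],
    "control_2": [0.52, 0.28, 0.20],
    "experiment_1": [0.20, 0.30, 0.50],
    "experiment_2": [0.18, 0.33, 0.49],
}


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


names = sorted(features)

# Pairwise distances between FCS files, keyed by the unordered pair of names.
dist = {
    frozenset((a, b)): manhattan(features[a], features[b])
    for a, b in combinations(names, 2)
}


def single_linkage(names, dist, k):
    """Merge the two closest clusters repeatedly until only k clusters remain.

    The distance between two clusters is the smallest distance between any
    pair of members (single linkage).
    """
    clusters = [{n} for n in names]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist[frozenset((a, b))]
                for a in clusters[ij[0]]
                for b in clusters[ij[1]]
            ),
        )
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


clusters = single_linkage(names, dist, k=2)
for cluster in clusters:
    print(sorted(cluster))
```

In this toy setting, no class labels are used at any point: the grouping depends only on how the chosen features place the files relative to one another, which is why feature design matters for the clustering outcome.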
IMPC is doing so by raising colonies of genetically identical inbred lab mice, setting aside some as controls (wildtype WT) and the rest as experiments (knockout; KO). Knockout refers to the fact that the experimental mice have had a certain gene knocked out, or had its function cancelled out. Among the KO mice, they are further divided into groups (i.e. the mice in one group all have the same gene knocked out, the mice in another group would have another gene knocked out, and so on). All the mice are then raised for 16 weeks before they are euthanized. One assay obtained after their lifetime is FCM, as the immune cells processed through FCM 6

Figure 1.3: Applications of FCM Bioinformatics on the IMPC

are able to reflect the state of the mice's immune system, or their immunophenotype. This is the first time FCS files are being generated on such a large, industrial scale: upon completion of the project, a total of 77,000 FCS files will have been generated in 17 laboratories worldwide over more than 5 years. The challenge in the IMPC data set is that it is unknown what function many of these genes serve in the immune system. Therefore, one cannot assume that these FCS files belong to known classes; many genes could in fact cause no changes to the immune system at all. We must therefore rely only on the cell features of each FCS file to understand whether or not there are actually any differences between two files from different experiments. If there are, another challenge is to isolate the features associated with these differences. See an example of this in Figure 1.3.

To answer these questions, this thesis is organized as follows.

1. Chapter 2 explains the different steps involved in the FCM data analysis pipeline and the different tools that have been developed for them.
2. Chapter 3 introduces the data processing pipeline we utilized.
3. We also elaborate on how we analyze each FCS file as a cell hierarchy as defined in [75], a representation of the FCS file in the form of a structured graph incorporating all possible cell populations and the relations between them. We hypothesize that using the cell hierarchy representation of a FCS file allows us to mine information from the files such that they can be clustered more accurately according to the available ground truth class labels of the FCS files.
4. From this, we go into the design of multiple features extracted using the cell hierarchy, and how we derive a distance matrix for each feature describing the distance between each pair of FCS files.
5. We then evaluate how well the distance matrices derived from each feature facilitate clustering of the FCS files into their respective classes.
6. Finally, Chapters 4 & 5 present the evaluation results, and
7. Chapter 6 provides concluding remarks and perspectives.

Additionally, this thesis makes several novel contributions:

1. We propose a new problem: clustering FCS files in a completely unsupervised fashion (Sections and 1.3),
2. We design novel features from each FCS file that incorporate properties of the cell hierarchy (Section 3.5), and finally
3. We show the efficacy of these features in facilitating accurate clustering of benchmark FCS files of known classes (Chapter 4).

This work is a preliminary step in furthering our capacity to analyze which genes are functionally similar and may be involved in similar processes through interactions. Understanding the relations between gene functions is significant because it allows us to understand the immunophenotypic bio-markers and thus model the effects of a certain KO gene. Given the insight these models provide into the function of each mouse gene, they give us opportunities to investigate human immunodeficiency diseases of unknown origin.

Chapter 2

Flow Cytometry (FCM) Bioinformatics

2.1 FCM Data Processing Pipeline

This chapter briefly introduces the research areas and open questions in FCM bioinformatics that are involved in the FCM data analysis pipeline shown in Figure 2.1. This pipeline consists of steps to pre-process and identify the cell populations in a FCS file, which can in turn be put to use in real-world applications as shown in Figures 1.2 and .

Pre-Processing

In FCM, analysts often opt to first pre-process the FCS files to simplify downstream analysis and amplify data signals. Broadly, the pre-processing stage involves compensation, transformation, quality control, and normalization [113]. As a reminder, the input here is a FCS file, which includes an R × L matrix where R is the number of cells, L is the number of markers, and the values are the FI of each marker.

Compensation

Compensation is a standard procedure to un-mix fluorescence from different markers [87, 11]. As the fluorescence emitted by different (fluorochromes on) markers can be detected as light belonging to different wavelength ranges, the machine may mistakenly categorize these markers. For example, marker A may be defined by a fluorochrome that emits fluorescence in a range of wavelengths perceived as the colour yellow. However, if its wavelength is at the lower end of the yellow range, it may be mistakenly detected as being within the range of wavelengths reflecting the colour orange, or within the range of wavelengths that the fluorochrome on marker B would emit. The marker A signals that are mistakenly detected as marker B are false positive observations called spill-over. These

Figure 2.1: Data Processing Pipeline

can be accounted for by directly subtracting all spill-overs from the marker B observations [11]. Figure 3.3 shows the effect of compensation on cells through a 2D scatterplot.

Transformation

Many forms of transformation have been proposed for better analysis of FI values in FCS files. Transformation attempts to mitigate several challenges in analyzing FCM data. These challenges include, but are not limited to, the following [38].

1. FI values associated with different markers can be on immensely different scales, making it hard to systematically analyze cell populations across markers. For example, cell population A+ may show FI values 10,000 larger than those of A-. However, cell population B+ may show FI values just 1,000 larger than those of B-.
2. Outlier events, or rows in the FCS file matrix, occur frequently. These may be debris, mis-handled cells, air bubbles etc.

As FI values are in logarithmic scale [92], log transform is traditionally applied to FCS files in order to expand the FI values into linear scale such that the algorithms used downstream in the pipeline are better able to recognize the data signals present in each file. Commonly used transformations in FCM data analysis include log transform, arcsinh, logicle, and its generalization, the biexponential transform [38]. Furthermore, [38] proposes the use of maximum likelihood to optimize parameters for these transformations using the flowtrans package. This thesis expands on logicle transformation and how its parameters are chosen in Section .

Quality Control

Another pre-processing step is to control for quality by checking for and removing technical anomalies in FCS files. As FCM measures cells in a single-file fashion, a steady flow is essential for accurate measurements. However, artifacts, such as air bubbles, clogging, insufficient flow, change in speed, or unwanted particles, can disturb fluorescence signals.
Although the flow may quickly return to normal, the anomalous signals need to be removed. FlowCore [47] provides convenient functions to visually monitor these anomalies over time for each file. Additional packages such as FlowQ [58] and FlowClean [40] are able to remove these time points by using change-point analyses over a linear time trajectory to find time sequences with outlying mean and variance.

Normalization

Finally, it may be necessary to normalize, and thus remove batch effects, in the pre-processing phase. Batch effects are biological and technical artifacts that arise when FCS files are produced at different facilities, come from different biological samples, are made on machines

using different settings etc. Tools to remove these unwanted effects include per-marker distribution normalization via flowstats [45], GaussNorm [46], and FdaNorm [36]. These tools work to align FI distribution peaks that are found to be in common between the FCS files. Other methods such as variance stabilization have also been proposed [9], taking inspiration from traditional statistical analysis and RNAseq data normalization. In addition, many cell population identification tools, such as FLOCK [80], use their own normalization methods to maximize the data signals they are able to mine. Building custom normalization methods into downstream analysis provides analysts with a simple one-step solution. For instance, the user may input the raw FCS file and directly receive the cell populations within the file as output, rather than having to work with multiple tools and packages.

Cell Population Identification

Identifying cell populations in a FCS file means sorting the cells in the file into their different cell populations (as defined in Section 1.1). It is one of the most important steps in the data analysis pipeline, with over 50 tools developed for this purpose [57]. Taking the pre-processed FCS file (R × L matrix) as input, the tools that can identify cell populations, or clusters of cells, are broadly categorized into three types (see Table 2.1):

1. Supervised classification,
2. Unsupervised clustering, and
3. Automated gating, termed a parent type of unsupervised clustering of cells in FCM bioinformatics.

Supervised Classification

Supervised classification of cell populations is an open question with various challenges. The formal problem is: given training FCS files with all of their cells labelled with a ground truth cell population class label, train a classifier such that it can classify cells of unknown FCS files into their respective cell populations as accurately as possible.
The first challenge is a lack of standard panels, that is, standard sets of markers used to analyze FCS files. Currently, most panels are tested and created per experiment per laboratory. With different panels, the FI values and the unique combination of markers that define a certain cell population can vary greatly. To overcome this, organizations are starting to create standard panels (e.g. for leukaemic diagnosis [106], the multi-national project IMPC [1], and the Human Immunophenotyping Consortium (HIPC) [64]). Nevertheless, panel designs, especially in experimental settings, are still largely up to the biologist conducting the experiment and the biological sample at hand. Thus, aside from past experiments in a local laboratory, no public sets of training files are available for specific experiments.

Table 2.1: Examples of Cell Population Identification Tools categorized into three types: Supervised Classification, Unsupervised Clustering, and Automated Gating

Type | Tools
Supervised Classification | DeepCyTOF [60]
Unsupervised Clustering (Mixed-model-based) | FlowGM [24], HDPGMM [28], ImmunoCLUST [97], SWIFT [72], FLAME [79], FlowClust [62]
Unsupervised Clustering (Density-based) | FLOCK [80], FlowPeaks [42], ACCENSE [94], ClusterX [21], DensVM [13]
Unsupervised Clustering (Graph-based) | X-shift [90], PhenoGraph [59], CLARA [101], SamSPECTRAL [114]
Unsupervised Clustering (Bayesian-based) | BayesFlow [55], ASPIRE [32]
Automated Gating | FlowDensity [66], OpenCyto [35]

A second challenge is how easily FCS files can vary depending on external factors such as batch effects. As mentioned previously, FI values are largely influenced by things such as machine settings and sample handling. Tools exist to normalize small systematic changes (as introduced in Section ). Alignment, sensitivity, and fluidic quality control beads can also be used to normalize mean FIs and standardize machine settings. However, it is still difficult to normalize large changes (e.g. the testing FCS files could be analyzed on different flow cytometers, in a different laboratory). The variability between FCS files can pose a problem when training a classification model, as the model tends to overfit on the few training files provided. Tools such as DeepCyTOF [60] (CyTOF being mass cytometers by DVS Sciences) are starting to emerge in an attempt to mitigate these challenges. DeepCyTOF is a deep learning solution that first uses a deep learning network model to unify the distributions of all FCS files (an alternative methodological reference for this step can be found in [93]). It then trains another deep model on a single manually gated FCS file, which then classifies the cells from the FCS files whose distributions were modified by the first model.

Unsupervised Clustering

Another approach to cell population identification is unsupervised clustering of cells. Again, unsupervised clustering clusters the cells in a FCS file into cell populations.

Clustering Methodologies

Mixed-model-based Clustering: One clustering procedure in FCM is to fit the cells to some type of mixture model and then proceed with clustering. FlowGM [24], HDPGMM [28], ImmunoCLUST [97], SWIFT [72], FLAME [79], and FlowClust [62] all model the FI values of cells as a variant of the Gaussian mixture model (GMM). They then use expectation maximization (EM) to optimize the GMM to generate initial cell populations.
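To make the GMM and EM idea above concrete, the following is a minimal sketch in Python, assuming a single marker, a fixed number of components, and user-supplied starting means. The function names (`normal_pdf`, `fit_gmm_1d`) and the toy FI values are hypothetical; none of the cited tools is claimed to work exactly this way, and each of them uses its own multivariate variant.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, means, n_iter=50, min_var=1e-6):
    """Fit a 1D Gaussian mixture with EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)  # avoid collapsing to zero variance
    # Hard assignment: each cell joins its most responsible component.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Toy transformed FI values for one marker (assumed data):
# a "negative" population near 0.25 and a "positive" one near 3.1.
fi_values = [0.1, 0.3, 0.2, 0.4, 0.25, 3.1, 2.9, 3.3, 3.0, 3.2]
weights, means, variances, labels = fit_gmm_1d(fi_values, means=[0.0, 4.0])
```

Each resulting component plays the role of an initial cell population; `labels` assigns every cell to the component with the highest responsibility.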
Density-based Clustering: Another property used is the density distribution of the FI on each marker. FLOCK [80] and FlowPeaks [42] both directly use those density distributions on the original data set to define the shape of the clusters. Misty Mountain [103] also looks at the density distributions, but it does so by shedding down density contours. As these density-based procedures, such as density contouring and local maxima searching, can become intractable in higher dimensions, density-based tools such as ACCENSE [94], ClusterX [21], and DensVM [13] pre-process the FCS file by lowering the number of dimensions using the t-SNE (t-distributed Stochastic Neighbor Embedding) [63] projection of the original data. This pre-processing allows for faster and more accurate density-based clustering results on specific data sets [110].

Graph-based Clustering: X-shift [90] ties in both density-based clustering and graph-based clustering. It first defines the clusters via density local maxima on a weighted K-nearest neighbour density estimation (K-NN DE) and then connects those clusters on a graph. The clusters that are close together on this graph, the communities, are subsequently merged. Similarly, PhenoGraph [59] also clusters the cells based on communities, except that it skips the initial clustering phase, directly lays out each cell as a node on a K-NN graph, and then connects the cells. CLARA [101] is another method that represents each cell as a node on a graph. It utilizes a force-directed weighted graph with edge weights based on cosine distances between the median FI of cells. It then clusters these cells using scaffold maps. In addition, spectral clustering of cells has been implemented as a tool called SamSPECTRAL [114].

Bayesian-based Clustering: A few methods have proposed the use of Bayesian statistics in cell clustering. For instance, BayesFlow [55] uses a Bayesian hierarchical model, with or without explicit priors, to create many small cell clusters, which are then merged. Similarly, ASPIRE [32] uses a non-parametric Bayesian approach to cluster cells.

Rare Cell Population Identification

Rare cell populations are often missed during clustering because of how few cells they contain. This is because many methodologies pass rare cell populations off as outliers, debris, or as a part of a larger cell population [74]. Splitting and then merging cell populations as a refinement procedure to obtain accurate rare cell populations is a tactic seen in many methodologies. One such method is SWIFT [72]. SWIFT uses an iterative approach where it takes a user input k as the number of clusters it should expect.
It samples cells from the original FCS file to reduce the input size, and it does so repeatedly to avoid missing rare cell populations. It then models these sampled cells as a GMM and optimizes this model until all clusters are fixed and a better parameter k is generated. It further refines this parameter by multi-modality splitting and agglomerative merging of the said clusters. Finally, it uses the refined k and conducts another round of soft clustering, extracting rare cell populations via a Hierarchical Dirichlet process model similar to HDPGMM [28]. Other methods using this strategy of splitting and merging cell populations include FlowClust/FlowMerge [34], BayesFlow [55], and ImmunoCLUST [97].

Cell Population Matching and Labelling

After the cells in each FCS file have been clustered into their respective cell populations, it may be necessary to match and then label these cell populations across several FCS files to make them comparable. Clustering tools such as HDPGMM [28], ASPIRE [32], FlowGM [24], ImmunoCLUST [97], and SWIFT [72] simultaneously cluster cells and match them across FCS files. Other tools, such as flowmatch [10], map cell populations across FCS files using the mixed edge cover algorithm after clustering has already been done for all the files.

Automated Gating

Gating is the process of manually identifying cell populations within a FCS file. As it is difficult for the human eye to analyze more than 3 dimensions at once, the cells are drawn out on multiple 2D scatterplots, analyzed on two markers at a time. The order in which these markers are analyzed is laid out in an instruction manual called the gating strategy. In manual gating, a human expert follows the gating strategy, lays out the cells on the instructed scatterplots, and gates the cells. The verb gate here refers to the drawing of borders around regions of cells plotted on the 2D scatterplots. These borders encircle cells that belong to the same cell population (i.e. cells with similar FI values). Automated gating utilizes this expert-created gating strategy and gates on the same 2D scatterplots as one would manually. But instead of a human drawing borders around target cell populations, automated gating uses FI density distribution patterns to find those borders. Tools created for this category include FlowDensity [66] and OpenCyto [35]; a more detailed explanation of gating and FlowDensity is in Section .

Cell Population Visualization

Techniques have also been proposed to visualize the distribution of cells in a FCS file by reducing the dimensionality down to a human-analyzable 2 or 3 dimensions.
Traditional dimensionality reduction methods used include PCA [111] and t-SNE [63] (as a part of viSNE [8] and One-SENSE [25]). Another way to visualize these cells is to first cluster the cells into cell populations and then arrange those cell populations on a 2D surface. Tools that do so include SPADE [81] and FlowSOM [107]. SPADE agglomeratively clusters a downsampled set of cells and then organizes them as a minimum spanning tree (MST), while FlowSOM uses self-organizing maps to organize the orientation of the cells, with options to display clusters of cells connected in a MST. Clustering tools such as PhenoGraph [59] and CLARA [101], and post-clustering analysis tools such as the flowtype/rchyoptimyx pipeline [75] and FloReMi [108], also have visualization capabilities. Figure 3.2 in Section 3.1 shows examples of visualization outputs from t-SNE and FlowSOM.

2.2 Applications

This section goes over the current state of the literature aimed at applying FCM data to the application scenarios described in Section and in Figure .

FCS file Classification

A FCS file classifier (as used in Section and as described in Section ) can be trained on raw FCS files, or on processed FCS files and their cell populations found by tools from Section . An example tool of the former is Team21 [5]. Tools belonging to the latter type of classifier include the FlowCore/FlowStats statistical pipeline [45] and flowmatch [10]. FlowMatch creates a hierarchical model by agglomeratively meta-clustering, or matching, the cell populations across the training FCS files of a given class. This step outputs a classifier containing one hierarchical model per class. When given a processed FCS file and its cell populations, flowmatch calculates a similarity index between it and the cell population models of the different classes. FlowMatch then labels this FCS file with the same class as the model it is most similar to. Most, if not all, classification tools follow this two-step process of first creating a classifier and then classifying files. The difference between the tools is the model that is used to define the classifier.

Biomarker Cell Population Identification

Not only do we want to classify the FCS files, we also need to understand the bio-markers, or cell populations, that correlate with and signal a difference between the different classes of files (as described in Section ). In other words, given that the cell counts of each cell population represent the features of a FCS file, the bio-markers are those cell populations whose cell counts are significantly different between files of different classes. The goal of feature selection here is to find those populations. This step can also be done after cell populations have been identified.
Pipelines that have integrated this process include SamSPECTRAL [114], FloReMi [108], gem/gann [105], the flowtype/rchyoptimyx pipeline, the flowtype/fealect (Feature Selection for Sample Classification) pipeline [75], Citrus (hierarchical clustering) [17], and COMPASS [61]. The latter four find rare cell populations or possible bio-markers by first over-clustering the cells. They then pick out and/or merge the cell populations whose features correlate with FCS file class labels and highlight them as bio-markers. Another tactic is to use survival or competitive-based models, where cell populations that do not correlate with any class labels are eliminated from the list of candidate bio-markers. Methodologies that incorporate such a model include the previously mentioned COMPASS [61], Competitive SWIFT [83], and the flowclust/survival-model pipeline [12].

2.3 Remarks

FCM bioinformatics is a growing sub-field of bioinformatics. It caters to the needs of, and answers questions posed by, the life science community in an automated and more efficient manner. The research areas in this field are focused on creating tools that come together in a FCM data analysis pipeline. These tools pre-process FCS files and identify the cell populations defining the immunophenotype of each FCS file. There are also tools available for conducting analyses on those results to provide biologically meaningful insight into the immunophenotype of the FCS file(s) at hand. Chapter 3 describes one such pipeline, whose purpose is to extract informative features from those immunophenotypes.

Chapter 3

Data and Methods

This chapter covers the implementation of the generic data processing pipeline mentioned in Chapter 2. This pipeline is used to process the IMPC data set from [1] and is displayed in Figure 3.1. We also apply a similar pipeline to process a benchmark data set from FlowCAP-II [5]. We then give a definition of the cell hierarchy [75], and elaborate on how we use it to extract features from each processed FCS file. Afterwards, we describe how a distance matrix and a clustering of FCS files are derived per feature. These are then used to evaluate how well each feature can facilitate clustering of FCS files into informative clusters. It is also important to keep in mind the ultimate goal of extracting informative features, and subsequently clusters of FCS files: our overarching motivation is to identify KO genes from IMPC that are similar to or different from each other, which may indicate possible gene interactions. All of the tools mentioned are freely available on the Bioconductor [43] platform. All methods described are implemented in the R language.

3.1 Data

The two data sets experimented on are from FlowCAP-II (Critical Assessment of Population Identification) [5] and IMPC (from data generated in the Sanger Centre [1]).

IMPC

The data processing methodology prior to feature extraction is put together for the IMPC data set from the Sanger Centre [1], on standard IMPC panels. The example images displayed in this thesis contain FCS files analyzed on Panel 2 from biological samples extracted from the mouse's spleen. Panel 2 contains 10 markers. In total, there are 2506 FCS files, one FCS file per mouse, each file having about 300,000 cells.

Figure 3.1: Data Processing Pipeline

Figure 3.2: Sample FlowCAP-II and IMPC FCS files and their Cell Populations Visualized using t-SNE and FlowSOM (FlowCAP-II FCS file cell populations identified using gating strategies [26, 95, 96])

Here, we define the term variable as an attribute of the biological sample that a FCS file is analyzed for. The main variables influencing the FCS files are:

1. Gene: Each FCS file is from a mouse that had a single gene (or no gene) knocked out. 564 FCS files are from control WT mice, and 1942 files are from experiment mice with different genes knocked out (KO). Each KO gene has at most 6 FCS files, usually distributed over different days and genders.
2. Date: The initial set of KO and WT FCS files used in this thesis was created over multiple days; note that there are days on which KO FCS files were created but WT FCS files were not.
3. Gender: There is clear evidence of gender dimorphism in the mouse's immune system [1].
4. Centre of FCS file creation: Files generated at different centres can be drastically different depending on how the mice are raised, the time zone, how the biological samples are transported and preserved etc. In fact, the centre of file creation is the largest source of variability between FCS files.

The variable that we are most interested in is gene. The problem here is to cluster the FCS files into their respective KO gene or WT clusters, with the exception that FCS files from mice whose KO gene has no effect on the immune system should be clustered with the FCS files from WT mice. As such, we obtain a multi-cluster clustering problem. In this thesis, we do not include results for the IMPC data set beyond the pre-processing and cell population identification pipeline. What makes evaluating feature designs here more difficult is that we do not know what effects, if any, each KO gene has on the immune system. Therefore, we do not know how many real clusters there should be. Even if we did know how many clusters there are, we would not be certain of how homogeneous and how well separated the clusters should be.
Our goal for IMPC, however, is not necessarily to obtain high clustering or classification accuracy against a given ground truth. Rather, we simply want to obtain a distance matrix that can be used to get an idea of how KO genes relate to each other. Nevertheless, in this thesis, we first want to ensure that our pipeline can obtain reliable distances before it is applied to the IMPC data set. In the following sections, we refer to FCS files that differ on one of these variables as FCS files of different classes. For example, if we are referring to the gender variable, FCS files from female mice are of one class and FCS files from male mice are of another class. Furthermore, we refer to the WT and KO FCS files as control and experiment FCS files respectively.
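As a rough illustration of what such a distance matrix looks like, the sketch below computes pairwise distances between a few FCS files from hypothetical per-file feature vectors. Everything in it is an assumption made for illustration: the file names, the use of cell population proportions as features, and the Manhattan distance. The features and distances actually evaluated in this thesis are defined later in this chapter and are not reproduced here.

```python
def manhattan(u, v):
    """Sum of absolute differences between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_matrix(features):
    """Symmetric matrix of pairwise distances between FCS file feature vectors."""
    names = sorted(features)
    matrix = [[manhattan(features[a], features[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical proportions of three cell populations in four FCS files.
features = {
    "WT_1": [0.50, 0.30, 0.20],
    "WT_2": [0.52, 0.28, 0.20],
    "KO_geneA": [0.20, 0.30, 0.50],
    "KO_geneB": [0.51, 0.29, 0.20],
}
names, matrix = distance_matrix(features)
```

In this toy input, the file from the hypothetical gene B knockout lies close to the WT files (the situation of a KO gene with no effect), while the gene A knockout lies far from them.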

3.1.2 FlowCAP-II AML

The FlowCAP-II AML data set [5] is the benchmark data set on which we extract and test features. It is available on FlowRepository [99]. The data processing pipeline used on FlowCAP-II consists of the same steps as the one used on the IMPC data set. Within the data set, 8 files are extracted per human individual: one acts as a control (which is not considered in this thesis) and the other 7 are analyzed on different panels, mixed with different sets of 5 markers and FS/SS (note: referenced papers may also refer to these panels as "tubes"). The FlowCAP-II data set contains a total of 2,513 FCS files from 43 acute myeloid leukemia (AML) positive patients and 316 healthy individuals. Each file consists of approximately 60,000 cells. In summary, the ground truth variables this data set contains are:

1. Panels: the different panels on which the files are analyzed. These are not comparable and are thus different from each other.
2. AML positive vs healthy individuals: the FCS files from patients with and without AML should be different from each other. An important note is that AML positive patients have a larger CD34+ cell population than healthy individuals. CD34 is a marker usually applied to stem cells, which give rise to all cells within the immune system, and is expressed on AML positive blast cells [109].

This provides us with the problem of generating features that facilitate accurate clustering on these two variables. Again, the FCS files of different types (i.e. analyzed on different panels, or from an AML positive or healthy individual) are referred to as FCS files of different classes. For consistency, we refer to the healthy individuals' files and the AML positive patients' files as control and experiment FCS files respectively.

3.2 Pre-Processing

Before the FCS files can be analyzed, several pre-processing steps are performed. For these steps, the input is a FCS file, or the R × L matrix containing raw FI values.
The output is of the same dimensions, but with pre-processed FI values.

Compensation

Compensation is a standard procedure that occurs as a first step in the data analysis pipeline, previously described in Section . As shown in Figure 3.3 between steps 2 and 3, compensation helps to un-mix cells whose markers may have been detected incorrectly.
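The un-mixing step can be sketched for two markers as follows. This is a minimal illustration in Python, not the routine used in the pipeline. It assumes a spill-over matrix whose entry (i, j) is the fraction of marker i's signal recorded in marker j's detector, so that compensated values are obtained by multiplying each observed row by the inverse of that matrix. The matrix values, cell values, and function names are hypothetical.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("spill-over matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def compensate(cells, spill):
    """Un-mix observed FI values: each row of `cells` is multiplied by inverse(spill)."""
    inv = invert_2x2(spill)
    return [
        [row[0] * inv[0][j] + row[1] * inv[1][j] for j in range(2)]
        for row in cells
    ]


# Hypothetical spill-over: 10% of marker A's signal lands in marker B's detector.
spill = [[1.0, 0.1],
         [0.0, 1.0]]
true_cells = [[1000.0, 0.0], [0.0, 500.0], [800.0, 200.0]]
# Simulate what the detectors would record (true signal mixed by `spill`).
observed = [
    [row[0] * spill[0][j] + row[1] * spill[1][j] for j in range(2)]
    for row in true_cells
]
compensated = compensate(observed, spill)
```

With this toy matrix, a cell carrying only marker A is recorded with a false marker B signal of 100, which compensation removes again.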

Figure 3.3: FCM Data Processing Pipeline: Pre-processing; Cells of a FCS file are plotted on markers CD5 and CD11b, with the colours representing the density

3.2.2 Transformation

After the FCS file has been compensated, the cells with maximum or negative FS and SS values are first removed. This is because cells with abnormally small or large FS and SS values can be interpreted as debris or large non-biological particles. Data transformation is a procedure that changes the distribution of the data (FI values). This is done to allow downstream analysis tools to detect signals that may otherwise have been hidden. In this pipeline, we use the logicle transform [76], whose effects can be seen in steps 1 and 2 of Figure 3.3. This step is done only for the FI values and not for the FS and SS values. The logicle transform is another name for the parametrized biexponential function

$$S(x; a, b, c, d, f) = a e^{bx} - c e^{-dx} + f,$$

a generalization of the hyperbolic sine function

$$\sinh(x) = \frac{e^{x} - e^{-x}}{2},$$

where x is the FI value to be transformed. The logicle transform has nice properties: it spreads data out like a log transform but also maintains a near-linear scale around 0 [76]. In flow cytometry, there are several considerations that can be used to simplify the parametrized biexponential function. Using the FCS files available, the following can be defined:

1. T is the maximum value of the original FI value x to be analyzed.
2. m indicates the upper bound of the transformed FI value (e.g. m = 4 ln(10) means that the transformed data will fall between 0 and 4).
3. w adjusts the strength of linearization around 0.

Plugging these into the biexponential function, we get

$$S(x; w) = T e^{-(m - w)} \left( e^{x - w} - p^{2} e^{-(x - w)/p} + p^{2} - 1 \right),$$

where p can be derived exclusively from parameter w via

$$w = \frac{2p \ln(p)}{p + 1}.$$

For this thesis, w is derived by a global frame (see details in [76]), created by sampling 1,000 cells from each FCS file. The same parameters are then used to transform all FCS files.
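The relation between w and p has no closed-form solution for p, so it has to be solved numerically. The sketch below is a minimal Python illustration, assuming p ≥ 1 (where the right-hand side increases monotonically from 0) and using plain bisection. It then evaluates S(x; w) exactly as written above. The function names and the parameter values T = 262144 and w = 0.5 are hypothetical, and this is not the routine used by any particular package.

```python
import math


def solve_p(w, tol=1e-12):
    """Solve w = 2 p ln(p) / (p + 1) for p >= 1 by bisection."""
    if w < 0:
        raise ValueError("w must be non-negative")

    def residual(p):
        return 2.0 * p * math.log(p) / (p + 1.0) - w

    lo, hi = 1.0, 2.0
    while residual(hi) < 0:  # grow the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def biexponential(x, T, m, w):
    """Evaluate S(x; w) = T e^{-(m-w)} (e^{x-w} - p^2 e^{-(x-w)/p} + p^2 - 1)."""
    p = solve_p(w)
    return T * math.exp(-(m - w)) * (
        math.exp(x - w) - p * p * math.exp(-(x - w) / p) + p * p - 1.0
    )


# Hypothetical parameters: T = 262144, m = 4 ln(10), w = 0.5.
T, m, w = 262144.0, 4.0 * math.log(10.0), 0.5
p = solve_p(w)
at_w = biexponential(w, T, m, w)  # the bracket vanishes at x = w
at_m = biexponential(m, T, m, w)  # close to T for these parameters
```

Two properties of the formula serve as a sanity check: S is exactly zero at x = w, and for these parameter values S at x = m differs from T by well under one percent.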

Figure 3.4: FCM Data Processing Pipeline: Quality Control

3.2.3 Quality Control

As mentioned in Section , quality control is a necessary step to delete anomalies in the FCS files. In this pipeline, we use a cleaning tool developed at the Brinkman Lab, Terry Fox Lab, BC Cancer Research Agency. The tool takes the transformed FCS file as input. In this thesis, we divide the total time used to analyze a FCS file into 500 equally long time intervals, which we term bins. As a precursor step, we identify time bins in which the number of cells the flow cytometer measured is less than 10% of the number of cells measured during the time bin that measured the most cells. These time bins usually occur for a brief amount of time at the beginning, at the end, or during errors in the midst of the flow process. The cells measured during these time bins are removed. For each bin, the 5th, 20th, 50th, 80th, & 95th percentiles are recorded along with the mean fluorescence value and the 2nd and 3rd central moments. All of these values are then tested for outliers separately. Here, an outlier is any bin with a value that is more than 3 standard deviations above or below the value with the maximum frequency. In general, this value is around the mean. For any time bin, if more than 4 of the above values are marked as outliers for any marker, all cells measured during that time bin are removed. This step is shown in Figure 3.4, where the cells are plotted for each marker over time (the time at which a cell is analyzed by the machine). The colour represents the density while the axis shows the FI values for the corresponding markers.

3.3 Cell Population Identification

Automated Gating

Taking a cleaned FCS file as input, the next step is to cluster each cell into its respective cell populations via automated gating with flowdensity [66]. Again, gating is the process of manually identifying cell populations within a FCS file on multiple 2D scatterplots of cells analyzed on two markers at a time.
The order in which these markers are analyzed is laid out in a gating strategy. Figure 3.5 illustrates a gating strategy and its importance using a toy example. Note that while this one has only two steps, a gating strategy usually consists of many steps, or 2D scatterplots, before all the cell population borders are defined. In this thesis, a border is equivalent to a gate: a FI threshold value for a marker. Threshold values are defined such that they separate cells into cell populations containing cells with a FI value either greater than or lower than the threshold(s). Let us suppose we are given a first 2D scatterplot on marker CD11b and its threshold b. The cells here

Figure 3.5: FCM Data Processing Pipeline: the gating strategy

would be divided based on whether their CD11b FI value is greater or less than the threshold. If a cell has a FI value greater than b, it is CD11b positive, or part of the CD11b+ population; otherwise it is CD11b negative, or part of the CD11b- population. After the first gating, only CD11b+ cells are used to form a second scatterplot on markers CD8 and Ly6C. This scatterplot is used to establish thresholds a and c. In this case, the human expert designed the gating strategy this way because they determined a biologically valid threshold a to be the valley in the second scatterplot (plotting cells in population CD11b+) of Figure 3.5. If the gating strategy is not used, then a (as shown by the red dashed line) on the first scatterplot can easily be misinterpreted as the threshold. Therefore, this project opts to use a gating strategy in order to incorporate expert knowledge and allow for comparability across files. The panel 2 gating strategy used for the IMPC Sanger Centre data is the one used in [1].

Figure 3.6: FCM Data Processing Pipeline: Automated gating and its ability to reduce variance caused by residual and unknown variables [1]

Instead of gating by hand, we follow the gating strategy and find the thresholds using the R package flowdensity [66]; this is the automated part of automated gating. flowdensity is set according to the gating strategy such that it finds a threshold based on common density distribution scenarios. The three most common density distribution scenarios and their thresholds are:

1. Bimodal (or bimodal after smoothing and/or selection of two target modes, i.e. peaks on a density distribution): set a threshold on the valley.

2. Unimodal: set a threshold on the left or right side of the peak at the point where the slope changes most rapidly, or where the density curve's third derivative is maximal.

When completed, this step outputs L marker thresholds per FCS file, one for each marker. These thresholds may differ slightly between files. The rationale for gating in an automated fashion rather than manually is shown in Figure 3.6 [1]. It shows that automated gating procedures decrease the variance caused by unrecorded residual factors, such as human bias, and amplify the variance caused by experimentally interesting variables, such as the date of FCS file creation, the gender of the mouse for which the FCS file is created, and the weight of the original biological specimen. Further justification for using automated procedures is given elsewhere in this thesis.

Cell Population Enumeration

Now the ingredients for cell population enumeration are ready. Cell population enumeration by FlowType [75] takes as input the L thresholds and the pre-processed FCS file's $R \times L$ matrix, and outputs a vector of length $m = 3^L$. This allows all the FCS files from a data set to be collated into an $n \times m$ matrix, where n is the number of FCS files.

To illustrate what FlowType does, suppose we have a FCS file with cells analyzed on markers CD8, CD11b, and Ly6C. FlowType goes through the file and counts how many cells are positive for CD8, that is, part of the CD8+ population. This is the cell count for CD8+. The same is done for CD8- (which is the total number of cells minus the cell count for CD8+), CD11b+, CD11b-, Ly6C+, and Ly6C-. These cell populations are marked with one marker and therefore reside in layer $l = 1$. The same is then done for the child cell populations of each of these cell populations. For example, two of CD8+'s child cell populations are CD8+CD11b+ and CD8+CD11b-, where CD8+CD11b+ contains cells that are positive for both markers CD8 and CD11b.
A child population may have multiple parent populations, and it never has a larger cell count than any of its parents. Conversely, the cell count of the parent cell population CD8+ is the sum of the counts of either pair of its child populations: CD8+CD11b+ and CD8+CD11b-, or CD8+Ly6C+ and CD8+Ly6C-. Again, the cell populations marked with two markers are in layer $l = 2$, and so on. FlowType does this until layer L, outputting a cell count vector of length $3^L$ per FCS file [4]. This process is similar to frequent item-set enumeration in frequent pattern mining [2]. Figure 3.7 shows this toy example as a cell hierarchy.

Cell Hierarchy

The cell hierarchy is a representation of this length $3^L$ cell count vector. It is a structured collection of all possible cell populations. To formalize this hierarchy, we introduce some notation.
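The enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FlowType implementation: the function name `enumerate_populations`, the toy FI matrix, and the thresholds are all hypothetical, and a cell is taken to be positive for a marker when its FI value is strictly greater than that marker's threshold.

```python
from itertools import product

import numpy as np


def enumerate_populations(fi, thresholds, markers):
    """Count cells in every population (hypothetical helper, not FlowType).

    fi         : (cells x L) array of FI values
    thresholds : length-L array, one gate per marker
    markers    : length-L list of marker names
    Returns a dict mapping a population name to its cell count.
    """
    positive = np.asarray(fi) > np.asarray(thresholds)  # (cells x L) booleans
    counts = {}
    # Each marker is unused (None), positive (True) or negative (False),
    # which gives 3^L combinations in total.
    for status in product((None, True, False), repeat=len(markers)):
        mask = np.ones(positive.shape[0], dtype=bool)
        name = ""
        for r, s in enumerate(status):
            if s is None:
                continue
            mask &= positive[:, r] == s
            name += markers[r] + ("+" if s else "-")
        counts[name or "all cells"] = int(mask.sum())
    return counts


# Toy data: 5 cells measured on 2 markers, both gated at FI = 1.
fi = np.array([[2, 2], [2, 0], [0, 2], [0, 0], [2, 2]], dtype=float)
counts = enumerate_populations(fi, [1.0, 1.0], ["CD8", "CD11b"])
print(counts)
```

With L = 2 markers the dictionary holds 3^2 = 9 populations, and each parent count equals the sum of a pair of its children, as stated in the text.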

Figure 3.7: Cell Hierarchy: A Representation of the FCS file

Each FCS file $i$ contains a cell hierarchy in the form of a structured directed acyclic graph $G = (V, E)$ (we write $V_i$ and $E_i$ when the file needs to be made explicit). Its elements are as follows.

$V = \{v_1, \dots, v_j, \dots, v_m\}$ are the nodes $v_j = (mar_j)$, or the cell populations. $mar_j$ is a set that contains $0 \le l \le L$ marker statuses, with $l = |mar_j|$, drawn from $l$ of the marker subsets of $Ma$:

\[ Ma = \{\{ma_1\}^2, \dots, \{ma_r\}^2, \dots, \{ma_L\}^2\}, \qquad \{ma_r\}^2 = \{ma_r^+, ma_r^-\}, \]

such that a cell can show up with either a positive or a negative status for a marker, but not both (e.g. a cell population could be $mar_j = \{ma_1^+, ma_3^-, ma_5^+\}$, but cannot be $mar_j = \{ma_1^+, ma_1^-, ma_5^+\}$).

The total number of nodes is:

\[ |V| = m = 3^L = \sum_{l=0}^{L} 2^l \binom{L}{l} \]

Among these, there are $2^l \binom{L}{l}$ nodes in each layer $l$, such that each node in layer $l$ has $2^l \binom{L}{l} - 1$ siblings (i.e. nodes that share at least one marker status or parent node).

Each node has $|v_j^{in}| = l$ incoming edges, or parent node(s), and $|v_j^{out}| = 2(L - l)$ outgoing edges, or child node(s).

There is one root node. A root node is on layer $l = 0$ and only has outgoing edges.

There are $2^L$ leaf nodes. A leaf node is on layer $l = L$ and only has incoming edges.

The cell hierarchy contains edges $E_i = \{e_{jk}\}$ only between nodes $v_j, v_k$ that have a direct child/parent relationship. In other words, an edge $e_{jk} = (v_j, v_k)$ is defined by the nodes it connects, such that the edge points from node $v_j$ to $v_k$, where $v_j$ is a parent node of $v_k$ ($v_j \supset v_k$, i.e. the cell population of node $v_k$ is a sub-population of, or contained within, the cell population of node $v_j$), and $|mar_j| = |mar_k| - 1$.

The total number of edges is:

\[ |E_i| = s = \sum_{l=1}^{L} l \, 2^l \binom{L}{l} = 3^{L-1} \cdot 2L = 3^L \, \frac{2L}{3} = |V_i| \, \frac{2L}{3} \]
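As a sanity check on the counts above, the hierarchy can be built explicitly for a small number of markers. The sketch below is illustrative only: the encoding of a node as a tuple with entries 0 (marker unused), +1 (positive), and -1 (negative), and the function names, are assumptions made for this example rather than part of the thesis pipeline.

```python
from itertools import product
from math import comb


def build_hierarchy(L):
    """Return (nodes, edges) of the cell hierarchy for L markers.

    A node is a length-L tuple: 0 = marker unused, +1 = positive, -1 = negative.
    An edge (parent, child) adds exactly one marker status to the parent.
    """
    nodes = list(product((0, 1, -1), repeat=L))
    edges = []
    for parent in nodes:
        for r in range(L):
            if parent[r] == 0:  # marker r is still free in the parent
                for status in (1, -1):
                    child = parent[:r] + (status,) + parent[r + 1:]
                    edges.append((parent, child))
    return nodes, edges


def layer(node):
    """Layer l of a node = number of markers with an assigned status."""
    return sum(1 for s in node if s != 0)


L = 3
nodes, edges = build_hierarchy(L)
per_layer = [sum(1 for v in nodes if layer(v) == l) for l in range(L + 1)]
print(len(nodes), len(edges), per_layer)
```

For L = 3 this gives 27 nodes and 54 edges, with 1, 6, 12, and 8 nodes on layers 0 to 3, in agreement with the closed-form expressions for the node and edge totals.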

Figure 3.8: FCM Data Processing Pipeline: Cell Count Normalization (Example 1)

We will refer to each node or edge in the set of nodes V or edges E as an attribute of their respective element.

3.4 Cell Count Normalization

Once the $n \times m$ matrix is collated, we have obtained our first feature: the absolute cell count for each node (cell population) in a single FCS file (i.e. FCS file $i$ can be described by the cell count feature, a vector of cell counts $x_i^{count} = \{x_{i1}^{count}, \dots, x_{ij}^{count}, \dots, x_{im}^{count}\}$, where each element $x_{ij}^{count}$ is the cell count of node (cell population) $v_j$). However, before the cell count features can be compared across FCS files, they must be normalized. Normalization is usually done per sample, by converting each count to a proportion:

\[ x_i^{countProp} = \frac{x_i^{count}}{x_i^{countTotal}} \]

where $x_i^{countProp}$ is the normalized version of $x_i^{count}$, obtained by dividing the original cell count $x_i^{count}$ by the total number of cells in FCS file $i$, $x_i^{countTotal}$, yielding a proportion value.

Although popular, percentage values may be misleading when analyzing cell production changes between different classes of FCS files [44]. To illustrate the issue with using proportions, Figure 3.8 shows a hypothetical scenario where we sample 20 cells from a WT mouse and a KO mouse. In reality, knocking out the gene causes the mouse's immune system to double its production of total immune cells, via a tripling of the CD8+ cell population. If the same number of cells is sampled from both mice and we use the associated proportion values, we may misinterpret the effect of knocking out the gene as also causing a decrease in production of cell populations CD11b+ and CD8+Ly6C+.

To prevent such misinterpretations, we use a modified version of the cell count normalization method called the trimmed mean of M-values (TMM) [86]; instead of basing it on percentage values, we base it on the absolute cell count.
First, a reference file $x_{ref}^{count}$ is chosen; in this case, a FCS file from a control with a median total cell count. Each of the other, non-reference files $i$ then divides its absolute

cell counts $x_i^{count}$ point by point by those of the reference file to obtain a vector of ratios $x_{i/ref}^{count}$. We then convert these ratios into log scale:

\[ t_i = \log_2\left(\frac{x_i^{count}}{x_{ref}^{count}}\right) \]

A sample visualization of these ratios can be seen in Figure 3.9.

Figure 3.9: FCM Data Processing Pipeline: Cell Count Normalization (Example 2)

Before obtaining the TMM from these ratios, we weight them by their cell counts. In FCM, very small cell counts can sometimes be attributed to noise or minor errors in previous steps of the pipeline. For instance, as we are unable to sample large numbers of cells from a rare cell population, we see larger variance in its cell count (see Figure 3.9). Hence, we are less confident when statistically analyzing the cell counts of rare cell populations. An example of error in gating is as follows. In Figure 3.5, the cells are spread out as a distribution rather than as precise clusters. Therefore, a slight change in the gates could cause the cells near the gates to be categorized into completely different cell populations. A small number of misclassified cells would not affect larger cell populations as much as it would affect rarer cell populations. Hence, we use an optional weight vector, the inverse asymptotic variance $w_i$, to reduce the influence of rare cell populations on the normalization factor. $w_i$ is calculated using the delta method [18], based on how large the cell counts are:

\[ w_i = \frac{x_{ref}^{countTotal} - x_{ref}^{count}}{x_{ref}^{countTotal} \, x_{ref}^{count}} + \frac{x_i^{countTotal} - x_i^{count}}{x_i^{countTotal} \, x_i^{count}} \]

Also based on cell counts, we produce an additional weight vector $z_i$:

\[ z_i = \frac{1}{2}\left[\log_2\left(x_i^{countProp}\right) + \log_2\left(x_{ref}^{countProp}\right)\right] \]

In this thesis, we disregard the cell counts of all populations $j$ with $z_{ij} < \alpha$, where $\alpha$ is a cutoff with a default value of $\alpha = 10$.

Finally, we combine the weights and the ratios to obtain the TMM. The idea is to assume that the cell counts of most nodes are not significantly different between FCS files. In other words, the number of cells produced for most cell populations does not differ significantly, and an experiment only correlates with a significant change in production for a minority of cell populations. Then, the cell counts that are approximately the same with respect to the reference sample must be the ones whose ratios occur at the highest frequency across all nodes. This ratio is the TMM $f_i$, and it is obtained by taking a weighted mean of the ratios $t_i$ over the populations that were not disregarded:

\[ g_i = \frac{\sum_{\{j \mid z_{ij} \ge \alpha\}} t_{ij} / w_{ij}}{\sum_{\{j \mid z_{ij} \ge \alpha\}} 1 / w_{ij}}, \qquad f_i = 0.5^{\,g_i} \]

As such, we obtain a TMM for each FCS file. To normalize the cell counts, for each non-reference file $i$, we multiply all of its cell counts by its TMM $f_i$ (see the blue line in Figure 3.9):

\[ x_i^{count} \leftarrow f_i \, x_i^{count} \]

Note: if the cell count of a single node $v_j$ changes, this change would affect the cell count of its parent nodes (and the parent nodes of those parent nodes, i.e. its ancestors), whose cell count is the sum of their child nodes' cell counts. The same change also implies a change in cell count amongst its child nodes (and the child nodes of those child nodes, i.e. its descendants). As a result, a change in a single node's cell count would mean a change in cell count for a maximum of $l^2$ ancestor nodes and $(L - l)^3$ descendant nodes, where $l$ is the cell hierarchy layer on which $v_j$ resides.
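The normalization factor defined above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the thesis code: the function name `tmm_factor` is hypothetical, the cutoff `alpha` defaults to negative infinity here so that no population is disregarded, and populations with a zero count or containing every cell are skipped (an assumption of this sketch) because their log ratio or variance term would be undefined.

```python
import numpy as np


def tmm_factor(x_i, x_ref, total_i, total_ref, alpha=float("-inf")):
    """Count-based TMM factor f_i for file i against a reference file.

    x_i, x_ref         : cell counts per population (same node order)
    total_i, total_ref : total number of cells in each file
    alpha              : populations with z < alpha are disregarded
                         (default keeps everything; an assumption here)
    """
    x_i = np.asarray(x_i, dtype=float)
    x_ref = np.asarray(x_ref, dtype=float)
    # Assumption: drop populations that are empty or contain all cells,
    # since log2(0) and a zero variance term are both undefined.
    ok = (x_i > 0) & (x_ref > 0) & (x_i < total_i) & (x_ref < total_ref)
    x_i, x_ref = x_i[ok], x_ref[ok]

    t = np.log2(x_i / x_ref)  # log ratios t_i
    w = ((total_ref - x_ref) / (total_ref * x_ref)
         + (total_i - x_i) / (total_i * x_i))  # asymptotic variance term w_i
    z = 0.5 * (np.log2(x_i / total_i) + np.log2(x_ref / total_ref))

    use = z >= alpha  # populations that are not disregarded
    g = np.sum(t[use] / w[use]) / np.sum(1.0 / w[use])  # weighted mean g_i
    return 0.5 ** g  # f_i


# Toy example: file i has exactly twice the reference counts everywhere.
x_ref = np.array([100, 50, 30, 20])
x_i = 2 * x_ref
f_i = tmm_factor(x_i, x_ref, total_i=200, total_ref=100)
normalized = f_i * x_i
print(f_i, normalized)
```

In this toy case every log ratio equals 1, so the weighted mean is 1 and the factor is 0.5, which scales the counts of file i back onto the reference.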
Taking this further, we can also say that all cell production changes occur only in the cell population nodes on the last layer of the cell hierarchy (i.e. the leaf nodes). In turn, all cell count changes in the nodes on higher layers are simply the results of leaf node cell count changes. Therefore, the assumption that the production of cells for most cell populations is not affected by the experiment still holds when the total number of leaf nodes affected, along with $L^2$ ancestor nodes each, remains a minority of the total number of cell population nodes in the cell hierarchy.

3.5 Feature Design

After obtaining a normalized $n \times m$ matrix, a series of features is extracted for each row, or FCS file, in an attempt to expose meaningful signals within each file. The features are values associated with either the nodes or the edges of the cell hierarchy, the normalized cell count $x_i^{count}$ being one feature associated with the nodes. In this thesis, we extract and explore nine features for each FCS file $i$. Formally, each feature is represented by a vector $x_i$. If the feature is associated with the nodes, the feature vector includes a value for each node, $x_i = \{x_{ij}\}$, such that the value associated with node $v_j$ is $x_{ij}$. If the feature is associated with the edges, the feature vector includes a value for each edge, $x_i = \{x_{ijk}\}$, such that the value associated with the edge $e_{jk}$, connecting nodes $v_j$ and $v_k$, is $x_{ijk}$.

Furthermore, this thesis considers two types of features for both nodes and edges. The first type is the absolute features. These represent a FCS file as a whole. The second type is the phenodeviant features. These are created by comparing the experiment FCS files against the control FCS files, such that the features describe FCS files only in terms of the effects correlated to the experiment.

Again, we only implement the rest of the methodologies in this section on the FlowCAP-II benchmark FCM data set, prepared in [66, 74]. This is because the IMPC data set would require methodologies to deal with the fact that we have no ground truth class labels, which is out of the scope of this thesis. Nevertheless, we will use IMPC to give intuition and context for the methodologies used.

Here, the edge features represent the structural component of the cell hierarchy representation of a FCS file.
A main goal and hypothesis of this thesis is to show that features incorporating the cell hierarchy, the edge features, can reveal data signals that facilitate the creation of distance matrices that space out control and experiment FCS files. Another hypothesis we present is that isolating phenodeviant features should provide even more informative signals regarding the relation between experiment FCS files.

Absolute Features

Absolute features are features directly extracted from a FCS file. Utilizing the cell hierarchy, the absolute features of a FCS file are described below, separately according to whether they describe the nodes or the edges. Examples of absolute features can be seen in Figure 3.10.

Node features $x_i^{\langle feature \rangle} = \{x_{ij}^{\langle feature \rangle}\}$ are features that assign at most one descriptive value to each node in a cell hierarchy. A feature that has been

presented so far is the normalized cell count. Hence a FCS file's node feature would contain at most $m = 3^L$, or $O(3^L)$, values.

Figure 3.10: FCM Data Processing Pipeline: Feature Design (Examples for Absolute Features)

For each node $v_j$ in FCS file $i$, its absolute features $x_{ij}^{\langle feature \rangle}$ include the following.

1. (CountAdj) is the normalized cell count: $x_{ij}^{count}$. Note that the normalization does not affect its cell hierarchy properties, as all the cell counts in a single FCS file are multiplied by a single normalization factor.

2. (Child_entropy) is the average entropy of the proportions of $v_j$'s child nodes' cell counts over its own cell count. The rationale here is that during manual gating, smaller and smaller cell populations are isolated. This feature allows us to understand how the make-up of a specific cell population may have changed locally rather than globally. It is calculated as:

\[ x_{ij}^{entropy} = -\frac{2}{|j_{child}|} \sum_{k \in j_{child}} x_{ijk}^{prop} \ln\left(x_{ijk}^{prop}\right), \qquad x_{ijk}^{prop} = \frac{x_{ik}^{count}}{x_{ij}^{count}} \]

where $j_{child}$ is the set of direct child nodes of $v_j$. Note that this can only be calculated for non-leaf nodes. Therefore, this feature is of length $|x_i^{entropy}| = 3^L - 2^L$.

3. (Child_entropy) can also be extended to (Parent_entropy), where we obtain the entropy of the parent nodes' cell counts in proportion to their child's. The only change that needs to be made is to replace $j_{child}$ with $j_{parent}$, where $j_{parent}$ is the list of all nodes that are parents of $v_j$. In this case, $|x_i^{entropyP}| = 3^L - 1$ is the number of non-root nodes, there being only one root node (the cell population containing all cells).

As with entropy, edge features $x_i^{\langle feature \rangle} = \{x_{ijk}^{\langle feature \rangle}\}$ highlight the relation between child and parent nodes. Edge features assign at most one value to each single edge, or group of edges. Therefore, a FCS file's edge feature would contain at most $s = 3^L \frac{2L}{3}$ values. For each edge $e_{jk}$ in FCS file $i$, its features $x_{ijk}^{\langle feature \rangle}$ include the following.

1. (Child_prop) is the ratio of the cell counts of the nodes that the edge connects. Here, it will always be a child node's cell count $x_k^{count}$ over its parent node's cell count $x_j^{count}$:

\[ x_{ijk}^{prop} = \frac{x_{ik}^{count}}{x_{ij}^{count}} \]

2. (Child_pnratio) is the ratio of a cell population's positive and negative versions:

\[ x_{ijr}^{pnratio} = \frac{x_{ik}^{count}}{x_{ik'}^{count}} \]

where $mar_j \subset mar_k, mar_{k'}$, $|mar_k| = |mar_{k'}|$, and $\{mar_k \setminus mar_{k'}, \; mar_{k'} \setminus mar_k\} = \{ma_r\}^2$ for a marker $r$; that is, $v_k$ and $v_{k'}$ are the two children of $v_j$ that differ only in the positive or negative status of marker $r$. As such, this feature contains one value for every two edges, $e_{jk}$ and $e_{jk'}$, so each file has a feature vector of length $s/2$.

Phenodeviant features

Phenodeviant features are defined as features containing information on how significantly different the cell population nodes or edges of the experiment FCS files are from those of the control FCS files.
These features help identify the immunophenotypic differences between experiment FCS files and control FCS files, and can act as a filter to remove insignificant feature attributes (i.e. nodes/edges). To get an idea of such features, consider one example of a phenodeviant node feature, (LogFold); here we use the natural logarithm ln. Suppose for node A, the experiment file

indicates a normalized cell count of 10, while the control files have a mean normalized cell count of 50. Then the LogFold feature value for node A of the experiment FCS file would be $-1.61 = \ln\left(\frac{10}{50}\right)$. On the other hand, the phenodeviant feature would not be necessary for a control FCS file, because we know it should not be any different from the other controls. More features are presented later in this section.

IMPC: For the IMPC Sanger Centre data, the controls are the FCS files from the WT mice, while there are multiple types of experiments, that is, mice with different genes knocked out. In the IMPC data set, there are several confounding factors that need to be accounted for prior to extracting the phenodeviant features. One of these is the variable date. An example of how to reduce its effects is as follows. We first separate the FCS files based on date. The WT files are plotted by date of creation and then segmented into groups. These segments are separated at dates where the total normalized cell count of the FCS files changes drastically. These changepoints are detected by the method of [22], using the modified Bayes information criterion (MBIC) [116]. At most 70 WT files, those created on the dates closest to the date the KO file was created and lying in the same date segment as the KO file, are used to create the KO file's phenodeviant features. As the phenodeviant features reflect how different an experiment FCS file is from the control FCS files, we only produce phenodeviant features for the experiment files. In other words, for each experiment KO FCS file, we obtain additional phenodeviant features on top of its absolute features.

FlowCAP-II: To keep the terminology consistent, recall that the controls in the FlowCAP-II data set are the FCS files extracted from healthy individuals, and the experiments are the FCS files from AML positive patients.
Since we only have one type of experiment, we simulate the IMPC scenario of multiple experiments by randomly assigning half the healthy individuals as the controls and the other half as an experiment that should be different from the AML FCS files. As such, the phenodeviant features are extracted for half the healthy individuals and all the AML patients. Additionally, each experiment is only compared against the control FCS files that are analyzed on the same set of markers, or panel, as the experiment FCS file in question.

Examples of phenodeviant features can be found in Figure 3.11. The following describes node-based phenodeviant features $x_i^{\langle feature \rangle} = \{x_{ij}^{\langle feature \rangle}\}$. For each node $v_j$ in experiment FCS file $i$, its features $x_{ij}^{\langle feature \rangle}$ compared

Figure 3.11: FCM Data Processing Pipeline: Feature Design (Examples for phenodeviant Features)
