Multi-Atlas Segmentation using Clustering, Local Non-Linear Manifold Embeddings and Target-Specific Templates


Multi-Atlas Segmentation using Clustering, Local Non-Linear Manifold Embeddings and Target-Specific Templates

CHRISTOPH ARTHOFER, BSc., MSc.

Thesis submitted to the University of Nottingham for the degree of Doctor of Philosophy

September 2017

Contents

Abstract
Acknowledgement

1 Introduction
  Medical image segmentation
  Approaches to brain MR segmentation
    General overview
    Classical segmentation methods
    Higher-level segmentation methods
    Atlas-based segmentation
    Multi-atlas segmentation (MAS)
  Our approach
    Overview of our approach
    Contributions
  Outline

2 Anatomical atlas building and MAS
  Introduction
    Increasing the number of atlases
    Joint group-wise registration and segmentation methods
    Patch-based methods with linear registration
    Reducing the computational burden
  Pre-processing
    Intensity inhomogeneity correction
    Skull-stripping
    Tissue classification
  Image registration
    Similarity measure
    Transformation
    Optimisation
  Anatomical atlases and label probability maps
    Single-subject atlases
    Population atlases
    Unbiased probabilistic atlas construction
    Creation of label maps
    Tools, datasets and evaluation strategies
  Our approach to template building
    Pre-processing
    Estimating a population template
    Transferring the label maps to the target
  Experiments
  Discussion

3 Target-specific template
  Introduction
  Comparison basis
    Image intensities
    Non-image information
    Registration consistency
    Anatomical geometry
    Deformation fields
  Similarity metrics
    Basic similarity metrics
  Manifold learning
    Principal component analysis
    Isomap
    Local linear embedding
  Our approach to TST building
    Manifold embedding
    TST construction
  Experiments
    Reconstruction of a GM image slice with eigenimages
    Reconstruction of a deformation field with eigenimages
    Comparison of nonlinear dimensionality reduction methods and parameter optimisation
    Method validation on the LONI dataset
    Method validation on the ADNI-HarP dataset
    Method validation on the IBSR dataset
    Method validation on the NIREP-NA0 dataset
    Method validation on the MICCAI 2012 dataset
    Method validation on the MICCAI 2013 dataset
  Discussion

4 Label Fusion
  Majority voting
  Weighted voting
  STAPLE
  SIMPLE
  STEPS
  MALP
  Joint label fusion
  Patch-based label fusion
  Our approach to propagation, fusion and binarisation
  Experiments
    Evaluation on the LONI dataset
    Evaluation on the ADNI-HarP dataset
    Evaluation on the IBSR dataset
    Evaluation on the NIREP-NA0 dataset
    Evaluation on the MICCAI 2012 dataset
    Evaluation on the MICCAI 2013 dataset
  Discussion

5 Dynamically adjusted labels
  Offline learning
  ROI selection
  Clustering methods
    K-means
    Affinity propagation
    Hierarchical agglomerative clustering
  Our approach to label adjustment
    Dividing the labels into clusters
    Combining the clusters
  Experiments
    Evaluation on the ADNI-HarP dataset
    Evaluation on the MICCAI 2013 dataset
    Evaluation on the NIREP-NA0 dataset
  Discussion

6 Application to Tourette's images
  Introduction
  Healthy brain development
  Tic disorders and Tourette's Syndrome
  Method
  Results
  Discussion

7 Conclusion and perspectives
  Chapter overview
  Conclusion
  Perspectives

Bibliography

7 Abstract Multi-atlas segmentation (MAS) has become an established technique for the automated delineation of anatomical structures. The often manually annotated labels from each of multiple pre-segmented images (atlases) are typically transferred to a target through the spatial mapping of corresponding structures of interest. The mapping can be estimated by pairwise registration between each atlas and the target or by creating an intermediate population template for spatial normalisation of atlases and targets. The former is done at runtime which is computationally expensive but provides high accuracy. In the latter approach the template can be constructed from the atlases offline requiring only one registration to the target at runtime. Although this is computationally more efficient, the composition of deformation fields can lead to decreased accuracy. Our goal was to develop a MAS method which was both efficient and accurate. In our approach we create a target-specific template (TST) which has a high similarity to the target and serves as intermediate step to increase registration accuracy. The TST is constructed from the atlas images that are most similar to the target. These images are determined in low-dimensional manifold spaces on the basis of deformation fields in local regions of interest. We also introduce a clustering approach to divide atlas labels into meaningful sub-regions of interest and increase local specificity for TST construction and label fusion. Our approach was tested on a variety of MR brain datasets and applied to an in-house dataset. We achieve state-of-the-art accuracy while being computationally much more efficient than competing methods. This efficiency opens the door to the use of larger sets of atlases which could lead to further improvement in segmentation accuracy. 6

Acknowledgement

Throughout the journey leading to this Ph.D. thesis I have been supported in various ways and influenced by many people, making it a difficult task to acknowledge each and every single one in this section. First and foremost, I would like to thank my primary supervisor Dr. Alain Pitiot for the countless meetings and inspiring discussions, the constructive feedback and comments, and the encouragement and motivation, all of which will be missed. I would also like to send my gratitude to my second supervisor Prof. Paul Morgan for sharing his extensive scientific experience and giving valuable advice. I am very grateful for having had such kind and thoughtful supervisors. Working as an MR scanner operator alongside my Ph.D. provided complementary experience on the MR acquisition side and was made possible by Jan Alappadan Paul and Prof. Penny Gowland. Thank you also to Prof. Stephen Jackson and the James Tudor Foundation for the generous funding support and the opportunity to work in his laboratory. I hope we will cross paths again in the future. I would also like to thank my examiners Dr. Juan Eugenio Iglesias and Prof. Tony Pridmore for their time and helpful feedback. I would like to thank my loving family, Elisabeth, Rainer, Christine, Karl and Berta, for their encouragement and support despite being 1204 km away. I will be forever grateful for their continued investment and confidence in me achieving my dreams. If it was not for them this Ph.D. would not have been possible. I am also very blessed to have Filipa in my life, who has spread optimism and given me strength along the way. Thank you to Darren, Antonis and Tom, who filled the office with a great atmosphere, and to Susi, Winti, Kathi, Martina, Sebastian, Niki, Johanna and Benjamin for their unwavering friendship.

Chapter 1

Introduction

Image segmentation is the process of partitioning an image into multiple components. While manual segmentation is time-consuming and poorly reproducible, automated methods, and in particular atlas-based segmentation, have been shown to be efficient alternatives. In this chapter the aims of image segmentation and various approaches to the segmentation of MR images will be presented. Due to our main interest in atlas-based segmentation methods, we will provide an overview of their basic underlying concept as well as advanced solutions, followed by an outline of our approach and contributions.

1.1 Medical image segmentation

Aims of medical image segmentation: Image segmentation, in the general sense, is defined as the partition of an image into non-overlapping homogeneous regions or objects, with respect to certain features or characteristics, and their separation from the background [166]. Segmentation, applied in the field of medicine and, in particular, to medical images acquired by magnetic resonance imaging (MRI), is commonly required for delineating anatomical regions of interest such as the brain, different tissue types such as grey matter or white matter, or regions associated with function, activity

or pathology (Fig. 1.1). It is often one of the first steps in a series of image analysis tasks such as the planning of medical treatment or surgery [61, 143], diagnosis and patient follow-ups [114], and the monitoring of disease progression or development [34]. Consequently, we require segmentation methods that provide accurate and reproducible results.

Figure 1.1: Aims of medical image segmentation on examples of segmented a) regions of interest, b) brain and c) grey matter.

Challenges in image segmentation: Typically, the segmentation of an image combines two main challenges. The first is to identify certain image features such as homogeneity, continuity or similarity, extracted from intensity values, differences in texture or gradients in border areas (Fig. 1.2). The second is to transform these extracted features into semantic entities and provide context. This makes it a classification task, one of the most difficult tasks in the image processing domain [28, 91, 210]. The segmentation of an image yields groups of voxels that belong to the same structure or region of interest, requiring each voxel of the image to be classified and assigned to one of the groups. Although various concepts have been studied extensively over the years, the growing number of different requirements and their associated challenges means that no general solution is applicable to problems from all disciplines [76].

Figure 1.2: Segmentation based on a) image intensities [28], b) gradients and c) texture [116].

Common problems in MRI: The acquisition of images with different modalities, scan protocols or scanners from different manufacturers, as well as the underlying physical signal acquisition itself, poses challenges such as [232]:

- intensity non-uniformities, e.g. bias fields, where voxels of varying intensities belong to the same tissue type,
- movement artefacts known as ghosting,
- partial volume effects, where voxel intensities represent a mixture of different tissue types,
- frequency/phase wrapping caused by aliasing,
- noise,
- poor image contrast and weak boundaries.

All of these artefacts can have an impact of varying degree on the quality of the image processing pipeline, which makes early detection and, where possible, correction crucial.

1.2 Approaches to brain MR segmentation

1.2.1 General overview

Although manual segmentation and interpretation by experts is still considered the gold standard, it does not come without drawbacks. It is a very time-consuming and cost-intensive task, since every voxel has to be assigned a label. Segmentation accuracy is restricted by the variability introduced by the operator(s): different experts produce different segmentations of the same structure (inter-observer variability), and even when the same expert segments an image repeatedly, variability between the resulting segmentations is introduced (intra-observer variability) [232]. An early assessment by Clarke et al. [53] outlined validation methods to measure reproducibility and compared various volume measurement studies. For example, the segmentation of the hippocampus with manually supervised methods showed inter- and intra-observer variability of 14% and 10% respectively. Although technical equipment for the acquisition and screening of MRI and segmentation protocols such as the Harmonized Hippocampal Protocol [35, 84] have improved since Clarke et al.'s study approximately 20 years ago, manual and manually supervised segmentation is still used as the gold standard. An evaluation of the commonly used public dataset from the more recent MICCAI 2013-SATA challenge showed inter-scan reliability of 68% for the basal forebrain and 79% for the middle occipital gyrus, measured with the Dice overlap coefficient [17]. In order to improve accuracy with respect to a manual gold standard, as well as reproducibility, a wide variety of automated segmentation algorithms have been developed. They can be classified in many ways [166], such as by their degree of automation (manual, semi-automatic, automatic), their spatial extent (local pixel-based, global region-based), or their concept (area-based, edge-based).
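The Dice overlap coefficient quoted in these reliability figures is straightforward to compute from two binary label volumes; a minimal NumPy sketch (the toy masks are illustrative, not real segmentations):

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks: 2|A intersect B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

# Two toy 'raters' delineating the same 2D structure
m1 = np.zeros((4, 4), dtype=bool); m1[1:3, 1:3] = True  # 4 voxels
m2 = np.zeros((4, 4), dtype=bool); m2[1:3, 1:4] = True  # 6 voxels
print(dice(m1, m2))  # 2*4 / (4+6) = 0.8
```

A Dice value of 1 indicates perfect agreement and 0 indicates no overlap, which is how the inter- and intra-observer figures above should be read.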
Methods can also be categorised into classical (thresholding, edge-based, region-based), statistical and neural network techniques [160]. In the following sections we will give an overview of commonly used classical

13 CHAPTER 1. INTRODUCTION 12 and higher-level methods. In general, the former are based on low-level image processing, while the latter use classification and clustering methods which can also incorporate a priori anatomical knowledge Classical segmentation methods One fast and simple method is global binary thresholding where a fixed threshold is used to split the image based on its pixels intensity values. Support for finding the ideal threshold is given by analysing the histogram of the entire image [120, 149, 164]. This approach was further extended for multilevel thresholding [242], allowing the original image to be divided into more than two classes and local thresholding [48] for more regional decision making. In general, thresholding is simple to implement and yields fast computation times, but is influenced by the amount of noise, intensity contrast and anatomical complexity in the image. In contrast, region growing aims to merge pixels into homogeneous, connected regions [2]. Every pixel in the neighbourhood around one or more starting points is examined and added to the region if a common homogeneity criterion is fulfilled. The outcome of the algorithm strongly depends on the chosen condition and is heavily influenced by the defined starting points. Conversely, instead of growing a region by merging pixels, one region can be divided into sub-regions based on a homogeneity criterion until no further splitting is possible [39]. In order to combine the advantages of both methods, a split-and-merge algorithm [103] was developed. Another computationally efficient group of methods is based on determining the edges of a structure. Commonly used edge detectors are the Sobel [192], Prewitt [158] or Canny [42] gradient operators based on first order derivatives, or the Laplacian operator based on second order derivatives. Due to the operators sensitivity to noise, smoothing the image is recommended. 
The watershed algorithm [32] considers the 2D grayscale gradient or topographic distance image of the target image as a 3D surface. The grayscale

values represent heights, with local minima as the sources of basins and increasing values as ridges. Starting at each source, the basins are flooded. As the water rises and water sources meet, a barrier, i.e. a watershed, is created. The method tends to over-segment images, which usually requires the merging of partitions in a post-processing step.

1.2.3 Higher-level segmentation methods

Clustering methods are unsupervised methods which aim at partitioning the data into non-overlapping groups based on certain homogeneity attributes, as measured by the clustering criterion. Since each clustering technique implicitly imposes a structure on the data, the method should be chosen carefully by considering the data under analysis. Commonly used approaches include hierarchical methods and methods based on a sum-of-squared-error criterion such as k-means [137]. Clustering methods will be discussed in more detail in Section 5.3. Alternatively, image segmentation can be seen as a pixel classification problem. Statistical approaches aim at characterising a structure and assigning it a category based on its features or attributes. Classifiers belong to the group of supervised methods, which discriminate new input data based on a classifier learned from a pre-labelled training set. Examples of supervised segmentation methods include active shape [63] and active appearance models [62], which are based on active contour models [54, 115] and use deforming curves or surfaces to segment a structure. These curves are controlled by internal and external forces, where external forces pull the curve towards a desired object shape and internal forces preserve the local tension or smoothness of the curve. Due to the availability of large annotated datasets, neural networks (NNs) have gained a lot of attention. NNs consist of a large number of interconnected nodes, or neurons.
Each neuron has a set of incoming connections and one outgoing connection, which represent weighted inputs and the output respectively. By organising these networks into layers, the weights

and topological relationships between the variables can be updated and dynamically adjusted to a task, which allows the modelling of complex nonlinear relationships. Networks with multiple layers between input and output are considered deep neural networks. One commonly used type of NN is the convolutional neural network (CNN), in which the input image is convolved with kernels at every layer. The convolutions generate sets of features which are non-linearly transformed. CNNs usually include pooling layers, where the output of previous layers is combined by applying filters. CNNs can classify each pixel individually or produce likelihood maps. Another important and powerful high-level technique, which incorporates anatomical a priori information, is atlas-based segmentation [170]. In the following sections we will introduce single- and multi-atlas segmentation before presenting an overview of our approach.

1.2.4 Atlas-based segmentation

The traditional concept of atlas-based segmentation uses a single model of the human brain, which will be referred to as an atlas, to segment a new anatomical intensity brain image, referred to as the target (Fig. 1.3). The atlas, in its simplest form, consists of one individual anatomical intensity image and its corresponding label map (see Section 2.4.1). It is important to note that in the literature the term atlas is sometimes not further specified and might refer to an individual single atlas or a probabilistic atlas. In this manuscript special care will be taken to indicate the type of atlas if it is not evident from the context. The target image is aligned to the anatomical atlas image by estimating the spatial transformation between them using image registration. This process aims at finding the best mapping between the anatomical structures in the target and the corresponding anatomical structures in the atlas intensity image.
The label maps from the atlas are transferred to the target by applying the inverse of the same transformation

to them.

Figure 1.3: Single atlas-based segmentation concept.

This algorithm crucially depends on the quality of the registration, which is why in the literature it is often referred to as a registration problem rather than a segmentation problem. One of the first implementations of this basic concept for 3-D information propagation was presented by Collins et al. [56] and Dawant et al. [69]. The advantage of their concept is its independence from the selected atlas labels, which allows the use of multiple different atlas label maps without the need to re-compute the mapping between atlas and target space. The registration can be performed with a time-efficient affine registration or with a nonlinear method. In general, a linear registration alone leads to a poor alignment and, in turn, poor segmentation quality. Commonly, a combination of linear and computationally more expensive nonlinear registration methods is used, which leads to a better alignment of anatomical structures and consequently better segmentation results. For instance, Christensen et al. used a diffeomorphic registration method [50], which combined a low-dimensional transformation for global shape differences with a high-dimensional vector field for the alignment of fine structures, based on linear elasticity and fluidity models.
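The propagation of a label map through an estimated transformation can be sketched as follows; the warp uses nearest-neighbour lookup so that the result still contains valid integer labels (a toy 2D example, not the actual registration model of any cited method):

```python
import numpy as np

def propagate_labels(atlas_labels, deformation):
    """Warp an atlas label map into target space.

    deformation[d] holds, for every target voxel, the d-th coordinate of the
    corresponding point in atlas space (a dense mapping from target to atlas).
    Nearest-neighbour lookup keeps the labels integral instead of blending them.
    """
    idx = tuple(np.clip(np.rint(c).astype(int), 0, s - 1)
                for c, s in zip(deformation, atlas_labels.shape))
    return atlas_labels[idx]

# Toy example: a 5x5 label map and a translation by one voxel
labels = np.zeros((5, 5), dtype=np.int32)
labels[2, 2] = 7                                 # a single labelled voxel
r, c = np.meshgrid(np.arange(5), np.arange(5), indexing='ij')
D = np.stack([r - 1.0, c])                       # target (r, c) samples atlas (r-1, c)
warped = propagate_labels(labels, D)
print(warped[3, 2])  # the label has moved one row down: 7
```

In the single-atlas setting this step is applied once; the multi-atlas methods discussed later reuse exactly this propagation, once per atlas, before fusing the candidates.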

Their atlas-based model was used to automatically label a cortical surface by elastically deforming a pre-labelled 3-D surface atlas so that the cortical sulci of the atlas align with the corresponding image features in the target image [68, 179]. The segmentation accuracy crucially depends on the quality of the mapping between the target and atlas images. A one-to-one mapping between every point in the two images might not always be possible due to high anatomical variability. Over the last decades, anatomical variability has been investigated in the whole brain as well as in particular structures, showing that volume, the distribution of grey and white matter, and anatomical shape vary considerably between individuals [9, 24, 152, 231]. The quality of the registration is also limited by the registration model and can lead to registration errors (see Section 2.3). In an extensive evaluation, one linear and 14 nonlinear registration methods were compared on a set of 80 manually labelled brain images [122]. One of the surrogate evaluation strategies measured the overlap as the agreement of deformed source and target label volumes averaged across all regions and brain pairs. A maximum overlap of approximately 71% was reached for the LONI-LPBA40 dataset [185] with the SyN registration method [21]. One way to improve registration accuracy is to use the atlas which is expected to lead to the most precise registration and, in turn, segmentation. The comparison of different atlas selection strategies has shown that an individual atlas with an intensity image more similar to the target can achieve better segmentation accuracy than a randomly selected atlas [57, 85, 169]. The choice of atlas also depends significantly on the ROI [72]. While a single atlas might provide the best possible result for some ROIs, it does not necessarily lead to the best outcome for other ROIs.
It was concluded that there is no single atlas obtained from one individual subject that would provide the overall most accurate segmentation of a new target image. Consequently, it has been suggested to use more than one atlas to capture a wider range of inter-subject variability. In a series of studies

the use of multiple individual atlases over a single individual atlas has shown improved segmentation accuracy [4, 169, 239]. One way to make use of multiple atlases is, as previously indicated, to select the single atlas most similar to the target from a set of atlases and use it for segmentation. Although this strategy accesses the information of multiple atlases, it does not fully take advantage of all of them. Another way to incorporate the labelling information of multiple atlases is to build an average, population, statistical or probabilistic atlas (Fig. 1.4), which provides a value for each voxel indicating its probability of being part of a particular label (see Sections ). Similar to the procedure with an individual atlas, the probabilistic atlas intensity image is registered to the target and its corresponding probabilistic label maps are propagated.

Figure 1.4: Probabilistic atlas-based segmentation concept.

The use of one standardised space presents a large advantage in terms of performance. It requires only one nonlinear registration to the target at runtime, and the impact of registration errors can be partly alleviated by the probabilistic maps. All individual atlases and registered targets are linked via their

deformation fields, which makes the mapping between any two of them possible by composing the respective deformation fields. In comparison, without an average atlas, computationally expensive nonlinear registrations have to be performed between each of the individual atlases and every new target. Although computationally very efficient, the use of a probabilistic atlas has been shown to provide less accurate segmentation results than the direct use of each individual atlas for target segmentation, which will be referred to as multi-atlas segmentation and explained in more detail in the next section.

1.2.5 Multi-atlas segmentation (MAS)

In the previous section, segmentation based on a single atlas was introduced. The single atlas of choice can either be from one individual, for example selected as the best possible match from a set of individual atlases, or be represented as a probabilistic atlas constructed from multiple individual atlases. In contrast to a probabilistic atlas, multi-atlas segmentation (MAS) directly utilises the label information from all individual atlases of a set. A new target image is nonlinearly registered to each individual atlas, resulting in as many deformation fields as there are atlases (Fig. 1.5). The same deformation fields can be applied to the corresponding atlas label maps to warp them to the target, resulting in candidate labels, which can be combined into the final segmentation with a label fusion algorithm. This was shown to outperform single atlas-based segmentation methods [169] and is commonly used for segmentation problems. It is in line with findings in the pattern recognition literature, where a combination of classifiers is in general more accurate than one individual classifier [38]. One of the first implementations was proposed in a series of papers by Rohlfing et al. [169, 168, 172, 173].
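The fusion of candidate labels just mentioned can be sketched in its simplest form, majority voting, where each voxel receives the label that occurs most often among the candidates (toy arrays; note that ties resolve to the smallest label in this sketch):

```python
import numpy as np

def majority_vote(candidates):
    """Fuse candidate label maps: each voxel takes its most frequent label."""
    stack = np.stack(candidates)                  # (n_atlases, *image_shape)
    labels = np.unique(stack)
    votes = np.stack([(stack == l).sum(axis=0) for l in labels])
    return labels[votes.argmax(axis=0)]           # ties: first (smallest) label

# Three candidate segmentations of a 2x2 image (label 0 = background)
c1 = np.array([[1, 0], [2, 2]])
c2 = np.array([[1, 1], [2, 0]])
c3 = np.array([[0, 1], [2, 2]])
print(majority_vote([c1, c2, c3]))  # [[1 1], [2 2]]
```

The more elaborate fusion schemes discussed later (weighted voting, STAPLE and its variants) replace this uniform vote with per-atlas or per-voxel weights.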
In their most influential work [169] they used an optimised free-form deformation algorithm to non-rigidly register

each of the atlas images to the remaining images, yielding one segmentation candidate from each. The final segmentation was determined by majority voting, which assigns to each voxel the label with the most occurrences at the corresponding location in the candidate images.

Figure 1.5: Multi atlas-based segmentation concept. Each individual atlas intensity image is registered to the target and each resulting deformation field is used to transform the corresponding label map to the target.

In [172] this simple vote rule, or majority voting, for the fusion of candidate labels was further extended with the expectation-maximisation label fusion algorithm by Warfield et al. [228], originally designed for the assessment of human labelling performance and the estimation of the true segmentation. Since most of Rohlfing's initial work was done on bee brains, further papers on atlas selection strategies corroborated his findings for the automatic labelling of the human brain [23, 123, 239], and the impact of the label fusion method was investigated by Heckemann et al. [98]. However, most top-performing MAS methods rely on the accurate alignment between each atlas and the target, which is usually achieved by estimating the pairwise nonlinear spatial correspondence between each atlas and the target at runtime. This step is typically the most time-consuming part of the MAS

algorithm. Some methods have focused on running speed, e.g. Coupé et al.'s patch-based approach [64] requires only linearly aligned atlases but achieved the lowest performance in Wu et al.'s evaluation [237]. Two alternative strategies for reducing the number of registrations and improving registration accuracy either employ a target-specific template (TST), more similar to the target than an average population template [59, 61, 151, 187, 235], or construct a graph structure of intermediate similar atlas images [46, 88, 96, 110, 118, 147, 202, 225, 234] to split a large, potentially imprecise registration into smaller, more accurate deformations. However, TSTs are usually constructed as probabilistic atlases, which does not allow the direct consultation of the warped individual atlas label maps. In most methods these probabilistic atlases are constructed from the locally or globally most similar individual atlas intensity images to the target by registering them iteratively to their average [59], which can increase the computational burden at runtime. Similarity measures such as normalised mutual information or the sum of squared differences are commonly used for the comparison of intensity images. More recent methods have shown improved accuracy for candidate selection and label fusion by using distances between images calculated in manifold space [75]. Decisions based solely on image intensity values could potentially be corrupted by artefacts or inhomogeneities. In the presence of anatomical pathology, the use of deformation fields for similarity comparison is more robust [59, 61, 151, 163]. Other methods split the given labels into smaller, more localised sub-regions with the goal of improving candidate selection and fusion [18, 126, 217, 178]. However, most methods split the regions randomly without taking contextually meaningful information into account.
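Atlas selection based on deformation fields rather than intensities can be sketched as ranking atlases by the distance between their deformation fields and the target's within a region. The synthetic fields, the names and the plain L2 metric below are illustrative only, not the similarity measure of any particular cited method:

```python
import numpy as np

def rank_atlases(target_field, atlas_fields):
    """Rank atlases by the L2 distance between deformation fields in a region.

    Each field has shape (ndim, *region_shape); a smaller distance means the
    atlas deforms more similarly to the target there. Returns atlas indices
    ordered from most to least similar.
    """
    dists = [np.linalg.norm(target_field - f) for f in atlas_fields]
    return np.argsort(dists)

rng = np.random.default_rng(1)
target = rng.normal(size=(3, 4, 4, 4))            # toy 3D field over a 4x4x4 region
# Three synthetic 'atlas' fields: the second is closest to the target by design
atlases = [target + rng.normal(scale=s, size=target.shape) for s in (0.5, 0.05, 1.0)]
print(rank_atlases(target, atlases))  # ranking starts with index 1
```

Such a ranking is what intensity artefacts cannot corrupt, since it never touches the image values themselves.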
One drawback of graph-based strategies is an observed decrease in segmentation overlap that is almost linear in the number of composed links between intermediate templates [188]. In general, eight commonly found components of MAS methods (Fig. 1.6) were identified by Iglesias and Sabuncu [105], with some of them being optional.

Figure 1.6: Components of MAS: (1) generation of atlases, (2) offline learning, (3) registration, (4) atlas selection, (5) label propagation, (6) online learning, (7) label fusion, (8) post-processing. Optional components are indicated by the dashed boxes. Figure from Iglesias and Sabuncu [105].

Given a set of atlases (1), some methods learn from or analyse the atlases offline, for example by constructing various templates or a graph structure (2). At runtime, this information is ready to use, which, in turn, improves computational efficiency. Once a target image is given, one or more atlas images are registered to it (3). Based on similarity criteria, the atlas(es) with the highest expected probability of segmenting the target accurately can be selected (4) and the labels propagated (5). At runtime, another learning step can be performed (6), which can provide additional information for the concurrent label fusion (7). Some algorithms additionally apply post-processing methods (8) to smooth the borders or use the fused labels to initialise an active contour or level-set algorithm. Over the years each of these areas has become a topic of research. In this thesis, we experimented with many approaches in some of those areas, and developed our own. The next section provides an outline of our method and our main contributions.

1.3 Our approach

1.3.1 Overview of our approach

Motivated by the necessity to segment brain structures in MRI scans both accurately and in a time-efficient manner, we have developed a novel MAS approach with the main aims of (1) reducing the number of nonlinear registrations at runtime, (2) providing state-of-the-art accuracy, and (3) producing robust results on diverse datasets. To reduce the number of registrations at runtime we use both an average population template and a TST. For clarification we introduce the following notation: a ROI refers to the anatomy of interest, for example the hippocampus; a label refers to the particular delineation of a ROI, such as the hippocampus as annotated in one atlas; a region refers to a collection of voxels in images or deformation fields without regard to a particular ROI; a cluster refers to one sub-division of a region with similar attributes. Offline, we construct a population template from a set of pre-segmented, affinely-registered atlas images and, by the same token, estimate a non-linear deformation field D between each atlas image Ĩ and the template G (Fig. 1.7.a). The same deformation fields are used to transform the corresponding atlas labels L to G and to find clusters that undergo similar or dissimilar deformation. Firstly, for each location we calculate the standard deviation of the corresponding displacement magnitudes of the deformation fields. We assume that a high standard deviation indicates dissimilar deformation within the dataset at this location, while a low standard deviation indicates a location with similar displacement across the dataset. Secondly, we calculate the union of the labels in the space of G, which provides the largest region covered by the labels of a ROI. For each union we cluster the corresponding voxels based on the standard deviations of the deformation fields and the voxel locations.
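This offline step can be sketched on synthetic data: compute, at each voxel, the standard deviation of the displacement magnitudes across the atlas deformation fields, then cluster the voxels of a label union on consistency and location. A plain k-means stands in for the actual clustering method here, and all sizes and variable names are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns a cluster index for every row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

rng = np.random.default_rng(0)
n_atlases, shape = 10, (6, 6)
# Displacement magnitudes of each atlas field; the right half deforms inconsistently
mags = rng.normal(1.0, 0.05, (n_atlases, *shape))
mags[:, :, 3:] += rng.normal(0.0, 0.5, (n_atlases, 6, 3))

std_map = mags.std(axis=0)                       # per-voxel consistency across atlases
rows, cols = np.meshgrid(*(np.arange(s) for s in shape), indexing='ij')
union = np.ones(shape, dtype=bool)               # pretend the label union covers everything
feats = np.column_stack([std_map[union], rows[union], cols[union]]).astype(float)
feats /= feats.std(axis=0)                       # crude feature scaling
clusters = kmeans(feats, k=2)
print(clusters.reshape(shape))
```

Including the voxel locations in the feature vector keeps the resulting clusters spatially compact rather than scattered.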
The resulting set of clusters is warped back onto each of the atlases, where we use the atlas labels to crop them. At runtime, we non-linearly register the previously unseen target image

Figure 1.7: Overview of our approach. The population template G is constructed from the individual atlases offline (a). At runtime the target T is registered to G. A manifold embedding is constructed from the deformation fields for each ROI (b), allowing the atlases to be ranked by their similarity to the target (c). A TST is constructed from the most similar atlas images warped onto T by composing the atlas-to-template and target-to-template deformation fields (d). T is registered to the TST and the atlas label maps are propagated to T, where they are fused, by composing the atlas-to-template, target-to-template and target-to-TST deformation fields.

T to the template to obtain the target deformation field (first runtime registration). For each cluster we then create a manifold embedding from the corresponding regions extracted from the atlas and target deformation fields (Fig. 1.7.b). The atlas regions are then ranked in order of similarity to the target region using their L2 distances in manifold space (Fig. 1.7.c). For each label the atlas images are then warped onto the target and a TST is constructed, to which the target is non-linearly registered (second runtime registration). We then project the label maps onto the target by composing the deformation fields between the atlases and the population template, between the population template and the TST, and between the TST and the target (Fig. 1.7.d). The resulting candidate segmentations are fused using

weights based on the rankings in manifold space, which allows areas of low and high probability to be identified. In addition to the probability values, the final decision on whether a voxel is part of a label is also based on the corresponding intensity value of the target, which is compared with the intensity distribution estimated from high-probability areas.

Contributions

In order to combine the efficiency of single-atlas-based methods with the accuracy of MAS methods, we propose a novel framework with the following improvements.

First contribution: Minimising the number of registrations without compromising accuracy

(1) We reduce the number of nonlinear registrations at runtime by constructing an average population template from all atlas images offline (see Chapter 2) and a TST from the atlas images most similar to the target online, requiring only two nonlinear registrations at runtime for each new target. (2) In contrast to other methods, the atlas images are non-linearly aligned with the target in an efficient way, allowing the construction of the TST from the atlases warped onto the target. This makes the TST more similar to the target than an average atlas or a target-specific probabilistic template constructed in average-atlas space. (3) The target is non-linearly registered to the morphologically very similar TST, and by composing the deformation fields, all atlas label maps can be propagated directly to the target and consulted during fusion.

Second contribution: TST construction using region-specific manifold embeddings of deformation fields

(4) We create a nonlinear manifold for each ROI rather than for the whole image, and from the deformation fields rather than from the intensity images

to provide the most accurate weights for the TST construction and label fusion (see Chapters 3-4). We evaluated two nonlinear manifold embedding strategies on a dataset with over 50 ROIs and used a more region-independent fusion method.

Third contribution: Splitting labels into meaningful clusters

(5) In order to automatically select relevant sub-regions for more accurate weight estimation and subsequent fusion, our last contribution is the clustering of pre-defined ROIs based on the variability of the deformation. This allows the characterisation of regions that require a high or low amount of deformation at the population level (see Chapter 5).

Outline

In the remainder of this thesis we discuss relevant methods from each of the components of the MAS concept (Fig. 1.6) and the impact of our approach and contributions. At the beginning of each chapter we provide a literature overview, followed by a presentation of our own approach and experiments. Chapter 2 outlines strategies for the spatial alignment of atlases with the target. We elaborate on the use of a population atlas for MAS and techniques for its creation. Chapter 3 provides an overview of computationally efficient MAS solutions. We present a fast and accurate MAS strategy that employs intermediate templates and manifold learning. Chapter 4 introduces and compares techniques for the fusion of candidate labels and the thresholding of probabilistic segmentations. Chapter 5 outlines strategies for offline learning to reduce the computational burden at runtime and/or improve segmentation accuracy. We present our approach of using localised sub-regions which are dynamically derived

from the given atlas labels. In Chapter 6 we present the segmentation results obtained by applying our method to brain MR scans of healthy subjects and patients with Tourette's syndrome.

Chapter 2

Anatomical atlas building and multi-atlas segmentation

2.1 Introduction

One of the essential parts of the MAS concept is the spatial alignment of the atlases with the target, which is achieved with image registration. In this chapter we provide an introduction to image registration methods, their use in MAS approaches and their effect on segmentation accuracy and runtime. In the remainder of this section we provide an overview of solutions for registration in MAS and commonly used pre-processing steps, before outlining image registration methods in more detail, together with techniques to create atlases and construct population templates.

Increasing the number of atlases

The quality of an image registration method depends on various factors, such as the characteristics of the image, the ROI to be segmented, and the underlying registration algorithm and its parameters. Different methods, or even the same method repeatedly applied with different parameters, yield different outcomes. As in the pattern recognition literature, where it has been shown that multiple classifiers yield improved and more stable results

over single classifiers [38], Rohlfing and Maurer [171] utilised these different outcomes to create a larger set of atlas-based classifiers. They repeatedly applied the same registration method with different sets of parameters to the same image, with the goal of improving classification accuracy. In related work by Doshi et al. [74], multiple different image registration methods were applied to the same images, leading to a set of different registration results. A larger set of classifiers increased the probability of finding suitable candidates and showed superior segmentation results over using a single method. However, while a larger set of different classifiers can improve classification accuracy, the larger number of pairwise registrations with one or even multiple methods represents the most time-consuming part of MAS.

Joint group-wise registration and segmentation methods

Traditionally, the transformation is estimated between each individual atlas and the target, independently of the registrations of the other atlases to the target. Groupwise methods [3, 97, 106, 110, 223] exploit the strong reliance of the segmentation outcome on the quality of the registration and, conversely, the potential improvement in registration quality from incorporating label information: a more precise deformation field for label propagation provides better segmentation results, and atlas labels provide important features to improve registration accuracy. For example, the generative probabilistic model by Iglesias, Sabuncu and Van Leemput [106] uses the consistency of voxel intensities within a ROI to simultaneously estimate the atlas-to-target deformation fields and the target labels. In contrast to other MAS methods, these deformation fields are considered as model parameters alongside the Gaussian distribution parameters of the voxel intensities. The most likely values are estimated by a variational expectation algorithm.
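As a toy illustration of the kind of intensity model at the core of such generative approaches, a two-class 1-D Gaussian mixture fitted with plain EM is sketched below. The cited method is far richer (deformation fields as model parameters, variational inference), so this shows only the mixture-model ingredient:

```python
import numpy as np

def fit_gmm_em(x, iters=50):
    """Fit a two-class 1-D Gaussian mixture with EM (toy stand-in for the
    intensity models used in generative segmentation)."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialisation
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: posterior probability of each class for every intensity
        lik = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update mixture weights, means and variances
        n = resp.sum(axis=0)
        pi = n / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n
    return pi, mu, var
```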
Although joint estimation can increase registration and segmentation accuracy, it can also have a negative impact on computational efficiency if all of the steps of the

method have to be re-applied for every new target image.

Patch-based methods with linear registration

One class of MAS methods, originally proposed with only linearly aligned atlases, is based on the non-local means method [41]. It was first utilised in MAS by Coupé et al. [64] and uses patches for the classification of target voxels (see Section 4.8). Patches around each voxel and around its neighbours in the atlas images are compared to the corresponding patch in the target image. Based on their similarity and the atlas labels at the specific location, a label can be assigned to the target voxel. Patches are selected within a specified search neighbourhood around each voxel, which relaxes the one-to-one correspondence between target and atlases and makes the alignment of the images less constrained. Although proposed to avoid time-consuming nonlinear registrations, the repeated search through all of the voxels can still take a considerable amount of time, and the approach achieved less accurate results than patch-based methods using nonlinear registrations. This led to various approaches combining nonlinear registration- and patch-based characteristics, which showed better results than the individual methods [13, 20, 25, 81, 174, 181].

Reducing the computational burden

The use of pairwise registrations, multiple pairwise registrations per atlas, group-wise registration, or patch-based schemes provides improved segmentation results at the cost of computational efficiency at runtime. This computational burden can be reduced by using a common coordinate system to spatially normalise all atlases offline. At runtime, a new target then requires only one registration to this space. In the literature, three main approaches to the selection of the coordinate system can be identified.
Firstly, a standardised individual brain image that is not part of the atlas set can be used for normalisation, as in Aljabar et al. [4, 5].

Secondly, one of the atlas images can be selected as the reference space [188, 217]. However, a standardised brain image or a single atlas image from the cohort might differ substantially from the remaining atlases or the target, and might lead to inaccurate registration results. The third approach uses a population template constructed from all atlases offline [12, 60, 61, 70, 81, 161, 187]. Although it does not guarantee similarity to the target, this approach can capture the variability within the atlas population and, consequently, provide more accurate alignment in the average reference space.

2.2 Pre-processing

Intensity inhomogeneity correction

A general problem in MRI is the smooth intensity variation that can occur across the whole image. It can be caused by static field inhomogeneity and RF coil nonuniformity, and depends on the pulse sequence and the field strength. In most applications, correction for inhomogeneities is performed as a post-processing step by image-processing tools such as SPM or FreeSurfer [67], in addition to hardware-related solutions in MR acquisition. In SPM [15], inhomogeneity correction is performed automatically when segmenting an image into grey matter (GM), white matter (WM) and cerebrospinal fluid (CSF). The tissue classification in SPM is based on prior probability maps and a mixture-of-Gaussians model with more than one Gaussian per class. The objective function of the model is extended with extra parameters to take intensity inhomogeneity into account. It uses a linear combination of low-frequency sinusoidal basis functions to model the spatially smooth nonuniformity. This integrated approach has been shown to outperform competing methods [87] and can be applied to images acquired with different pulse sequences.
However, because the correction is integrated into the segmentation algorithm, which in turn relies on prior knowledge about tissue types, it cannot be applied independently to anatomical structures other than the brain.

A dedicated method for intensity inhomogeneity correction is nonparametric nonuniform intensity normalisation (N3) [190]. Although it is conceptually similar to SPM [130], N3 works without tissue-class models, can be applied to pathological data, and is independent of the pulse sequence. The inhomogeneous field is assumed to blur the image, which reduces the high-frequency components of the intensity distribution. The objective can be stated as finding the smooth, slowly varying, multiplicative field that maximises the high-frequency content of the unblurred intensity distribution. This is achieved by iteratively sharpening the intensity distribution of the blurred image and estimating the smooth field that produces the blurred distribution, until the field converges. The smoothing operator used for N3 correction is a B-spline approximator. In subsequent work [214], this B-spline approximator was adapted to allow use with larger field strengths, faster execution times and higher levels of frequency modulation of the bias field, and to be more robust to noise. Additionally, it uses a multiresolution optimisation which iteratively corrects the bias field and estimates the residual bias field.

Skull-stripping

One of the first steps in the neuroimaging processing pipeline is often the removal of extrameningeal tissues, such as the eyes, fat and the dura, from whole-head MR images. Since the succeeding image analysis steps in the pipeline use the skull-stripped images for further processing, their accuracy is influenced by the quality of the skull-stripping method. Over- or under-segmentation of brain tissue can lead to errors such as imprecise estimates of tissue volumes or their spatial distribution. The proposed methods can be classified into two main categories: edge-based and template-based.
Edge-based techniques aim to find a boundary between brain and non-brain structures. They are employed by the brain surface extractor

(BSE) [186], which uses an anisotropic diffusion filter, a Marr-Hildreth edge detector, a morphological filter and region growing. The brain extraction tool (BET) [191] estimates a brain/non-brain intensity threshold from the image histogram and calculates a rough estimate of the centre of gravity and the radius of the brain. These estimates are used to initialise a spherical tessellated surface model. Each vertex of the surface iteratively undergoes an incremental movement, governed by an intensity term that finds a local threshold between brain and non-brain intensities while also taking the global thresholds into account. FreeSurfer [67] uses a hybrid approach [200]. Similarly to BET, general parameters such as the intensity thresholds for CSF and white matter, the centre of gravity and the brain radius are estimated. In addition, the global WM minimum location is determined, which is used as the main basin for the subsequent watershed algorithm. This yields a first estimate of the brain volume and is used to initialise a deformable surface algorithm. The active-contours method iteratively deforms a template while using the brain volume from the watershed algorithm as a mask. The resulting surface is compared to a statistical atlas created from manually segmented images, which allows further refinement in case of segmentation errors. More recent algorithms also use convolutional neural networks [121], which can be trained on, and applied to, different modalities and pathologically altered head scans. The second group of methods uses a deformable atlas with expert delineations, which is registered to the target. This makes it more robust to intensity inhomogeneity and different acquisition protocols. ROBEX [104] is a hybrid method that combines the discriminative attributes of a Random Forest classifier, which detects the brain boundary, with the generative attributes of a point distribution model.
Offline, the Random Forest classifier is trained on a set of features derived from the voxel intensities after intensity inhomogeneity correction and intensity normalisation. The final classifier yields a probability volume, which indicates the probability of each voxel being part of the brain boundary. A point distribution model is constructed from a set

of training images with corresponding landmarks. The result is a model of possible shapes which can be deformed according to the active shape model explained in Section. At runtime, a new target undergoes the same pre-processing steps: intensity normalisation, feature calculation and voxel classification. Conceptually similar to single-atlas-based segmentation, the point distribution model is fitted to the probability volume of the target with a non-linear deformation to refine the alignment. The final step involves the construction of the binary volume. Depending on the registration approach used, MAS-based methods can be computationally more expensive and time-consuming. For example, computational time is a limitation of the approach of Doshi et al. [73]. They used DRAMMS [150] for the nonlinear alignment of a selected study-specific template to the target because of its robustness to intensity inhomogeneities, noise, and outliers or missing anatomical regions. Eskildsen et al. presented a faster alternative [77], which is based on the nonlocal-means patch-based approach and requires only linearly aligned atlases.

Tissue classification

In SPM, spatial image normalisation (i.e. registration to a standard-space template), tissue classification and bias correction are combined in a unified model [15]. It is assumed that the distribution of voxel intensities in each predefined tissue class is normal, or a mixture thereof, and can be described by its mean and variance (mixture of Gaussians). The spatial probabilities of GM, WM and CSF are provided by the ICBM MNI 152 atlas in a normalised stereotactic space, and intensity inhomogeneity, i.e. a smoothly varying field, is assumed to be present. Due to the many unknown variables, an iterative algorithm with initial estimates, given by the probability maps and a uniform bias field, was implemented.
Each iteration starts by calculating the cluster parameters from the probability maps and the bias field. Then the probability maps and the bias field are updated until

convergence. In each iteration, the probability maps change slightly towards the distribution of the target image. In detail, the cluster parameters are calculated by computing the number of voxels belonging to each cluster, from which the weighted mean of the image voxels and the variance of each cluster are obtained. From these cluster parameters, the prior probability images, the bias-corrected field and the new label probability maps can be constructed following Bayes' rule. The smoothly varying bias field is modelled as a linear combination of low-frequency discrete cosine transform basis functions. FSL's FAST tool [244] incorporates both a hidden Markov random field (HMRF) model and an expectation-maximisation (EM) algorithm. Compared with models fully specified by the histogram (finite mixture models), the MRF model has the advantage of taking spatial, and in turn structural, information into account. The spatial information is incorporated as a contextual constraint via a neighbourhood system. The HMRF extends the MRF in that the states of the random field cannot be directly observed; however, given a particular neighbourhood configuration and its parameters, they can be estimated from a known conditional probability distribution. The algorithm iterates through three EM steps: estimation of the class labels by MRF and maximum a posteriori (MAP) inference, estimation of the bias field by MAP, and estimation of the tissue parameters by maximum likelihood (ML). It is initialised with the mean and standard deviation of each class (GM, WM and CSF), extracted from the histogram with Otsu's thresholding method [149]. Since MRFs and HMRFs are not the main focus of this thesis, we refer the reader to the literature cited above for more details. Like FAST, ANTs Atropos [22] is also based on a finite mixture model (FMM).
However, an FMM assumes independence between voxels and does not take spatial and contextual information into account. Atropos uses prior probability models to incorporate this information and provides three options. Similarly to FAST, an MRF can be used. Another option is to register

and warp labelled templates to the target to create a prior probability map. The third option in Atropos allows the user to label (sparse) points of the image, which provides an initialisation for the subsequent EM optimisation. In the E-step, the label estimates for each voxel are updated based on the current estimates of the model parameters by computing the lower bound of the objective function. In the M-step, the estimates of the model parameters (mean, variance) are updated by optimising this bound.

2.3 Image registration

Medical image registration has been an active area of research for over 20 years, with the goal of estimating the spatial correspondence between anatomical structures in one image and the corresponding anatomical structures in a second image. The result is a transformation which indicates the correspondence between voxels across both images. This correspondence can be established by superposing landmarks or by matching voxel intensities. Landmark-based methods look for the same features in both images and try to align them with a transformation; intensity-based methods try to maximise a similarity measure. For both approaches it is important to define valid constraints that govern the search for the best match [85]. Although a one-to-one mapping between the voxels of two images would yield a perfect registration, in practice it is very difficult to achieve because of intensity inhomogeneities, partial-volume effects, tissue motion, pathology, artefacts, or simply because anatomy varies between subjects [65]. The evaluation of registration quality can be based on the use of labels [122] or of features automatically derived from the input images [193].
Both approaches are prone to errors, because manually annotated labels might be inconsistent due to inter- and intra-subject variability, and automatic features might not be correctly detected in complex structures with high anatomical variability such as the brain. An image registration method based on intensities usually consists of three

main components:

- a similarity or error measure
- an optimisation function
- a transformation model

In the following sections an outline of these key elements is provided. For further details the reader may refer to [196].

Similarity measure

Similarity measures can be broadly categorised into three main groups: feature-based, voxel-based and hybrid methods. Feature-based measures use geometric landmarks, including points, lines or surfaces, and aim to decrease the (typically Euclidean) distance between corresponding features. These landmarks can be placed manually or detected automatically. Consequently, the registration accuracy largely depends on the ability of the feature extraction method to find corresponding landmarks in the images. Voxel-based measures operate on image intensity values, with the goal of maximising the similarity of corresponding intensity patterns. Due to this dependency, we can distinguish between intra- and inter-modal registration. Common similarity measures for images of the same modality include the sum of squared differences (SSD), the sum of absolute differences (SAD) and cross-correlation (CC). Multi-modal image registration methods usually employ probability-based measures, e.g. mutual information, which compare images in terms of information theory (entropy), with the goal of maximising the amount of information one image yields about the other [55, 138, 230]. Another approach to inter-modal registration employs a patch-based method (see Section 4.8) to measure similarity between patches in a local neighbourhood of an image and detect preserved anatomical structures [101].
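The intra-modal measures (SSD, CC) and a histogram-based mutual information can be written down in a few lines. These are the textbook formulations, not the implementation of any particular registration package:

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences: lower is more similar (same modality)."""
    return float(((a - b) ** 2).sum())

def ncc(a, b):
    """Normalised cross-correlation: 1.0 indicates a perfect linear match."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information for multi-modal comparison."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                     # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)           # marginal of a
    py = pxy.sum(axis=0, keepdims=True)           # marginal of b
    nz = pxy > 0                                  # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

Note that NCC is invariant to any linear intensity rescaling, whereas SSD is not, which is one reason CC-type measures are preferred when intensity ranges differ between scans.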

Transformation

The transformation is the mathematical model that describes the geometric distortions applied to an image. Based on the chosen model and its mathematical constraints, the transformation can gain properties such as inverse consistency, symmetry, topology preservation or diffeomorphism. Inverse-consistent and symmetric methods ensure that the path along which an image is transformed into a second image is unbiased by the computation direction or the space of either image; consequently, the path used when registering image A to image B is the same as when registering image B to image A. Another desirable property for some applications is the transformation model's ability to preserve topology. By creating a continuous and locally one-to-one mapping with a continuous inverse, folding of the grid onto itself, which would result in anatomically implausible structures, can be prevented. This can also be achieved with diffeomorphisms [51, 211], which are transformation functions that are differentiable (smooth) and invertible with smooth inverses. One linear model is the rigid-body model, which allows only rotations and translations (6 DOFs). It is a special case of the affine model, which additionally allows scaling and shearing (12 DOFs). These models are commonly used to correct for global shape differences but are too constrained to describe local shape changes. In contrast, nonrigid or nonlinear registration methods have more DOFs and can be applied to model local tissue deformations, align anatomical structures that vary within a population, or quantify change over time. Some of the most commonly used registration methods are based on physical models and interpolation theory [196]. The former group includes elastic body models, diffusion models (Demons approaches) and flows of diffeomorphisms, while free-form deformations are an example of the latter.
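The linear models can be made concrete with homogeneous coordinates. The sketch below composes a 12-DOF affine from rotation, anisotropic scale, shear and translation (one common parameterisation among several); the rigid-body model is the special case of an orthonormal rotation with unit scale and no shear:

```python
import numpy as np

def affine_matrix(rotation, scale, shear, translation):
    """Compose a 12-DOF affine transform as a 4x4 homogeneous matrix.

    rotation: 3x3 matrix, scale: 3 factors, shear: 3x3 matrix,
    translation: 3 offsets (9 linear DOFs + 3 translation DOFs).
    """
    A = rotation @ np.diag(scale) @ shear  # 3x3 linear part
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = A, translation
    return T

def transform_points(T, points):
    """Apply a 4x4 homogeneous transform to an (n, 3) array of coordinates."""
    homog = np.c_[points, np.ones(len(points))]
    return (homog @ T.T)[:, :3]
```

Because a rigid-body transform has an orthonormal linear part, it preserves distances between points, while a general affine does not.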

Elastic body models assume that the image has properties similar to an elastic solid, which can be modelled with the Navier-Cauchy partial differential equation [26]. The equation is governed by internal and external forces: based on the chosen similarity measure, the external force causes the deformation, while the internal force models the stress within the elastic material. Linear elastic registration methods cannot handle large deformations, because the internal force increases proportionally with the external force, making the model valid only for small displacements. For large deformations, nonlinear elastic models have been presented [153, 159]. Demons approaches model the transformation as a diffusion process [205]. The method considers the object boundaries in one image as semipermeable membranes through which the object in the second image can diffuse. Points on the membrane are called demons and decide whether a point of the moving object should be pushed inside or outside, by iteratively computing each demon's force and updating the transformation based on all these individual forces. One way to estimate the forces is with the optical flow equation. Thirion's Demons approach has formed the basis for a whole group of diffusion-based methods, most of which follow the same iterative scheme of estimating the demons' forces and updating the transformation accordingly. The original model did not ensure diffeomorphic transformations, which were later introduced by Vercauteren et al. [218]. Flows of diffeomorphisms estimate a displacement by integrating a velocity field over time with the Lagrange transport equation [51]. The velocity field can vary over time, which allows the estimation of large deformations as a composition of a series of small deformations [29].
Due to the integration of the velocity field along a path, the deformation field is necessarily diffeomorphic, which is why this framework is also known as large deformation diffeomorphic metric mapping (LDDMM). The shortest path in the space of diffeomorphisms is called the geodesic path, which allows the comparison of

distances between points or images. DARTEL (Diffeomorphic Anatomical Registration Through Exponentiated Lie algebra) [14] is based on diffeomorphisms and allows rapid computation of the deformations due to its use of a constant velocity field and a full multigrid method which recursively traverses the scales. The use of a stationary velocity field, in which the whole movement of a point over a series of time steps is integrated into a single fixed velocity field, does not allow each point in the flow field to be related to the corresponding point in the brain at each time step. Starting by calculating the first and second derivatives of the objective function for a time-varying velocity field, the method thereafter constrains it to a constant velocity. This constraint limits the achievable diffeomorphic configurations. ANTs-SyN: Avants et al.'s symmetric image normalisation method (SyN) [21] extends the initially asymmetric LDDMM by guaranteeing that the geodesic mapping between two images is symmetric for any chosen similarity measure, not only for intensity differences. This is achieved by decomposing the diffeomorphism that deforms one image into the other into two parts. Each image contributes equally to the whole deformation by constraining the two sub-diffeomorphisms to meet at half of their respective integration times, which is equivalent to half of the geodesic path. Local cross-correlation was chosen as the similarity measure, due to its robustness to intensity variations. Free-form deformation (FFD) methods use a mesh of control points which can be manipulated with spline functions to deform an object in the image. Different types of splines have been used; cubic B-splines [177], for which adjusting the location of one control point affects only a local neighbourhood of points, have become the most popular.
FFDs have been extended with properties such as topology preservation and diffeomorphisms [176], symmetry [148] and inverse consistency [79].
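In one dimension, the cubic B-spline FFD can be sketched as follows. The uniform control-point grid and indexing convention are standard, but the code is an illustration rather than any package's implementation; note how only the four control points around x contribute, which is the locality property mentioned above:

```python
import numpy as np

def bspline_basis(u):
    """Cubic B-spline basis weights for local coordinate u in [0, 1).

    The four weights always sum to one (partition of unity).
    """
    return np.array([(1 - u) ** 3,
                     3 * u ** 3 - 6 * u ** 2 + 4,
                     -3 * u ** 3 + 3 * u ** 2 + 3 * u + 1,
                     u ** 3]) / 6.0

def ffd_displacement(x, control, spacing):
    """1-D free-form deformation: displacement at x from a control-point grid.

    `control` is an array of control-point displacements on a uniform grid;
    assumes spacing <= x < (len(control) - 3) * spacing so the four
    neighbouring control points exist.
    """
    i = int(np.floor(x / spacing))        # index of the containing cell
    u = x / spacing - i                   # local coordinate within the cell
    weights = bspline_basis(u)
    return float(weights @ control[i - 1:i + 3])
```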

In a comparison of 14 nonlinear registration methods by Klein et al. [122], DARTEL and SyN were amongst the highest-ranked methods in terms of overlap and distance measures.

Optimisation

In most methods, the transformation is iteratively refined by estimating new transform parameter values that optimise the similarity measure of the images in the next iteration. This gradual improvement and subsequent similarity assessment are repeated until an optimum is found. The overall goal of the optimisation algorithm is to find the global optimum. One way to find an optimum is with gradient-based techniques (gradient descent), which, however, converge to a local optimum. A starting estimate close to the global optimum can be provided by reducing the capture range with a multi-scale method: by down-sampling, and then repeatedly up-sampling and registering the images, the transformation at each resolution level can be used as an initial estimate for the next registration step.

2.4 Anatomical atlases and label probability maps

Due to image artefacts such as intensity inhomogeneities, partial-volume effects, or similarities in the intensity distributions of different anatomical structures, prior information is crucial for the automated segmentation of MR brain images. In atlas-based segmentation methods, this prior information is provided in the form of a training dataset annotated manually by experts, which makes atlas-based segmentation a supervised learning method. The first step in atlas-based segmentation is the building of one or more anatomical brain atlases and their corresponding labels. It is crucial to clearly define what is considered a brain atlas, as, depending on the construction and application, scans from individual subjects and probabilistic

templates are often referred to as atlases in the literature.

Single-subject atlases

In earlier publications an atlas represented the annotation of anatomical structures and was called a topological, single-subject or deterministic atlas, one of the most well-known examples in medicine being the Talairach atlas [201]. Talairach's main work aimed at describing and locating anatomical structures not only based on their shape but also on their location relative to each other. In his coordinate system he used the line from the anterior commissure to the posterior commissure for alignment and additionally introduced a parallel and orthogonal grid proportional to the skull size, which already provided the basis for the reconstruction of 3-dimensional volumes from 2-dimensional projections and is still widely used. With advances in imaging methodologies, more complex atlases have been developed. One representative example of a deterministic whole-body atlas is the Visible Human Project: CT and MRI images were obtained in addition to cryosection images from whole female and male human cadavers, resulting in a complete digital image dataset of the human body. In a similar project, the Computerised Brain Atlas (CBA) was constructed from cryosections with the main goal of providing a brain template for normalisation with positron emission tomography or other imaging modes [94]. It contains 3-dimensional annotations of the brain surface, the ventricular system, the cortical gyri and sulci, as well as the Brodmann cytoarchitectonic areas. More recent atlases have been developed without the need for cryogenic images; instead they have been constructed solely from non-invasive imaging modalities such as MR or CT scans [27, 117]. A high-resolution brain atlas was constructed by the McConnell Brain Imaging Centre and used for the

BrainWeb database, which is a simulator for the creation of realistic MRI data volumes [58]. With the intention of creating an average atlas with high SNR and structure definition, the Colin 27 brain atlas [102] was constructed from 27 linearly aligned scans of the same subject. After bringing it into stereotaxic space, it has also been frequently used as a template for alignment.

Population atlases

Due to the large anatomical inter-subject variability, there is no single brain anatomy scan capable of representing a whole population. Probabilistic atlases, constructed from a set of images, have emerged as the tool of choice in the representation, analysis and interpretation of population-based imaging studies. Population atlases provide a common 3D (stereotaxic) space for normalisation and a reference for alignment. By mapping images from a cohort into the same space, population atlases provide a probabilistic map of the spatial location of structures of interest and an estimate of their shape (intra-population variability), allow the estimation of morphological differences between distinct populations (inter-population differences) and provide support in the segmentation of new target images. The use of probabilistic atlases has multiple advantages over individual-subject atlases. Since a probabilistic atlas represents the average shape and appearance of the population, it provides a space for normalisation for both the individual images of the population and the targets. For the individual images, deforming to their average requires the least amount of deformation. Consequently, for targets with morphological properties similar to the population, the quality of the registration can be increased, resulting in more precise deformation fields.
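Once binary label maps from a cohort have been aligned to a common space, the probabilistic map of a structure's spatial location is, in its simplest form, just their voxel-wise average. A toy NumPy sketch with synthetic masks (not real atlas data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten aligned binary label maps of one structure (toy 3-D volumes).
masks = []
for _ in range(10):
    m = np.zeros((16, 16, 16))
    m[4:12, 4:12, 4:12] = 1                  # common core of the structure
    m[rng.integers(3, 13), 3, 3] = 1         # small per-subject variation
    masks.append(m)

# Voxel-wise average: P(x) = fraction of subjects whose label covers voxel x.
prob_map = np.mean(masks, axis=0)

print(prob_map[8, 8, 8])                                 # 1.0 inside the core
print(prob_map.max() <= 1.0 and prob_map.min() >= 0.0)   # True: valid probabilities
```

Voxels covered by every subject receive probability 1, while voxels covered only by some subjects receive fractional values, directly encoding the intra-population variability described above.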

Probabilistic atlases can be constructed for different populations, such as cohorts of different age, gender or pathology [241]. For an individual target image, the most suitable probabilistic atlas, i.e. the one requiring the least amount of deformation, can be selected for spatial normalisation [162, 169, 239]. At the population level, probabilistic atlases can be compared to find morphological differences. Three-dimensional spatial atlases can incorporate information from additional target images later on and can be extended into the time domain, which makes it possible to compare diseases at different points in time or to investigate developmental disorders [90, 156, 184]. The main goals can be defined as finding a realistic representation that captures the structural and functional variability of a cohort's individual scans and allowing the quantitative estimation of accuracy and errors [144].

One of the biggest attempts to create an atlas was made by the International Consortium for Brain Mapping (ICBM) in a worldwide collaboration of imaging centres. Demographic, clinical, behavioural and imaging data of a total of 7000 subjects were collected, including genetic information from approximately 80% of the subjects. The processing steps for the construction of probabilistic brain atlases from 152 and 452 subjects included 3-dimensional intensity non-uniformity correction and intensity normalisation. In order to create an average atlas whose space and intensities are unbiased by any single subject, the atlases were constructed from the average position, orientation, scale and shear of all individual subjects [144, 145, 146]. The Laboratory of NeuroImaging (LONI), also part of the ICBM, constructed a probabilistic brain atlas from 40 manually delineated MRI scans of healthy volunteers. The delineation was performed for 50 cortical structures, 4 subcortical structures, the brainstem, and the cerebellum.
These label maps were brought into a common space, which allowed the calculation of probability density functions for each structure [185]. Similarly, the Internet Brain Segmentation Repository (IBSR)¹ comprises 18 T1-weighted MR images of the whole brain and their corresponding manual segmentations, with 32 labels per map. The Non-rigid Image Registration Evaluation Project (NIREP) [49] provides a set of 16 topological atlases of healthy subjects, with 32 annotated grey matter regions each, for the evaluation of software tools such as non-rigid image registration algorithms. In order to investigate differences in regional and total brain volumes between preterm and term-born infants, a set of manually segmented topological atlases called ALBERT was constructed [92]; the 15 infant scans were delineated into 50 regions each.

¹ The MR brain data sets and their manual segmentations were provided by the Center for Morphometric Analysis at Massachusetts General Hospital and are available at

While most of the previous projects focused on the construction of atlases from healthy subjects, there have also been datasets for different pathological cohorts. A representative example includes topological atlases of patients with Alzheimer's disease from the Open Access Series of Imaging Studies (OASIS) [141]. It contains a total of 416 subjects, of which about 100 were diagnosed with Alzheimer's disease (AD). A subset of images from healthy controls was used for the MICCAI 2012 [125] and MICCAI 2013 [17] challenges on MAS. An even larger repository of clinical and imaging data is provided by the Alzheimer's Disease Neuroimaging Initiative (ADNI) [109], containing data from over 1880 subjects. However, manual segmentations are only available for a subset. In addition to cross-sectional datasets, both repositories also provide longitudinal data from a part of their cohorts, including non-demented and demented

subjects with scans from at least two visits [140]. By adding time to the spatial 3D domain, 4D atlases can be created. The time domain allows growth or disease progression to be captured and predictions to be made. The cross-sectional and longitudinal information allows the classification and comparison of healthy subjects with subjects with mild cognitive impairment and AD. In general, population atlases also allow classification by other attributes such as age or sex, leading to the construction of atlases based on group-specific criteria, which makes them flexible enough to incorporate new images later on. Furthermore, they facilitate the comparison of parameters such as volume and shape of anatomical structures at the population level.

Unbiased probabilistic atlas construction

Since no single-subject brain scan is capable of representing a whole population and its large anatomical inter-subject variability, probabilistic atlases constructed from a set of images have emerged as the tool of choice for population-based imaging studies. The optimal unbiased probabilistic atlas, i.e. the one most representative of the dataset, is the one that requires the minimum amount of deformation to all individual atlas images of the dataset. In the simplest case, a probabilistic atlas can be computed as the mean image of a collection of images, thereby representing the average anatomy of the population. Within the context of this definition, a population atlas refers to a 3D average of cross-sectional data without any notion of time or age of the subjects. In early methods, the anatomy of a single subject was used as a template for normalisation, and all other population images were brought into this space. A limitation of this approach is that the choice of template introduces a bias towards the shape of its anatomy.
One solution towards the use of unbiased brain atlases for determining differences in anatomical patterns was proposed by Thompson and Toga [209]. A target image is registered to all existing atlases and the resulting deformation fields are used to

determine the probability distribution of corresponding points. This allows the calculation of the likelihood of a target region being in a similar spatial location to the corresponding atlas brains. Because the target is warped to all atlas images, all of them contribute equally. In order to avoid the registrations between each atlas and a new target, an unbiased atlas can be calculated by simultaneously registering all of the subjects to their average coordinate system [33]. This makes the selection of a reference subject unnecessary. In this approach the computation of the arithmetic average of small displacement fields (vector fields) is well defined. This is not the case in the large-deformation setting, where we operate in the high-dimensional group of diffeomorphisms. Based on the theory of large deformation diffeomorphisms, which allows the estimation of the displacement by integrating a velocity field, a distance or similarity measure is defined and the problem can be stated as finding the image and associated coordinate system that requires the least amount of deformation to align with every subject of the population [113]. The solution of this minimisation problem can be estimated with a greedy fluid algorithm by iteratively updating the transformation for each image and updating the template as the voxel-wise arithmetic mean. The final deformation can then be composed of the transformations obtained at each iteration. One drawback of this method is that the arithmetic mean of different tissues that are not in correspondence might lead to biologically implausible structures. In order to solve this problem, SPM [14, 16] first requires approximate alignment of the images to pre-defined tissue probability maps. Once in the same space, tissue probability maps for GM, WM and CSF can be constructed for each of the individual images.
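The alternating scheme of such group-wise methods (register every image to the current average, then recompute the average) can be illustrated with a deliberately simplified 1-D toy, where "registration" is an exhaustive search over integer shifts; this is only a stand-in for the fluid registration used in practice:

```python
import numpy as np

def best_shift(img, ref, max_shift=5):
    """Exhaustive search for the integer shift minimising SSD to the reference."""
    shifts = range(-max_shift, max_shift + 1)
    costs = [np.sum((np.roll(img, s) - ref) ** 2) for s in shifts]
    return list(shifts)[int(np.argmin(costs))]

# Three copies of the same 1-D 'anatomy' at different positions.
base = np.zeros(32); base[10:16] = 1.0
images = [np.roll(base, s) for s in (-2, 0, 3)]

template = np.mean(images, axis=0)          # initial blurry average
for _ in range(5):                          # alternate registration and averaging
    aligned = [np.roll(im, best_shift(im, template)) for im in images]
    template = np.mean(aligned, axis=0)

# After convergence every image matches the template exactly (toy case).
print(all(np.sum((np.roll(im, best_shift(im, template)) - template) ** 2) < 1e-9
          for im in images))  # True
```

As in the greedy fluid algorithm, no subject is picked as the reference: the template emerges as the average coordinate system of the group, and the initial blurry mean becomes crisp as the alignment improves.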
These individual subject maps are iteratively registered to their average followed by the construction of a new average. The initial template is calculated as the intensity average of the GM and WM maps. The similarity measure for the registration is based on the SSD, which considers the difference between the likelihood of the individual GM probability maps and the GM mean, between the likelihood of

the individual WM maps and the WM mean, and the likelihood of the rest (1 − prob(WM) − prob(GM)). Due to this procedure, the template construction and registration are not directly based on the image intensity values but rather on the tissue probability maps. Compared to the first mean, which is very smooth and blurry, the template becomes crisper with every iteration. The registration of each image provides a diffeomorphic deformation field to the average. On the one hand, the initial alignment to the pre-defined tissue probability maps and the concurrent use of the maps does not introduce false structures when averaging them. On the other hand, this initial alignment restricts the anatomical space, i.e. the space used for normalisation is not the average of the individual images.

In the Advanced Normalisation Tools (ANTs) package, a method for symmetric group-wise normalisation (SyGN), independent of any individual's space and unbiased by any individual's shape, was proposed [23]. The template creation employs the previously described SyN algorithm for registration. Diffeomorphisms with symmetric properties are computed for the alignment of the individual images, and the algorithm does not require an initial template estimate. Thereby, both shape and appearance of the constructed template are unbiased by the individual images. The goal is stated as finding the template and transformations that increase image similarity and reduce the path length of the diffeomorphisms, as stated by an energy function. The diffeomorphisms that map a template to the individual images, and the template shape as given by another diffeomorphism, are initialised as the identity. Then each part of the energy function is optimised while the rest is kept constant. First the diffeomorphisms between each image and the fixed template are re-estimated. Then the template appearance is updated with fixed shape and diffeomorphisms.
Finally, the template shape is updated. The template used in the first iteration is the affinely registered average appearance image [194]. In more detail, in one iteration the set of transformations between each image and the template is computed. Then the template appearance is iteratively calculated by deforming each of

the images with their corresponding inverse transformation. The gradients of the similarity term are calculated for each deformed image and averaged, which is followed by an update of the template. The shape is updated by averaging the diffeomorphisms between the template and each image. The new diffeomorphism can be applied to the template, deforming its shape further towards the new average.

Creation of label maps

Label maps are usually acquired by means of accurate manual delineation, which is a very time-consuming task but has to be done only once for each image and can then be re-used without further manual interaction. The pre-processing steps for the creation of the LONI LPBA40 population atlas [185] require the alignment of the individual images to delineation space, for which the ICBM MNI-305 space [78] was chosen. The registrations included the identification of ten landmarks in each image and in the MNI brain atlas, followed by rigid-body translations and rotations. The transformations with six degrees of freedom were applied to the corresponding images to bring them into MNI space. Bias fields were corrected with the N3 method and the brain was extracted with FSL BET. In the labelling process an expert assigns a pre-defined integer value to every voxel of pre-defined ROIs. Each value indicates the presence of a particular neuroanatomical structure at the specific location. In order to increase the overall accuracy and keep inter- and intra-observer variability to a minimum, trained experts follow a set of protocols that specify the structure to be delineated, a single plane of section (coronal, sagittal or axial) in which the structure should be labelled, and details about grey matter, white matter and sulci that should be included or excluded. These instructions are based on anatomical landmarks and provide visual samples with 3D cross-sectional views.
Additionally, the segmentation steps, with their start and end points, are given. In the LPBA40 the grey matter was included in cortical structures. Pairs of experts were

assigned to each anatomical structure and performed the delineations independently. The calculated volumes of both experts were compared for each structure using the corresponding Intraclass Correlation Coefficient (ICC), given as ICC = (MSB − MSW) / (MSB + MSW), with MSB being the mean squares between the volume measurements of the structure and MSW the mean squares of the volume measurements within each structure. In order to become a certified rater, the ICC scores had to exceed a certain threshold, or rater pairs were retrained until a reliable result was achieved. In addition to the ICC, the Jaccard similarity index [108] was calculated. In contrast to the ICC, which assesses the inter-rater variation of the volumes, the Jaccard index provides a measure of overlap of the volumes (Section 2.5). Once a reliable ICC and overlap were achieved, the same structure was segmented in the remaining subjects. For voxels in overlapping or missing areas between structures, pairs of raters decided on and assigned one label.

Similar to the LONI LPBA40, the Harmonized Hippocampal Protocol (HarP) [35], which has been applied to a subset of the ADNI dataset consisting of 1.5T and 3T images, includes a series of pre-processing steps to align the subjects in the same space. Each hippocampus was segmented by five different raters, and ICC and Jaccard overlap scores were calculated from their segmentation results on an independent sample of 20 subjects. The segmentation protocol comprises a set of guidelines for the delineation of the outer contour, followed by filling the area encapsulated by it. This is done in the original space of the images with sub-voxel accuracy. In order to refine the contour, the resolution of the initial grid dimensions is increased by a factor of 10. These high-resolution images can then be resized into the initial low-resolution space.
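For two raters measuring the volume of the same structure in n subjects, the ICC above reduces to a few lines of code: MSB is the mean square between subjects (computed from the pair means) and MSW the mean square within subjects (computed from the pair differences). A sketch with made-up volumes:

```python
import numpy as np

def icc_two_raters(a, b):
    """ICC = (MSB - MSW) / (MSB + MSW) for paired measurements by two raters."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    pair_means = (a + b) / 2.0
    grand_mean = pair_means.mean()
    msb = 2.0 * np.sum((pair_means - grand_mean) ** 2) / (n - 1)  # between subjects
    msw = np.sum((a - b) ** 2) / (2.0 * n)                        # within subjects
    return (msb - msw) / (msb + msw)

# Hypothetical hippocampal volumes (mm^3) measured by two raters.
rater1 = [3100, 2950, 3300, 2800, 3050]
rater2 = [3120, 2900, 3280, 2850, 3070]
print(round(icc_two_raters(rater1, rater2), 3))   # close to 1: high agreement

# Identical measurements give perfect reliability.
print(icc_two_raters(rater1, rater1) == 1.0)      # True
```

Small inter-rater differences leave MSW close to zero, so the ratio approaches 1; large disagreement drives it towards 0 (or below, when within-subject scatter exceeds between-subject scatter).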
Due to the use of interpolation methods, the resulting segmentation consists of probabilistic values between 0 and 1. For the final segmentation the images were binarised, with voxel values of less than 50% being discarded. The segmentation protocol of the NIREP dataset [49] does not provide details about the raters or their training, but refers to the literature for the delineation of neuroanatomy and the description of its variation [7, 8]. The segmentation was performed in 2D, which makes for smooth delineations in the segmentation plane but rough edges when viewed from other directions. In order to avoid boundary errors, the initial segmentation, which also contained white matter, was restricted to grey matter alone, leading to ROIs with smooth boundaries on the surface and between grey and white matter.

2.5 Tools, datasets and evaluation strategies

For the implementation and validation of the framework presented in this document the following tools and datasets were used. Images were re-oriented to match the orientation of MNI space using FSL's reorient function. Skull-stripping was performed with BET, ROBEX or Freesurfer. Based on visual evaluation, Freesurfer showed the best overall performance and was subsequently used for all datasets. Images were affinely registered to the MNI-ICBM152 brain atlas to bring them into a common space with uniform dimensions and voxel size. The nonrigid registration and population template construction were performed with SPM's DARTEL or ANTs' SyN. Due to the higher computational complexity of SyN, DARTEL was used for most of the evaluation, as specified in the following chapters. Extensive evaluation was conducted on the LONI-LPBA40, IBSR, NIREP, ADNI with HarP protocol, and MICCAI 2012 and MICCAI 2013 datasets (Table 2.1). All datasets provided individual atlases, each consisting of an intensity image and a label map. Each of the datasets was randomly divided into atlases and test images for cross-validation. From each atlas subset a population atlas was created. The corresponding test images were sequentially nonlinearly registered to the population atlas. This is in stark contrast to other recent methods [31], which used both atlases and test images for the construction of the population atlas.
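The random division of each dataset into an atlas subset and a disjoint test subset can be sketched as follows (the subject identifiers are hypothetical; the actual split sizes are given per dataset in the following chapters):

```python
import random

def split_atlases_tests(subject_ids, n_atlases, seed=42):
    """Randomly split subjects into an atlas subset and a disjoint test subset."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)        # seeded shuffle, reproducible split
    return ids[:n_atlases], ids[n_atlases:]

subjects = [f"subj{i:02d}" for i in range(40)]   # e.g. a 40-subject dataset
atlases, tests = split_atlases_tests(subjects, n_atlases=30)

print(len(atlases), len(tests))                  # 30 10
print(set(atlases).isdisjoint(tests))            # True: no subject in both sets
```

Keeping the two subsets disjoint is what makes the evaluation realistic: the population atlas is built once from the atlas subset only, and every test image is treated as an unseen target.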
However, considering that this introduces a bias towards the test images in the appearance of the population atlas and in the accuracy of the corresponding deformation fields, the population atlas would have to be reconstructed

Dataset        Subjects  Regions      Scanner                           Protocol
LONI-LPBA40    40        56           —                                 —
ADNI-HarP      135       hippocampus  Siemens, Philips, GE (1.5T & 3T)  MPRAGE
IBSR           18        32           GE Signa (1.5T)                   SPGR
NIREP          16        32           GE Signa (1.5T)                   SPGR
MICCAI 2012    —         —            —                                 MP-RAGE
MICCAI 2013    —         —            Siemens Vision (1.5T)             MP-RAGE

Table 2.1: Characteristics of the used datasets.

for every new target image, which is very time-consuming. Our evaluation strategy was guided by a real-life scenario in which test images would not be present at the time of population atlas creation. Consequently, the population atlas was constructed only once, from the atlas images, and re-used for every new target thereafter.

The evaluation accuracy of individual label overlaps is also influenced by the way multiple adjacent labels, which might overlap in the final segmentation results, are handled. One way would be to consider each label as a separate entity with its label probability values. Another way is to combine the individual labels into a single segmentation map. In the latter case, overlapping areas have to be assigned to one of the competing labels based on their label probability values. In our evaluation the second approach was chosen, to provide a single label map consistent with the atlas label maps used as input.

Due to the reliance of the MAS concept on image registration, the evaluation of nonrigid registration methods and the quantification of segmentation accuracy are very closely related. In the literature, registration methods are commonly assessed with surrogate measures such as anatomical label overlap, tissue overlap, image similarity and inverse consistency [195]. However, Rohlfing showed that only overlap measurements of small labelled ROIs can assess registration quality accurately [167]. Consequently, in our experiments the overlap between a label annotated by means of expert manual segmentation, which is considered the gold standard, and the label assigned by the segmentation method under assessment was used for performance evaluation. Usually the region overlap is given either by the Jaccard similarity (JS) [108] or the Dice similarity coefficient (DSC) [71], both shown in equation 2.1 and related by DSC = 2 · JS / (1 + JS) [66]. The JS of two overlapping regions A and B is defined as the ratio of the size of the intersecting volume and the size of the union of the volumes. The DSC can be written as the ratio of the intersecting volume and the mean of the volumes. Applied to medical images, the respective volumes are given by the enclosed number of voxels N(·).
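Both overlap measures are straightforward to compute from binary masks, and the relation DSC = 2 · JS / (1 + JS) can be checked numerically. A small sketch with toy 2-D segmentations:

```python
import numpy as np

def jaccard(a, b):
    """JS = |A intersect B| / |A union B| for boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def dice(a, b):
    """DSC = 2 |A intersect B| / (|A| + |B|) for boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Two overlapping toy segmentations.
A = np.zeros((10, 10), bool); A[2:7, 2:7] = True      # 25 voxels
B = np.zeros((10, 10), bool); B[4:9, 4:9] = True      # 25 voxels, 9 shared

js, dsc = jaccard(A, B), dice(A, B)
print(round(js, 4), round(dsc, 4))         # 0.2195 0.36
print(np.isclose(dsc, 2 * js / (1 + js)))  # True
```

Since DSC is a monotonic function of JS, the two coefficients rank segmentations identically; DSC simply weights the intersection more generously.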

JS(A, B) = N(A ∩ B) / N(A ∪ B)        DSC(A, B) = 2 N(A ∩ B) / (N(A) + N(B))        (2.1)

2.6 Our approach to template building

The first steps of our proposed method include pre-processing such as skull-stripping, affine normalisation to a standardised space and tissue classification of the MR brain images. The resulting tissue maps from the atlases are used to create an average population template. Let us consider an a priori set of 3-D atlases A = {A_j = (I_j, {L_j^r}_{r=1...R})}_{j=1...N}, where each atlas consists of an intensity image I_j and R label maps {L_j^r}_{r=1...R}, with L_j^r(x) = 1 (maximum probability) if voxel x belongs to label r in atlas A_j and 0 otherwise (Fig. 2.1).

Pre-processing

We also ensure that ∀x, Σ_{r=1...R} L_j^r(x) ≤ 1, i.e. the label maps do not overlap. Next, we skull-strip all atlas images {I_j}_j; BET, ROBEX and Freesurfer were tested with default parameters. The skull-stripped images are linearly registered with 12 degrees of freedom to a common space, that of the MNI152 MR brain atlas, using FSL FLIRT 4.1. We obtain a set of affinely registered images {Ĩ_j}_j and the corresponding affine transformations. We then estimate tissue maps {G̃_j}_j, {W̃_j}_j and {C̃_j}_j with SPM's New Segment. Finally, we apply the affine transformations to the individual label maps to also bring them into the same MNI152 coordinate system: {L̃_j^r}_{r,j}.

Figure 2.1: The atlas image in original space is skull-stripped, linearly transformed into MNI space and segmented into tissue maps. The same affine transformation is applied to the corresponding atlas label map.

Estimating a population template

We constructed average population templates with ANTs' SyN template-building method and SPM's group-wise registration algorithm DARTEL, both highly ranked in evaluation studies [122]. DARTEL iteratively creates increasingly sharper population templates Ī, Ḡ, W̄ and C̄ from the set of affinely registered atlas intensity images and tissue maps. As a by-product we also get a set of deformation fields D = {D_{Ĩ_j→Ī}}_{j=1...N} that precisely map each image and its tissue maps to the corresponding population template. Note that DARTEL creates all three tissue population templates concurrently, using the tissue maps to improve the accuracy of the process. Consequently, we get the same deformation fields between the atlas images and the image template as between the atlas tissue maps and their respective population templates. SyN does not explicitly require tissue-classified maps; consequently, only one template, which contains all tissue classes, is constructed. Similarly to DARTEL, the output consists of the deformation fields that map each atlas intensity image to the population template. The DARTEL image template with all tissue classes can be created as the average of the warped atlas images, to allow for easier comparison with SyN's templates. The warped

atlas images are obtained by applying each deformation field to the corresponding atlas image.

Figure 2.2: Offline population templates for GM, WM and CSF are concurrently constructed from the corresponding atlas tissue maps. At runtime the target is registered to these population templates. The created deformation fields map the template to each individual atlas and to the target.

Transferring the label maps to the target

At runtime the same pre-processing steps as outlined above are applied to the target image T to obtain an affinely registered target T̃ in the space of the MNI152 MR brain atlas, and tissue maps G̃_t, W̃_t and C̃_t. These tissue maps are nonlinearly aligned to the population templates Ī, Ḡ, W̄ and C̄ by estimating the deformation field D_{T̃→Ī}. The final transformation D_{Ĩ_j→T̃} between each atlas Ĩ_j and the target T̃ can be approximated by the composition D_{T̃→Ī} ∘ D⁻¹_{Ĩ_j→Ī}. The corresponding atlas label maps {L̃_j^r}_{r=1...R} are transferred to the target by applying this composite deformation field to them.
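Transferring a label map through a composite deformation needs nearest-neighbour interpolation so that the integer labels are preserved. A minimal NumPy/SciPy sketch with synthetic 2-D displacement fields, illustrating only the mechanics of composing and applying displacements (not the actual diffeomorphic fields produced by DARTEL or SyN):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_labels(labels, disp):
    """Warp an integer label map by a displacement field (voxel units).

    labels: (H, W) integer array; disp: (2, H, W) y/x displacements.
    order=0 (nearest neighbour) keeps labels integer-valued.
    """
    H, W = labels.shape
    coords = np.mgrid[0:H, 0:W].astype(float) + disp
    return map_coordinates(labels, coords, order=0, mode='nearest')

def compose(d1, d2):
    """Approximate composition of displacement fields: d2(x) + d1(x + d2(x))."""
    H, W = d1.shape[1:]
    coords = np.mgrid[0:H, 0:W].astype(float) + d2
    warped = np.stack([map_coordinates(d1[c], coords, order=1, mode='nearest')
                       for c in range(2)])
    return d2 + warped

# A toy label map and a pure one-voxel translation along y.
labels = np.zeros((8, 8), int); labels[2:4, 2:4] = 7
shift = np.zeros((2, 8, 8)); shift[0] += 1.0

# Composing two half-voxel shifts reproduces the full one-voxel shift.
full = compose(shift / 2, shift / 2)
print(np.array_equal(warp_labels(labels, full), warp_labels(labels, shift)))  # True
```

In the actual pipeline the two fields being composed are the target-to-template deformation and the inverse of the atlas-to-template deformation, so only one nonlinear registration has to be run per target.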

Figure 2.3: The same coronal slice of a subject's brain scan skull-stripped with FSL-BET (left), ROBEX (middle) and Freesurfer (right). While BET misses parts of the brainstem, ROBEX removes too much of the GM. Freesurfer showed the best trade-off, with smooth borders.

2.7 Experiments

This approach was applied to the MICCAI 2012 dataset (Section 2.4.2), consisting of 15 atlas and 20 target images. The pre-processing steps, including skull-stripping and affine registration to MNI space, were performed for all images. The visual comparison of the results of the three skull-stripping methods (Fig. 2.3) showed large under-segmented but also over-segmented regions for FSL-BET. Here we define over-segmentation as the removal of too much tissue, i.e. tissue that should not have been removed, and, conversely, under-segmentation as the removal of too little tissue, leaving parts that should have been removed. Under-segmentation was mainly observed around the brainstem, while some scans were over-segmented, with parts of gyri removed. ROBEX performed consistently well in removing non-brain tissue, but we noticed slight over-segmentation: small parts of the GM, and in some scans WM, were removed. Freesurfer provided the best specificity at high sensitivity, without removing GM tissue.

The DARTEL template creation process iterates through two main steps: firstly, it refines the population template and secondly, it refines the deformation fields. It starts with a coarse average of all atlases, which becomes clearer and crisper with every iteration (Fig. 2.4).

Figure 2.4: The same axial GM slice of the DARTEL template at different stages of the algorithm. With each iteration (from left to right and top to bottom) the population template, constructed from the individual atlases, becomes crisper and clearer.

A visual comparison of the final DARTEL template, created by averaging the warped atlas images, and the template created by SyN showed similar appearance (Fig. 2.5). SyN provided crisper borders with more details compared to the slightly blurrier template by DARTEL. The deformation fields between each atlas and the template estimated with SyN appeared smoother compared to the deformation fields estimated with DARTEL (Fig. 2.6). However, in our runtime comparison DARTEL showed better performance than SyN on the same desktop PC.

Figure 2.5: The same axial slice through the population templates created with DARTEL (left) and SyN (right).

Figure 2.6: The same axial slice through the deformation fields in one direction, mapping a subject to the population template created by DARTEL (left) and ANTs (right). It is notable that the deformation field created with DARTEL is much crisper.

2.8 Discussion

Pre-processing tools for intensity inhomogeneity correction and brain tissue segmentation and classification are often used in the first steps of the MAS pipeline. Consequently, their performance affects the quality of the final segmentation. It can have an impact on offline learning, image registration, atlas selection and fusion (Fig. 1.6), all of which require similarity measurement between images or features derived thereof. We tested three different skull-stripping methods, FSL-BET, ROBEX and Freesurfer, and found differences in the quality of the estimated brain masks. FSL-BET showed under-segmented brain regions with default parameters and,

even with manual adjustment of the intensity threshold options, which allow the estimation of a smaller or larger brain outline for the whole brain or parts of it, no ideal solution was achieved. ROBEX was the most user-friendly tool and, considering that it does not require parameter input, showed high precision in the region of the brainstem and in removing non-brain tissue. However, we noticed slight over-segmentation of cortical regions, which might introduce errors into the subsequent tissue classification and template construction. Freesurfer allows the adjustment of multiple parameters for its combined watershed- and template-based approach and showed overall satisfactory results with its default configuration. Although the region around the brainstem was slightly under-segmented compared to the results of ROBEX, the cortical segmentation appeared more precise in our experiments. One of the challenges of skull-stripping is to find a robust method: the quality of the results is often influenced by image contrast, artefacts, partial volume effects, anatomical variability and voxels with the same intensity values in brain and non-brain tissue. Consequently, the performance of skull-stripping methods may vary between datasets or images, which makes a direct comparison and ranking challenging.

One crucial task in the MAS pipeline is image registration. Nonlinear registrations between the atlases and the target cause the main computational burden at runtime, which represents a disadvantage of many MAS approaches. Depending on the degrees of freedom and the underlying approach, registration methods show varying registration accuracy and computational efficiency. In general, the direct estimation of a high-dimensional dense deformation field is computationally expensive, but can provide high registration accuracy for local geometric differences. MAS approaches using only affine registration methods with a small number of degrees of freedom, such as patch-based techniques, can align images much faster but show inferior segmentation accuracy compared to nonlinear approaches. In patch-based techniques the main computational burden is shifted towards the label fusion, where

the method iterates through voxels to assign a label to them. Other ways to improve computational efficiency include the use of better hardware and its parallel computing ability, which allows the computation of some tasks simultaneously. However, these resources are not commonly available to end-users yet and do not provide a solution to every task. Due to its computational efficiency, the use of a template for normalisation of atlas and target images has been of particular interest for our approach. Pre-computed population atlases are publicly available or can be constructed offline from a set of individual atlases. The use of the latter for MAS has been shown to provide superior results, since it can be constructed from images similar to the target. But current template-based methods still do not reach state-of-the-art segmentation results compared to direct registrations between each atlas and the target. We tested two of the most commonly used registration and template construction methods, DARTEL and ANTs. The comparison of the resulting templates showed crisper borders and more detail in the ANTs template compared to DARTEL's. Conversely, the corresponding deformation fields created with ANTs were smoother compared to DARTEL's, which could be related to the parameter configuration we used for DARTEL. This is further supported by Klein et al.'s study [122], where the accuracy of multiple methods was compared and both ANTs and DARTEL were highly ranked. We chose DARTEL over ANTs for template creation and concurrent registration steps because it showed faster performance in our tests. Given the importance of image registration in MAS, Klein et al. and other authors [65, 122] also noted that not every brain can be one-to-one mapped to any other brain and, consequently, image correspondence should not be mistaken for anatomic correspondence.
They further concluded that registration accuracy can be improved by using similar templates as intermediary registration targets, but finding a template which provides high correspondence in every ROI might be difficult. Another problem is the measurement of registration quality. Rohlfing tested several commonly used accuracy measures, including the label overlap between the transformed and target labels [167]. On a macroscopic level with large labels, these measurements were not able to reflect the degree of correspondence a registration method achieved. Rohlfing concluded that only the use of localised anatomical regions for overlap measurement can distinguish between the quality of registrations. This would require a dense set of landmarks, which is usually not available, or a set of small labels delineated with high precision. The precise delineation of ROIs is also crucial for MAS, due to its heavy reliance on expert segmentations. These labels are commonly acquired by means of time-consuming manual delineation. Although this is considered the gold standard and detailed protocols for annotation are usually provided, such labels often suffer from inter- and intra-operator variability. In [105], Iglesias and Sabuncu pointed towards alternative methods to reduce the manual burden: only those images with the largest potential to accurately segment a target could be manually delineated by experts. Conversely, Maier-Hein et al. [139] and Ganz et al. [86] showed that the delineation can also be performed by a larger number of non-experts, which has been shown to produce high-quality annotations similar to expert results. They used a crowd-based approach for feature annotation in endoscopic images. Similar results were presented by Bogovic et al. [36], where a hierarchical cerebellum parcellation protocol for non-experts was tested. However, the question remains whether this is applicable to more complex anatomical brain structures and scans of varying quality. Since they used high-quality research scans, it is questionable whether the same level of accuracy can be reached in the presence of artefacts. In conclusion, results from pre-processing methods should be carefully examined, since they can impact the accuracy of concurrent MAS steps.
The use of a population template can improve computational efficiency at runtime and, when constructed from a set of atlases from the same cohort as the target, also improve the quality of the target-template registration and segmentation accuracy. However, the segmentation accuracy of traditional template-based methods is, in general, inferior to methods that use direct registrations. One

reason is that the target could still be very different to the population template in some ROIs, even when the template was constructed from the same population as the target [122]. The quality of the registration between target and population template can be improved by constructing an intermediary target-specific template (TST) at runtime.

Chapter 3

Target-specific template

3.1 Introduction

In the previous chapter the use of an average population atlas as a reference space for normalisation was presented. By using a reference space, only one time-consuming nonlinear registration must be performed at runtime, between the target and the average population template. The atlas images and the target are linked by composing their corresponding deformation fields, allowing the propagation of labels from selected atlases to the space of the target, where they can be directly fused [221]. However, the construction of the average population template is performed offline, without knowledge of the target. Consequently, if the set of atlases and, in turn, the population template are morphologically dissimilar to the target, the registration between the two images might be imprecise [168, 239]. With the similar motivation to reduce the number of registrations online, but also to turn a potentially large, difficult registration task into easier, more accurate tasks, the construction of a graph or tree-like structure offline has been proposed [46, 88, 96, 110, 111, 118, 147, 202, 225, 234]. Each node represents an atlas image, with deformation fields providing the links between them. The nodes are organised based on their similarity to each other and all deformation fields in the tree are pre-computed offline. A new target can be

efficiently added to the most similar atlas in the structure. The information can then be propagated from any node in the structure to the target by composing the deformations that link the corresponding nodes along the path. Each node along the path between the target node and an atlas node can be seen as an intermediate template to refine a potentially large deformation. One representative example is Wolz et al.'s [234] atlas propagation framework, where a manifold is learned from all atlases and unlabelled target images. Only similar atlases within a specified neighbourhood around each target were used for multi-atlas segmentation. A similar approach, with the goal to reduce the risk of misregistration by refining large deformations into a sequence of smaller, more accurate registrations via similar intermediate images, was proposed by Gao et al. [88]. Each edge between nodes carries a weight defined by the pairwise image similarity. In contrast to Wolz's method, the information from all atlases was propagated along the optimal geodesic paths to the target images, and a patch-based label fusion method was used to produce the final segmentation. A more generic model for the propagation of voxel-wise annotations between images in a spatially-variant graph structure was presented by Cardoso et al. [46]. A similar concept was used by Tang et al. [202] to improve registration accuracy between images. They used principal component analysis (PCA) on a training set of deformation fields, acquired by registering individual images to a selected template, to capture the statistical variation of the deformations. Each of the training deformation fields, as well as the target deformation field, acquired by registering a target image to the same template, could be characterised by a small number of parameters in a low-dimensional space.
The most similar deformation fields from the training set were determined by comparison in this low-dimensional space, which allowed the approximation of the target deformation field and the construction of an intermediate template, further used for refinement. One criticism of this concept was identified in a study by Sjoeberg et al. [188], where the impact of intermediate templates for linking the target and the remaining atlases on the resulting segmentation quality was investigated for

a varying number of composed deformation fields and compared to a single direct registration. Images from a set of atlases with abnormal anatomy were randomly selected to serve as intermediate templates. It was shown that the composition of deformation fields in a multi-atlas framework can lead to a decrease in segmentation accuracy almost linearly with the number of links, compared to direct registration. Although results could be improved by ranking intermediate templates based on their similarity to allow more accurate registration, direct registration was superior. In the study, atlases of diseased patients with head and neck tumours were used, which makes the registration even more challenging. However, the use of composed deformation fields reduced computation time by approximately 2/3. Overall, they concluded that the improved computational efficiency outweighs the differences in segmentation quality, making it a feasible method in a clinical setting. An alternative method that aims to overcome the criticisms of using a population template or a graph structure was presented by Commowick, Warfield and Malandain [59, 61, 161]. An average population template is constructed as a reference space from the atlases offline. At runtime the target is registered to this reference space. The locally most similar atlases to the target are identified by comparing their respective deformation fields, under the hypothesis that more similar images should have similar deformations. A subset of these atlases is selected for the construction of a TST by iteratively registering the atlases to their average, constructing a new average template and applying the inverse of their average transformation to this template. This strategy represents the local extension of the average template construction method proposed by Guimond [95]. The target can then be registered to the TST.
Assembling a TST from similar atlas ROIs is expected to yield a high-quality registration to the target, since the locally most similar atlases to the target are selected, rather than whole atlas images, for which there is no guarantee of being the most similar in every ROI. At runtime it requires only two nonlinear registrations, one between the target and the average population template and one between the target and the

TST. The atlas labels can be propagated to the target by applying the composed deformation fields between the corresponding atlas and the TST and between the TST and the target. The advantage of using a TST has also been shown for atlas building in the presence of artefacts [151] and for multi-organ segmentation [235]. Shi et al. presented a similar strategy for constructing neonatal brain atlases at different stages of development [187]. Individual atlas images are spatially normalised by creating a population template from them. This population template is parcellated into smaller ROIs with a watershed algorithm. The atlases are clustered into sub-populations for each ROI, identifying groups of similar shapes, with one atlas in each group determined as a representative exemplar for the group. For each ROI and for each group an average template is constructed. A given target is normalised to the initial whole-brain average population template and, for each ROI, the most similar exemplar and, consequently, sub-population is determined based on MI similarity. The TST is then assembled from the corresponding regional average templates. Due to the use of regional average templates, constructed from similar sub-populations, the final TST is more similar to the target than an average whole-brain template. The comparison of images plays a crucial role in finding those atlases with the highest chance of producing a TST similar to the target and an accurate segmentation. This comparison can be performed with different metrics for quantifying the similarity, based on different comparison bases such as image intensities or age, and on different scales. However, as mentioned in the previous chapter, high image correspondence should not be confused with high anatomical correspondence [65, 122]. Similarly, registration accuracy can usually only be quantified with surrogate measures [167].
Multiple surrogates were tested by Rohlfing and only label overlap scores of localised anatomical regions were able to provide a reasonable estimate. An accurate evaluation would require a dense set of landmarks, which is usually not available. In the next section we will provide an overview of different comparison strategies.
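Most of the approaches surveyed in this introduction propagate information by composing dense deformation fields. The sketch below illustrates that operation in 2-D with NumPy/SciPy; the function names and the representation of a deformation as an array of per-voxel displacements are assumptions of this illustration, not the thesis implementation:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose(disp_outer, disp_inner):
    """Compose two dense displacement fields of shape (2, H, W):
    a point x is first moved by disp_inner, then by disp_outer
    sampled at the moved position: x -> x + d_in(x) + d_out(x + d_in(x))."""
    grid = np.indices(disp_inner.shape[1:]).astype(float)
    pts = grid + disp_inner
    sampled = np.stack([map_coordinates(disp_outer[d], pts, order=1,
                                        mode='nearest') for d in range(2)])
    return disp_inner + sampled

def warp_labels(labels, disp):
    """Pull a label image through a displacement field using
    nearest-neighbour interpolation so label values stay integral."""
    grid = np.indices(labels.shape).astype(float)
    return map_coordinates(labels, grid + disp, order=0, mode='nearest')
```

Each call to `compose` resamples one field inside another, so chaining many intermediate templates accumulates interpolation error, consistent with the accuracy loss reported by Sjoeberg et al. above.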

3.2 Comparison basis

Atlas selection is done on the basis of the comparison between the target image and the atlases. A number of criteria have been investigated in the literature. They can broadly be grouped into image-based and non-image-based. Image-based criteria include features directly extracted from the given image or features indirectly extracted from image derivatives after processing the original. Examples of the former are image intensities or normalised image intensities, and of the latter, 3D surface meshes, registration consistency or deformation fields. Non-image-based comparison can be based on demographics such as age, sex or pathology.

3.2.1 Image intensities

The selection of the best candidates is most commonly based on the similarity of atlas and target intensity images after alignment [4, 5, 10, 212, 239, 240]. In general, methods based on nonlinear registration of the atlas images to the target, e.g. [74], outperform methods based on affine registration to the target, e.g. [64]. The former are computationally more expensive but provide better alignment, while the latter are more efficient but less accurate due to the poorer alignment. Rikxoort et al. [217] presented a method with both characteristics. They efficiently aligned all atlases to the target with a fast affine registration, performed the ranking and selection, and used a computationally more expensive nonlinear registration only to align the selected atlases more accurately to the target.

3.2.2 Non-image information

The selection of the best candidates for a specific target can also be based on non-image information such as meta-data, including age, sex or pathology. In [5], it was shown that age-based selection can achieve similar segmentation accuracy compared to the best results from intensity-based selection methods in young and middle-aged groups. However, for the older age group, age-based selection showed significantly better results. In a similar study, Aribisala et al. [10] investigated the impact of atlas selection for an older subject cohort. They found that the selection of different single atlases from the age-matched cohort barely had an impact on segmentation accuracy. However, the selection of a young adult brain atlas instead of age-matched atlases led to systematic segmentation errors.

3.2.3 Registration consistency

Another atlas selection strategy was presented by Heckemann et al. [100], where the segmentation quality was measured based on two registration and transformation steps. A deformation field is estimated by registering an atlas to the target image. This deformation field is applied to the corresponding atlas label to transform it to the target. The resulting transformed label is propagated back to the atlas image by applying to it the deformation field estimated in a second registration, between the target and the atlas image. Registration and interpolation errors introduced in this procedure lead to differences between the original label and the forward-and-backward propagated label in the anatomical space of the atlas. These differences are measured with the Dice overlap coefficient and used as a quality measure. The atlases were ranked for each ROI according to their Dice overlap measures. A certain number of the highest ranked atlases were selected for fusion. Experiments were performed with the spline-based free-form deformation [177] with a control point spacing of 2.5 mm. Although they achieved higher Dice overlap scores compared to the use of randomly selected atlases, it should be noted that registration consistency does not provide a surrogate for measuring registration accuracy, as shown by Rohlfing [167].
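A minimal sketch of this consistency-based ranking, assuming the forward-and-backward propagated labels have already been produced by a registration tool (the function names are illustrative):

```python
import numpy as np

def dice(a, b):
    """Dice overlap coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def rank_by_consistency(original_labels, roundtrip_labels, n_select):
    """Score each atlas by the Dice overlap between its original label
    and the forward-and-backward propagated copy, and return the
    indices of the n_select most consistent atlases."""
    scores = np.array([dice(o, r) for o, r in
                       zip(original_labels, roundtrip_labels)])
    return np.argsort(scores)[::-1][:n_select]
```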

3.2.4 Anatomical geometry

An alternative atlas selection strategy for the segmentation of lymph node regions from CT scans was proposed by Teng et al. [204]. In contrast to other methods, the selection was based on landmarks extracted from 3D volumes of anatomical structures. The volumes were automatically segmented with a combined thresholding and active contouring method, resulting in the delineated anatomical structure and a 3D surface mesh. Three different types of features were extracted from each anatomical landmark structure: the structure's volume and its extent relative to the overall head and neck region, the vectors describing the structures' locations relative to each other, and shape properties given as surface meshes. The feature vectors were used to create a weighted Euclidean distance matrix of the atlases to the target. One or multiple atlases with the shortest distances were selected and registered to the target. To achieve accurate registration, the 3D surface meshes were used as landmarks and incorporated as additional information in the registration between atlases and targets. The similarity between the shapes of two meshes was calculated with the Hausdorff distance, with smaller distances indicating a higher correspondence. The similarity ranking determined based on feature vectors was compared to the ranking determined by registering all atlases to the target. Results showed that 81% of the most similar candidates selected based on features matched the candidates selected based on image registration, with a 96% chance of the top candidate being within the top three subjects.

3.2.5 Deformation fields

The selection of the best candidates is most commonly based on image intensities, which has been shown to provide a reliable basis for similarity measurement in high quality research scans.
However, due to the non-quantitative image acquisition method used in MRI, image intensities are strongly influenced by the sequence, scanner, MRI system, coil and image reconstruction method.

Artefacts such as intensity non-uniformities, movement artefacts and partial volume effects can be introduced by the scanner hardware or the subject. The presence of pathology further complicates image intensity classification and label affiliation. One approach which has shown promising results for candidate selection, especially in the presence of pathology, is based on the comparison of deformation fields. The concept of using deformations for multivariate analysis was first applied by Bookstein [37], who suggested the use of decomposed deformations, called principal warps, for multivariate statistical analysis. In the context of medical image analysis, the distribution of these principal warps allowed the discrimination and quantification of pathological pattern severity. The technique formed the basis for the description of shape differences within or between subjects through their corresponding deformations. Thompson et al. [208] developed a framework to analyse the variability of cortical shape patterns and structural variations, and to classify deformity as normal variation or pathological abnormality. The deformation fields between subjects and an average population template in a normalised space were estimated both to bring them into correspondence and to carry information about their shape differences. The magnitude and direction of the local deformations could be statistically analysed through covariance tensors to provide probability values. A confidence region for the distribution of each anatomic point can be calculated in displacement space, indicating the likelihood of finding the cortical structure of interest at the specified anatomic location, which, in turn, allows the characterisation of unusual anatomical positioning [206, 207].
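The idea of voxel-wise displacement statistics can be sketched as follows. This is a heavily simplified illustration, not Thompson et al.'s published implementation; the array layout and the regularisation constant are assumptions:

```python
import numpy as np

def voxelwise_mahalanobis(disps, new_disp, eps=1e-6):
    """Mahalanobis distance of a new displacement from the population
    distribution at each voxel.
    disps: (N, V, 3) displacements for N subjects at V voxels;
    new_disp: (V, 3). Returns (V,) distances; large values indicate
    unusual anatomical positioning."""
    mean = disps.mean(axis=0)                              # (V, 3)
    c = disps - mean                                       # (N, V, 3)
    cov = np.einsum('nvi,nvj->vij', c, c) / (len(disps) - 1)
    cov += eps * np.eye(3)                                 # regularise singular tensors
    d = new_disp - mean                                    # (V, 3)
    x = np.linalg.solve(cov, d[..., None])[..., 0]         # cov^-1 d, per voxel
    return np.sqrt(np.einsum('vi,vi->v', d, x))
```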
The basic concept of classifying shape differences based on their deformation fields was employed by Commowick, Warfield and Malandain [60, 61] to select the best candidates for the segmentation of a target image. They hypothesise that the best result is obtained by selecting the most similar atlases to the target in displacement space, rather than intensity space, under the assumption that more similar images should have more similar deformations. Their objective is to find the candidate Ĩ among the atlas images I_k that requires the least amount of deformation D_{T I_k} to nonlinearly register to the

target T, as measured by the Euclidean distance:

\tilde{I} = \arg\min_{I_k} d(I_k, T) = \arg\min_{I_k} \lVert D_{T I_k} - \mathrm{Id} \rVert \quad (3.1)

While this approach requires the nonlinear registration of each atlas to the target, they proposed a computationally more efficient method. Instead of registering every atlas directly to the target, an average population template M was constructed offline. The resulting deformation fields D_{M I_k} between each atlas I_k and M can be compared to the deformation field D_{M T}, estimated between the target and M at runtime. It is assumed that D_{T I_k} can be composed as D_{T I_k} \approx D_{M I_k} \circ D_{M T}^{-1}:

d(I_k, T) = \lVert D_{M I_k} \circ D_{M T}^{-1} - \mathrm{Id} \rVert = \sum_i \lVert (D_{M I_k} \circ D_{M T}^{-1})(i) - \mathrm{Id}(i) \rVert \quad (3.2)

3.3 Similarity metrics

Once a basis for comparison has been selected, a similarity or dissimilarity metric can be used to rank the atlases. While Wu et al. [168, 239] and Avants et al. [23] investigated the impact of the selection of the average population template and its construction on the segmentation accuracy, similar studies have been conducted for the selection of individual atlases for propagation to and segmentation of a target [1, 5]. The most commonly used metrics, including sum of squared differences (SSD), mutual information (MI), normalised mutual information (NMI) and correlation coefficient (CC), were compared. Aljabar et al. [5] nonlinearly registered all MR brain images to a common template space, where the similarity measurements with the above mentioned methods were conducted. Reference accuracy values were calculated by first ranking the atlas labels based on their Dice overlaps with the target labels and then fusing them into a final segmentation. It shall be noted that this is impossible in a real-life scenario where the target labels are

unknown. In their comparison, NMI generally showed the best results. SSD was identified as the least reliable metric, and CC and MI varied in between. In contrast, Acosta et al. [1] compared CC, SSD and MI on a set of nonlinearly registered pelvic CT scans and identified SSD and CC as the most reliable. Although both studies used image intensities as a similarity basis, the use of different modalities, registration methods, fusion strategies and anatomical structures makes the comparison of these studies difficult. But it does highlight that atlas selection and the choice of similarity measurement pose a challenge. Another interesting study about the impact of the ranking on segmentation accuracy was presented by Ramus and Malandain [162]. They evaluated different ranking strategies independently of the number of selected candidates and the atlas fusion method. The ranking strategies included CC, SSD, MI and NMI applied to nonlinearly and affinely registered intensity-based methods, deformation-based methods, reference methods based on overlap and distance measures, and random ranking. All ranking methods were clustered, and Spearman's rank correlation coefficients were calculated to outline sub-groups of equivalent ranking methods and to compute the average correlation between the rankings of the automatic methods and the ranking of the reference method. The resulting main clusters included the group of reference methods on one side of the spectrum and the random ranking on the other side. In between, intensity-based methods after nonlinear registration formed a group, and intensity-based methods after affine registration formed a group with deformation-based methods.
Results showed that the correlation between the reference and the clusters containing intensity-based methods after affine registration and deformation-based methods was higher than between the reference and all clusters containing intensity-based methods after nonlinear registration. This suggests that deformation-based methods are more appropriate than any of the similarity metrics applied to intensity-based methods after nonlinear registration.
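The deformation-based selection favoured by this comparison can be sketched as follows: approximate the atlas-to-target deformation by composing each atlas's template-space field with the inverse target field, and rank atlases by the magnitude of the composed displacement. A minimal NumPy/SciPy illustration; the array layout, the availability of the inverse target field and the function names are assumptions of this sketch:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def composed_distance(d_m_ik, d_t_m):
    """Magnitude of the displacement obtained by composing the field
    taking the target towards template space (d_t_m, standing in for
    the inverse field) with the field from template space to atlas k
    (d_m_ik); both are (3, X, Y, Z) voxel-displacement arrays."""
    grid = np.indices(d_t_m.shape[1:]).astype(float)
    pts = grid + d_t_m
    sampled = np.stack([map_coordinates(d_m_ik[d], pts, order=1,
                                        mode='nearest') for d in range(3)])
    return np.linalg.norm(d_t_m + sampled)   # distance to the identity

def select_atlases(fields_m_ik, d_t_m, n):
    """Return the indices of the n atlases requiring the least
    deformation to reach the target."""
    dists = [composed_distance(f, d_t_m) for f in fields_m_ik]
    return np.argsort(dists)[:n]
```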

More complex similarity measurement methods, which have been shown to be superior to conventional measurement methods, are based on distances between projections of atlas and target images in a low-dimensional manifold space. In an evaluation by Duc et al. [75], three different manifold learning methods, Isomap, Laplacian Eigenmaps and local linear embedding, were compared on a set of MR brain images with manually segmented hippocampus labels. The goals of the study were to find the best technique for atlas selection and the best choice of the manifold parameters. Local linear embedding with 11 dimensions reached the highest overlap measure, but they concluded that the choice of method and parameters should be determined empirically, since it also depends on the dataset and the number of atlases used. Overall, results reached state-of-the-art accuracy or exceeded results obtained with conventional selection methods.

3.3.1 Basic similarity metrics

The SSD is ideally used if the difference between the two images is only Gaussian noise, because of its sensitivity to outliers such as intensity inhomogeneities. Consequently, without further intensity mapping, it should only be used for images of the same modality with an identity relationship between their intensity ranges. For two images, a target T and an atlas A_M, with N voxels, the SSD can be calculated as

\mathrm{SSD} = \frac{1}{N} \sum_{x \in \Omega} \lVert T(x) - A_M(x) \rVert^2 \quad (3.3)

Similarly, CC can only be used for mono-modal image comparison. In contrast to SSD, the intensity ranges of the images can differ but require a linear relationship. The CC of two images with their respective average intensities \mu_T and \mu_{A_M} can be calculated as

\mathrm{CC} = \frac{\sum_{x \in \Omega} (T(x) - \mu_T)(A_M(x) - \mu_{A_M})}{\sqrt{\sum_{x \in \Omega} (T(x) - \mu_T)^2 \sum_{x \in \Omega} (A_M(x) - \mu_{A_M})^2}} \quad (3.4)

for every voxel x. A method less sensitive to different intensity ranges, which makes it suitable for the comparison of images of different modalities, is MI or NMI. Its goal is to find the amount of shared information in two images. The amount of shared information is maximal for perfectly aligned images and decreases the more dissimilar they become. For instance, the difference image of two perfectly aligned identical images is a uniform image with zero entropy, and this entropy increases with misregistration. These information theoretic techniques compute the similarity of two images from the frequencies in their joint histogram. The MI of an atlas image A and a target T can be expressed as

\mathrm{MI} = H(A) + H(T) - H(A, T) \quad (3.5)

with H(X) being the marginal entropy of an image X, given by

H(X) = -\sum_{x} p_X(x) \log(p_X(x)), \quad (3.6)

the joint entropy, which is calculated from the probabilities of pairs of image values occurring together, given by

H(A, T) = -\sum_{a} \sum_{t} p_{A,T}(a, t) \log(p_{A,T}(a, t)) \quad (3.7)

and with p_{A,T} describing the joint probability distribution of a voxel associated with images A and T. To make MI more robust, normalised mutual information (NMI) has been introduced and can be calculated as

\mathrm{NMI} = \frac{H(A) + H(T)}{H(A, T)} = \frac{\mathrm{MI}(A, T)}{H(A, T)} + 1 \quad (3.8)
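The four metrics above map directly onto a few lines of NumPy; the histogram-based entropy estimate and the default bin count are choices of this sketch rather than prescriptions from the text:

```python
import numpy as np

def ssd(t, a):
    """Sum of squared differences per voxel, Eq. (3.3)."""
    return np.mean((t - a) ** 2)

def cc(t, a):
    """Correlation coefficient, Eq. (3.4)."""
    t, a = t - t.mean(), a - a.mean()
    return (t * a).sum() / np.sqrt((t ** 2).sum() * (a ** 2).sum())

def _entropies(t, a, bins):
    """Marginal and joint entropies from the joint histogram, Eqs. (3.6)-(3.7)."""
    p_at, _, _ = np.histogram2d(a.ravel(), t.ravel(), bins=bins)
    p_at /= p_at.sum()
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return h(p_at.sum(axis=1)), h(p_at.sum(axis=0)), h(p_at)

def mi(t, a, bins=32):
    """Mutual information, Eq. (3.5)."""
    h_a, h_t, h_at = _entropies(t, a, bins)
    return h_a + h_t - h_at

def nmi(t, a, bins=32):
    """Normalised mutual information, Eq. (3.8)."""
    h_a, h_t, h_at = _entropies(t, a, bins)
    return (h_a + h_t) / h_at
```

For two identical images the joint histogram is diagonal, so H(A, T) = H(A) and NMI attains its maximum value of 2.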

3.3.2 Manifold learning

With a growing number of population-based MRI studies, machine learning and pattern recognition algorithms have become important tools in computational neuroscience. These data-driven approaches encompass filtering algorithms, measures for determining coherence or relations in the data, and classification strategies. One class of methods of particular interest for medical image analysis, and especially for registration, segmentation and classification, is called manifold learning and facilitates dimensionality reduction and projection techniques [6]. Conventional similarity measurement methods from the previous section operate in a high-dimensional feature space with as many dimensions as there are features, where the number of features is determined by the comparison basis. In contrast, manifold learning techniques allow the projection of data from a high- to a lower-dimensional representation while respecting the intrinsic geometry of the data. Applied to medical images with image intensity as the similarity basis, this can be understood as follows. An image with 8 million voxels can be represented as one point in a space with 8 million dimensions, with coordinates given by the intensity values of each voxel. Due to similarities between images of a certain population, their representations in this high-dimensional space can be seen as a cloud of points lying very close to each other. This means that many fewer dimensions might be necessary to represent this group of images, allowing these points to be embedded in a low-dimensional sub-manifold of the high-dimensional space. Manifold embedding methods are able to find patterns based on similarities and differences in the data and can thereby significantly reduce the number of dimensions necessary for their representation and further processing. In the context of atlas selection this has some important implications.
As mentioned previously, similarity is conventionally quantified with some type of metric capable of measuring the distance between two points in the high-dimensional space. For instance, SSD is the sum of squared differences between the respective coordinates of two images in each dimension. Considering that the images are close to, or lie on, a lower-dimensional manifold, Euclidean-based distance measures in high-dimensional space are not always a meaningful representation of similarity. In contrast, the use of the geodesic distance, calculated on the manifold structure, is more representative and can be approximated by the Euclidean distance in a lower-dimensional manifold space. An important property of manifold learning methods is their ability to preserve the intrinsic geometry of the data when projecting it from high-dimensional space into lower-dimensional manifold space. In the last two decades, different manifold learning strategies have been developed, each with the goal to preserve a different geometrical property by optimising some objective function [6, 45, 75]. In summary, nonlinear methods such as Isomap and Locally Linear Embedding (LLE), compared to linear methods such as Principal Component Analysis (PCA), are able to find more complex patterns in the data and preserve the intrinsic geometry [89, 96]. Although these are considerable advantages over linear methods, they do not come without drawbacks. The need for the initial generation of a graph connecting all data points is computationally expensive, considering the high-dimensional space of about 24 million dimensions in the case of deformation fields. In contrast, PCA is only based on second order statistics, given by the covariance matrix, which makes it more feasible in some real world applications [110, 202, 234]. Overall, it has been shown that manifold embedding methods can reduce the dimensionality while still preserving the geometry of the data, and allow more accurate similarity measurement between images on a low-dimensional embedded manifold.

Principal component analysis (PCA)

PCA [112] is a classic method to highlight similarities and differences by identifying patterns in high-dimensional data. An efficient implementation for the application to images was proposed by Turk and Pentland [213], where
Overall, it has been shown that manifold embedding methods can reduce the dimensionality while still preserving the geometry of the data, and allow more accurate similarity measurement between images on a low-dimensional embedded manifold.

Principal component analysis (PCA)

PCA [112] is a classic method to highlight similarities and differences by identifying patterns in high-dimensional data. An efficient implementation for the application to images was proposed by Turk and Pentland [213], where

the goal is formulated as finding the principal components of the images that account for the largest variation. Mathematically, this amounts to finding the eigenvectors and corresponding eigenvalues of the covariance matrix of the images. By sorting the components according to their eigenvalues in decreasing order, a set of dimensions can be identified whereby the first principal component represents the dimension that accounts for the largest variance in the data, the second for the second largest variance, and so on. The number of dimensions can be reduced by eliminating redundant information and ignoring the components that account for the least amount of variance, without losing too much information. The original images and new targets can be projected into this eigenimage space, where they can be compared and reconstructed from the eigenimages. Because redundancy is measured by correlations, a drawback of this method is its dependency on second-order statistics only. Consequently, it lacks information on higher-order statistics and is therefore classified as a linear dimensionality reduction method.

Given a set of M mean-corrected images in matrix form A = [\Phi_1, \Phi_2, \ldots, \Phi_M], the covariance matrix can be calculated as

C = \frac{1}{M} \sum_{n=1}^{M} \Phi_n \Phi_n^T = A A^T. \quad (3.9)

With large images, finding the eigenvectors of C can become computationally expensive. The efficient method presented by Turk and Pentland [213] assumes that the number of images is smaller than the number of dimensions, which allows the calculation of the eigenvectors v_i of the much smaller matrix A^T A such that

A^T A v_i = \lambda_i v_i. \quad (3.10)

To determine the eigenvectors of C, both sides can be multiplied by A, which leads to

A A^T (A v_i) = \lambda_i (A v_i) \quad (3.11)

where the A v_i are the eigenimages u_i of C = A A^T. A mean-corrected image \Phi can be projected into eigenimage space by

w_k = u_k^T \Phi, \quad k = 1, \ldots, M \quad (3.12)

where w_k represents the contribution of the k-th eigenimage in the reconstruction of the target. The number of eigenimages used for the projection is variable, with more eigenimages accounting for more of the variance.

Isomap

A method capable of discovering nonlinear degrees of freedom in the data, while preserving their intrinsic geometry, is the manifold embedding technique Isomap [203]. The goal of the algorithm is to find a low-dimensional representation of the data that best preserves the distances between the data elements measured in high-dimensional space. This can be exemplified by comparing geodesic and Euclidean distances between two data points. For example, if we consider a spiral of data points in 3D, two points might be close to each other in terms of their Euclidean distance but far apart following the embedded manifold (Fig. 3.1). In order to capture this nonlinearity, a distance matrix containing the geodesic distances between each of the data items is created. This is done via a neighbourhood graph that connects each data item to a specified number of closest neighbours, as measured by Euclidean distances in high-dimensional space. Then the shortest paths between all pairs of elements of the graph can be estimated and a lower-dimensional embedding generated. Similarly to PCA, the embedding is constructed by calculating the principal components and projecting the data points onto them. The objective function of Isomap is:

Figure 3.1: The distance on the manifold (solid blue line) differs from the Euclidean distance (dotted blue line) [203].

\Phi(Y) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( g_{ij}^2 - \| y_i - y_j \|^2 \right) \quad (3.13)

with g_{ij} representing the geodesic distance between two data points, e.g. atlases a_i and a_j from a set of n atlases A = (a_1, \ldots, a_n) \in R^D in a high-dimensional space. Y = (y_1, \ldots, y_n) \in R^d is the new transformed dataset in the low-dimensional space.

Local linear embedding (LLE)

Another unsupervised learning algorithm for the computation of low-dimensional, neighbourhood-preserving embeddings of high-dimensional input data is LLE [175]. While Isomap aims to preserve the geodesic distances between pairs of data points in high-dimensional input space, which requires pairwise geodesic distance measurements between all points, LLE represents the local geometry of patches by reconstructing each data point from linear coefficients of its neighbours. It is assumed that there is a sufficient number of data points, so that each point is close to its neighbours and lies on a patch of the manifold. Due to this local reconstruction, pairwise distance measurements outside the respective local patch are not required. The objective function

Figure 3.2: Data points and patches on a manifold in 3 dimensions [175].

\Phi(Y) = \sum_{i=1}^{n} \left\| y_i - \sum_{j \in N_k(i)} w_{ij} y_j \right\|^2 \quad (3.14)

measures the cost given by the reconstruction error and is minimised by solving a least-squares problem with constraints on the weights of each neighbour's contribution. The weight w_{ij} represents the contribution of data point j to the reconstruction of data point i, where data point j has to be within the neighbourhood and the weights sum up to one. This makes the weights symmetric for any particular data point and its neighbours, and invariant to rotations, rescalings and translations, which preserves the intrinsic geometry.

3.4 Our approach to TST building

We build our TST image from the atlas images combined in such a way as to satisfy the following two constraints: (a) the TST should be as similar to the target as possible, to make it easier to estimate an accurate non-linear mapping between them, and (b) the number of composed deformation fields should be limited to reduce the risk of an imprecise transformation. In our approach we employ an average population template for computational efficiency (Chapter 2) and construct a TST locally similar to the target image by working at the label level, merging together regions from those atlas

images selected for their similarity to the corresponding region in the target. Compared to other methods, our similarity measurement is based on the L2 distance in nonlinear manifold embeddings constructed from the regions of deformation fields, and the TST is constructed from the warped atlases in the space of the target. This has the following advantages: (1) using deformation fields rather than image intensities makes for a similarity measure much less influenced by the intensity artefacts and inhomogeneities frequently found in MR images, (2) using label-specific nonlinear manifolds makes it possible to better take into account the local geometry of the atlas and target images, and (3) assembling the TST from parts of the warped atlas intensity images in the space of the target makes for an even more similar TST.

The TST serves as an intermediate template, and registering the target to the TST is expected to refine a potentially large, difficult registration between the target and the average population template. This adds one nonlinear registration at runtime and, in turn, one deformation to the composition. However, this registration is performed directly, rather than via intermediate images, and between two images very similar to each other, the TST and the target. The atlas labels can be propagated to the target by composing the deformation fields between the corresponding atlas and the average population template, between the population template and the TST, and between the TST and the target. Consequently, the number of composed deformation fields is constant and does not depend on the number of atlases.

Manifold embedding

For each label r we build a non-linear manifold space from the corresponding region of the deformation fields estimated between the atlas images and the population template, and between the target and the population template. This region, B_r, consists of the minimum set of all voxels with nonzero probability across the individual label maps \{ L_j^r \}, i.e.

B_r = \{ x \mid \exists j \in [1, N] : L_j^r(x) > 0 \}.

Note that this makes for potentially overlapping regions

Figure 3.3: A manifold is built for each label from the corresponding ROI of the deformation fields. The distances in manifold space provide a measure of dissimilarity.

across labels, which has to be taken into account when assembling the TST and fusing the labels. Following the recommendation of Duc et al. [75], we considered and empirically tested both LLE and Isomap and selected the latter, as it exhibited the best overall results. The parameters were fine-tuned by systematically varying the number of neighbours and dimensions, as they did. We extract the set of displacement vectors in the region from both the atlas deformation fields, \{ D_{\tilde{I}_j I} \}_{j=1,\ldots,N}, and from the target deformation field, D_{T I}. The resulting sets of displacement vectors, \{ U_r^{\tilde{I}_j} = \{ D_{\tilde{I}_j I}(x) \mid x \in B_r \} \}_{j=1,\ldots,N} and U_r^T = \{ D_{T I}(x) \mid x \in B_r \} respectively, are then mean-corrected, and we compute their pairwise L2 distances to serve as input for the Isomap algorithm. For each label, we get a projection of U_r^T and of the U_r^{\tilde{I}_j} onto the lower-dimensional manifold space (Fig. 3.3). We can then calculate the L2 distances between the target label projection and the atlas label projections and rank them. Overall, we get a distance matrix, Λ, with

Figure 3.4: The distances in manifold space are used as weights for the construction of the TST from regions of the warped atlas images.

R columns, one per label, and N rows.

TST construction

For each label r we warp the atlas intensity images \tilde{I}_j onto the target T by composing D_{\tilde{I}_j I} and D_{T I}^{-1}. This satisfies constraint (b) and results in a set of warped intensity images \hat{I}_j, from which we extract the intensities in region B_r. Here we use the normalised inverse manifold distances as weights for the candidate segmentations, \omega_i^r \propto 1/\lambda_i^r with \sum_{j=1}^{N} \omega_j^r = 1, to compute a weighted sum

H_r = \left\{ \sum_{j=1}^{N} \omega_j^r \hat{I}_j(x) \mid x \in B_r \right\} \quad (Fig. 3.4).

Finally, we assemble the TST by putting together the \{H_r\}_r. Two special cases can occur: when regions overlap, we average the intensity values of the corresponding voxels of the H_r, and for regions that are not part of any label we use the intensity value of the target image.
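The regional weighting and assembly step can be sketched as follows; the array shapes, toy data and helper names are illustrative only, assuming weights proportional to the inverse manifold distances as described above:

```python
import numpy as np

def normalised_inverse_weights(dists):
    """Weights proportional to 1/lambda_i, normalised to sum to one."""
    inv = 1.0 / np.asarray(dists, dtype=float)
    return inv / inv.sum()

def assemble_region(warped_atlases, region_mask, dists):
    """Weighted sum of warped atlas intensities inside one label region B_r."""
    w = normalised_inverse_weights(dists)
    out = np.zeros(np.count_nonzero(region_mask))
    for weight, atlas in zip(w, warped_atlases):
        out += weight * atlas[region_mask]
    return out

# Toy example: three warped atlases and one cubic region mask.
rng = np.random.default_rng(1)
atlases = [rng.random((16, 16, 16)) for _ in range(3)]
mask = np.zeros((16, 16, 16), dtype=bool)
mask[4:12, 4:12, 4:12] = True

weights = normalised_inverse_weights([0.5, 1.0, 2.0])
region = assemble_region(atlases, mask, [0.5, 1.0, 2.0])
print(weights)  # the closest atlas (smallest distance) gets the largest weight
```

In the full pipeline one such region is assembled per label and the pieces are stitched together, with overlaps averaged and unlabelled voxels filled from the target.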

Figure 3.5: The five highest ranked eigenimages in decreasing order.

Figure 3.6: Reconstruction of an atlas image from the training set with an increasing number of eigenimages.

3.5 Experiments

Reconstruction of a GM image slice with eigenimages

The 40 images of the LONI dataset were skull-stripped, affinely registered and classified into their tissue maps. The same axial centre slice was extracted from all GM tissue maps, resulting in 40 2D images. 39 images were used as atlases for training and one image as the target for testing. As outlined in Section , the eigenimages were determined from the atlases and a variable number of them used to reconstruct an atlas from the training set, to reconstruct the left-out target, and to find the most similar atlas to the target image. The eigenimages, illustrated in Fig. 3.5, were calculated from the training set and ranked in decreasing order of their corresponding eigenvalues. A random atlas image from the training set was reconstructed with the 10, 20, 30

Figure 3.7: Reconstruction of a non-atlas target image with an increasing number of eigenimages.

Figure 3.8: Target and the most similar atlas image identified in the space of eigenimages.

Figure 3.9: Cumulative variance explained by the eigenimages.

and 39 highest ranked eigenimages, which account for 38% of the variance (Fig. 3.9). The reconstruction with 10 eigenimages showed poor quality and similarity to the original, but with an increasing number of eigenimages a

higher degree of detail and similarity to the original image was added (Fig. 3.6). A perfect reconstruction was achieved with all 39 eigenimages, which account for 100% of the variation in the training set. An acceptable reconstruction can usually be achieved with eigenimages accounting for a variance of at least 65%. The reconstruction of a target image not in the atlas set showed only little improvement with the use of more eigenimages (Fig. 3.7). One reason might be the large morphological inter-subject variability shown in MR brain images and the coarse alignment of corresponding anatomical structures. Consequently, the high variation cannot be captured by the small number of 39 atlases and, in turn, eigenimages.

Reconstruction of a deformation field with eigenimages

The 40 images of the LONI dataset were pre-processed as outlined in Section 2.6. The population template was created from the tissue maps of 30 randomly selected images. Each of the 10 remaining images was nonlinearly registered to this same template. The corresponding deformation fields, estimated during the registration process, were used as atlases for training and as targets for testing respectively. The eigenimages were determined from the atlas set and a variable number of them used to reconstruct an atlas from the training set, reconstruct a left-out target and find the most similar atlas to a target. The inverse of the reconstructed deformation field was applied to the population template to obtain the individual target tissue map, which was compared to the original. The eigenimages were calculated from the training set and ranked in decreasing order according to their corresponding eigenvalues (Fig. 3.10). Similarly to the results from the previous experiment, an atlas from the training set was reconstructed with the 20 highest ranked eigenimages, which account for over 70% of the variance in the training set (Fig. 3.14).
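The eigenimage computation used in these two experiments follows the Turk and Pentland shortcut of equations 3.9-3.12; a minimal sketch on random stand-in data (all shapes and values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 39, 4096                    # 39 training images, 64x64 pixels flattened
images = rng.random((M, N))        # stand-in for the GM slices

# Mean-correct and stack one image per column: A = [Phi_1 ... Phi_M].
mean = images.mean(axis=0)
A = (images - mean).T              # shape (N, M)

# Turk-Pentland: eigendecompose the small M x M matrix A^T A ...
eigvals, v = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]            # decreasing eigenvalue order
eigvals, v = eigvals[order], v[:, order]
keep = eigvals > 1e-10 * eigvals[0]          # drop the near-zero component
eigvals, v = eigvals[keep], v[:, keep]

# ... and map each eigenvector to an eigenimage u_i = A v_i of C = A A^T.
U = A @ v
U /= np.linalg.norm(U, axis=0)

# Project a mean-corrected image onto the first k eigenimages and reconstruct.
phi = images[0] - mean
k = 20
w = U[:, :k].T @ phi               # weights w_k = u_k^T Phi
reconstruction = U[:, :k] @ w + mean
```

With all retained eigenimages the reconstruction of a training image is exact, mirroring the perfect reconstruction reported above.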
The reconstruction showed a high level of similarity to the original (Fig. 3.11), which could be

Figure 3.10: The five highest ranked eigenimages in decreasing order.

Figure 3.11: Reconstruction of an atlas from the training set with an increasing number of eigenimages.

Figure 3.12: Reconstruction of a target with an increasing number of eigenimages.

further improved until a perfect reconstruction was achieved with all eigenimages. The reconstruction of a left-out target shared only local similarities with the original (Fig. 3.12), and the use of more eigenimages showed only incremental improvement. This is also reflected by the atlas identified as the most similar to the target, which shows only a limited degree of similarity (Fig. 3.13). Local differences become even clearer when applying the inverse of the reconstructed deformation field to the population template,

Figure 3.13: Target and the most similar atlas identified in the space of eigenimages.

Figure 3.14: Cumulative variance explained by the eigenimages.

Figure 3.15: The target obtained by applying the inverse of the reconstructed deformation field to the population template (left), the target obtained by applying the inverse of the original deformation field to the population template (middle), and the original image as reference (right).

yielding the individual target tissue map (Fig. 3.15, left). This reconstructed individual target tissue map is blurrier, and anatomical structures differ considerably from the target image (Fig. 3.15, right) and from the individual target image retrieved by applying the inverse of the original deformation field to the population template (Fig. 3.15, middle). The results suggest that a global approach, where an image or deformation field is reconstructed as a whole, is not sensitive enough to capture the local anatomical variability. Since PCA is not capable of detecting complex patterns in the data, more sophisticated nonlinear manifold embedding methods should be considered.

Comparison of nonlinear dimensionality reduction methods and parameter optimisation

The images of the LONI dataset were processed as outlined in Section 2.6. We used 53 labels of 30 atlases and 5 target images with low, medium and high expected segmentation accuracy. The two manifold learning methods LLE and Isomap were used to determine the ranking of the most similar atlases and the weights for label fusion. Both methods were parametrised with varying numbers of dimensions and neighbours. The label fusion was performed with two different fusion strategies, weighted fixed thresholding (WFT-1/sim) and weighted intensity thresholding (WIT-1/sim), which are explained and evaluated in Sections in more detail. Since the main focus of this section is on the impact of using a TST, we only provide a brief introduction to the fusion methods. Both methods assign a local weight (W) to each label of an atlas candidate. Each weight is derived from the distance between the projections of its corresponding atlas ROI and the target image in manifold space. We use two different ways of calculating the weights, which will be referred to as 1/sim and exp. The weighted sum of the probabilistic labels provides an estimate of the final segmentation.
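For one label, the weighted fusion step can be sketched as below; this is a toy example in which the 1/sim and exp weight variants defined in Chapter 4 are reduced to a simple inverse-distance form:

```python
import numpy as np

def fuse_candidates(prob_labels, manifold_dists):
    """Weighted sum of propagated probabilistic candidate labels, with
    weights derived from manifold distances (closer atlas = larger weight)."""
    w = 1.0 / np.asarray(manifold_dists, dtype=float)
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, prob_labels))

def binarise_fixed(fused, threshold=0.5):
    """Fixed-threshold binarisation (the FT variant; the IT variant would
    additionally gate voxels on the target intensity distribution)."""
    return (fused > threshold).astype(np.uint8)

# Toy example: three candidate probability maps for one label.
rng = np.random.default_rng(3)
candidates = [rng.random((8, 8, 8)) for _ in range(3)]
fused = fuse_candidates(candidates, manifold_dists=[0.4, 0.8, 1.6])
segmentation = binarise_fixed(fused)
```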
The final result was obtained by binarising this estimate either purely based on the probabilistic values and a fixed threshold (FT), or by also considering the intensity values of the target image in the ROI (IT). For the latter, we check whether the intensity value of a voxel falls within the intensity distribution estimated from high-probability regions of the fused candidate labels in target space. Both binarisation methods can also be applied to unweighted (U) probabilistic estimates. This leads to multiple combinations, namely unweighted fixed thresholding (UFT) and intensity thresholding (UIT), and weighted fixed thresholding (WFT-1/sim or WFT-exp) and intensity thresholding (WIT-1/sim or WIT-exp). Majority voting (MV) counts the number of votes each voxel receives from the candidates and assigns the label with the most votes to the voxel. A voxel gets a vote from a candidate when the probability of finding the label at that location is larger than 0.5. In the following experiments the weighted fusion method was used when a TST was constructed, and unweighted fusion or majority voting was used when no TST was constructed, since the weights are usually only available when we construct a TST. Local thresholding was used for validation on the IBSR dataset, where the binarisation is based on a locally determined threshold. We also provide the results when patch refinement was used for the NIREP dataset, to allow a direct comparison.

For both fusion strategies and manifold learning methods an improvement in segmentation accuracy was observed with an increasing number of dimensions, which flattened out after 7 dimensions and 7 neighbours (Fig. 3.16). The improvement was mainly caused by the increasing number of dimensions, while the number of neighbours had little influence on the accuracy with Isomap and no influence with LLE. These findings are in line with Duc et al.'s more comprehensive analysis [75], in which they found that improvement is mainly governed by the change in the number of dimensions rather than neighbours. In their study on hippocampus labels the maximum segmentation accuracy was reached with LLE, 11 dimensions and 23 neighbours on a dataset of 110 atlases, but it was pointed out that the choice of manifold embedding strategy and its parameters might differ based on the anatomical

Figure 3.16: LLE and Isomap were tested with varying numbers of dimensions d and neighbours n for the construction of the low-dimensional manifold embedding: (a) weighted fixed thresholding (WFT), (b) weighted intensity thresholding (WIT). Error bars represent the standard deviation. A paired t-test was used to assess statistical significance: one asterisk represents p < .05, two asterisks represent p < .01, and three asterisks represent p < .001.

structure and size of the dataset. In our experiments these settings showed no improvement, which might be caused by the smaller number of atlases and the varying size of the anatomical ROIs in the dataset. Interestingly, most studies have used 2 or 3 dimensions to allow simple visualisation of the connected manifold graphs, while the use of 3 dimensions was shown to perform worst in our study. For all following experiments we chose 7 dimensions and 7 neighbours, which was shown to provide robust results with both manifold embedding strategies and fusion methods. It also represents a reasonable choice considering the sizes of our datasets, where the smallest consists of only 13 atlases.

Method validation on the LONI dataset

In a cross-validation experiment the LONI dataset was divided into 4 subsets, each consisting of 30 atlases and 10 left-out target images. Isomap was used with the parameter settings from the previous section for the manifold learning. A visual comparison of the TST, population template and target, and of their respective deformation fields between target and TST and between target and population template, was performed. The overall accuracy gained by using a population template and our TST was compared to using only the population template for label propagation. The fusion for the latter method, without a TST, was performed with majority voting (MV), where each voxel is assigned the label with the most occurrences in the candidate labels (Section 4.1), and with unweighted fixed thresholding (UFT). When using the TST, the fusion was performed by weighted thresholding, where the weights were calculated with two methods, WFT-1/sim and WIT-1/sim (Section 4.9). For a more comprehensive evaluation of different fusion methods please see Section . A paired t-test was used to assess statistical significance.
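Segmentation accuracy is reported throughout as the Dice overlap between an estimated segmentation and the ground-truth labels; a minimal sketch of the metric (toy data):

```python
import numpy as np

def dice_overlap(seg_a, seg_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|), returned in percent."""
    a, b = seg_a.astype(bool), seg_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 100.0  # both segmentations empty: define as perfect overlap
    return 200.0 * np.logical_and(a, b).sum() / denom

# Toy check: two half-overlapping slabs of equal size.
a = np.zeros((10, 10, 10), dtype=bool); a[0:4] = True
b = np.zeros((10, 10, 10), dtype=bool); b[2:6] = True
print(dice_overlap(a, b))  # 50.0
```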
Figure 3.17 shows the same axial 2-D view cut through the 3-D grey matter tissue map of a participant from the LONI dataset, the LONI grey matter template, the participant's grey matter TST and the calculated deformation fields.

Figure 3.17: The same axial 2-D view cut through the 3-D grey matter tissue map of a subject. The TST is much more similar to the target than the population template, as evidenced by their corresponding deformation fields [13].

Note how the TST is more similar to the target image than the population template I, as evidenced by the visual comparison of the GM tissue maps. The magnitudes of the deformation fields estimated between the target and the population template, D_{T I} (bottom left), and between the target and the TST, D_{T TST} (bottom right), show that only a small deformation is required to align the TST and the target. The use of a TST can reduce the likelihood of getting stuck in local minima during the registration and, consequently, the occurrence of registration errors. Our best Dice overlap of 80.53 ± 1.17% over all targets and labels was achieved with a TST and our weighted voting and intensity thresholding method (Fig. 3.18). Paired samples t-tests showed a significant increase in accuracy between MV (79.14 ± 1.24%) and UFT (79.81 ± 1.07%) without a

Figure 3.18: Improvement in segmentation accuracy by using our TST on the LONI dataset. Error bars represent the standard deviation.

TST (p = ), between UFT without a TST and WFT-1/sim (80.07 ± 1.1%) with a TST (p = ), and between WFT-1/sim and WIT-1/sim (80.53 ± 1.17%) with a TST (p = ).

Method validation on the ADNI-HarP dataset

The ADNI-HarP dataset was divided into 5 subsets of 26 target images and 105 atlases each. Four images were excluded from the set due to problems with the orientation information. In a cross-validation experiment the Dice overlap was measured after direct label propagation and fusion without a TST, and compared to the overlap when using a TST. Fusion for the former was performed with simple UFT and for the latter with WFT and WIT. The fusion was performed with 40, 15, 10 and 5 candidates. Note that although no TST was used for the label propagation in the former, the manifold embedding and the ranking and selection of the most similar candidates were still performed. Segmentation accuracy when using a TST and the associated weights for fusion improved for all four candidate choices, compared to propagation

Figure 3.19: Improvement in segmentation accuracy by using our TST, different fusion methods and candidate choices on the ADNI-HarP dataset: (a) 5 candidates, (b) 10 candidates, (c) 15 candidates, (d) 40 candidates. Error bars represent the standard deviation.

Figure 3.20: Improvement in segmentation accuracy by using our TST and local thresholding method on the IBSR dataset. Error bars represent the standard deviation.

without a TST (p<.001) (Fig. 3.19). The intensity-based thresholding method WIT-exp showed superior results to fixed thresholding WFT-exp when a TST was used (p<.001).

Method validation on the IBSR dataset

In a cross-validation experiment the IBSR dataset was divided into 6 subsets of 3 images. Each fold was used as a target image set, with the remaining images as atlases. The CSF label was removed, as it was shown that its use leads to a strong bias in performance and accuracy [215], leaving a total of 31 labels. We compared segmentation accuracy when propagating atlas labels with our TST to label propagation without the TST. We used weighted fusion for the former, unweighted fusion for the latter, and local thresholding for both methods (Section 4.9). Segmentation accuracy was significantly improved from 83.6 ± 1.26% without a TST to ± 1.35% with a TST (p = ) (Fig. 3.20),

Figure 3.21: Improvement in segmentation accuracy by using our TST, different fusion methods (WFT-exp, WIT-exp) and patch-based refinement on the NIREP dataset. Error bars represent the standard deviation.

Method validation on the NIREP-NA0 dataset

We ran a 6-fold cross-validation experiment, where the 16 images were randomly split into 13 training images and 3 testing images. We compared segmentation accuracy achieved with our TST to the accuracy achieved without the TST. We used unweighted fixed thresholding (UFT) to fuse and threshold the images when no TST was used, and weighted fixed thresholding (WFT-exp) and intensity thresholding (WIT-exp) when a TST was used. We also tested the effect of patch-based refinement when using a TST and intensity thresholding [13]. The use of a TST significantly improved segmentation accuracy from initially ± 1.77% with UFT and no TST to ± 2.14% with WFT-exp (p = ), and further to ± 1.53% with WIT-exp (p = ) with a TST (Fig. 3.21). Patch refinement improved accuracy for both fusion strategies (p<.05), with a significant difference between WFT-exp, which achieved ± 2.07%, and WIT-exp, which achieved ± 1.49%

Figure 3.22: Improvement in segmentation accuracy by using our TST and different fusion and thresholding methods on the MICCAI 2012 dataset. Error bars represent the standard deviation.

(p = ).

Method validation on the MICCAI 2012 dataset

The dataset comprises 15 atlas and 20 target images and their corresponding label maps. We compared segmentation accuracy achieved with our TST to the accuracy achieved without the TST. We used UFT, WFT-exp, UIT and WIT-exp for candidate label fusion and thresholding for the former, and UFT and UIT for the latter. Our method reached a maximum Dice coefficient of ± 2.33% over all labels and targets when using a TST and WFT-exp (Fig. 3.22). It achieved significantly better results compared to WIT-exp (72.21 ± 1.91%) with a TST (p = ), and to both unweighted variants, UFT (71.89 ± 2.35%) and UIT (72.00 ± 1.95%), when no TST was used (p<.001).

Figure 3.23: Improvement in segmentation accuracy by using a TST and different fusion strategies on the MICCAI 2013 dataset. Error bars represent the standard deviation.

Method validation on the MICCAI 2013 dataset

The dataset was divided into 3 subsets of 12 images each. In each fold of the cross-validation study one subset was used for testing, with the remaining images as atlases. We compared segmentation results achieved without a TST to results with a TST. Unweighted fixed thresholding (UFT) and unweighted intensity thresholding (UIT) were used as fusion methods for the former. Both weighted fixed thresholding (WFT-exp) and weighted intensity thresholding (WIT-exp) were used with the exp weighting and a TST. The use of a TST and weights for the fusion significantly improved segmentation results compared to their unweighted variants without a TST (p<.001) (Fig. 3.23). The best result of ± 2.57% was achieved with WFT-exp, which showed significantly higher segmentation accuracy than intensity thresholding both with (p = ) and without the use of a TST (p = ).
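The paired t-tests used throughout these validations compare per-target accuracies of two methods measured on the same targets; a minimal sketch with synthetic scores (all numbers hypothetical), using scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
dice_no_tst = 80.0 + rng.standard_normal(20)                  # per-target Dice, no TST
dice_tst = dice_no_tst + 1.5 + 0.3 * rng.standard_normal(20)  # same targets, with TST

# Paired (dependent-samples) t-test: the two samples are matched per target,
# so the test is performed on the per-target differences.
t_stat, p_value = stats.ttest_rel(dice_tst, dice_no_tst)
print(t_stat, p_value)
```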

101 CHAPTER 3. TARGET-SPECIFIC TEMPLATE Discussion In this section we tested linear and nonlinear dimensionality reduction methods for the reconstruction of a target image. The reconstruction of an atlas target with eigenimages showed high similarity to the original with only a small set of eigenimages. In contrast, a non-atlas target showed barely improvement, even with increasing number of eigenimages. This might be caused by the relatively small set of atlases for training and the use of linear methods for dimensionality reduction and registration. It could also be related to the global approach, where the eigenimages were calculated from the whole images, showing a large variability. We improved our approach by using nonlinear registration methods, nonlinear dimensionality reduction methods, deformation fields as comparison basis and local ROIs for the construction of a TST with high similarity to the target. In our evaluation of nonlinear manifold embedding methods, Isomap showed slightly better results than LLE and was used for the remaining experiments. Duc et al. [75] achieved their best results with LLE but no significant difference to the results with Isomap were found. Although they presented a comprehensive evaluation of the parameters, their segmentation results were based on only one anatomical region. Parameters and overall segmentation results might differ between anatomical regions and regions of different size. In contrast we compared manifold embedding methods on 53 different anatomical ROIs, which provides a better overall performance evaluation. However, it still does not guarantee the selection of the best parameters for each ROI. In the future we are planning to investigate this challenging task further. Another difference to our study is their use of STA- PLE [229] for label fusion (Section 4.3 for details), compared to weighted fixed thresholding and intensity thresholding in our approach. 
Since STAPLE has shown mixed results depending on the ROIs [11], its use with only one ROI might be more prone to introducing a bias towards the best parameter configuration. Another question that remains is how many atlases are required to represent the manifold structure in enough detail to derive useful conclusions. In our experiments with Isomap we selected datasets with different cortical and non-cortical ROIs of varying size from healthy and diseased populations, which allowed tests with varying numbers of atlases. One drawback of our study is that the direct comparison between the eigenimage approach and nonlinear dimensionality reduction methods is difficult due to the use of different registration methods, and bases and scales of comparison. Although nonlinear methods, in general, outperform linear dimensionality reduction methods, it would be interesting to compare the results of eigenimages and Isomap side by side. Consistent significant improvement in segmentation accuracy was observed with all datasets when our TST was used as an additional intermediate step for the propagation of the labels, which is due to two main reasons. Firstly, our TST is much more similar to the target than the average population template because of its regional construction process, and refines the registration between the target and the population template. Consequently, the likelihood of registration errors is reduced. Secondly, the use of weights, determined from the distances in manifold space, and intensity thresholding allows more accurate fusion of the propagated labels. Intensity thresholding showed the best results for all datasets except for MICCAI 2012 and MICCAI 2013, where fixed thresholding was superior. This might be caused by reduced intensity contrast and poorly defined borders in sub-cortical ROIs, where the intensity profile is not as characteristic. It could also be related to the parameter selection for intensity thresholding. In Chapter 4 we will compare our overlap results to those from other methods as reported in the literature and discuss parameter selection for intensity thresholding in more detail.
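The two distance-to-weight mappings mentioned here (defined formally in Section 4.9 as the 1/D and exp schemes) can be sketched as follows; `manifold_weights` is a hypothetical helper name, and the final normalisation to unit sum is an assumption based on the sum-to-one constraint stated in Section 4.9:

```python
import numpy as np

def manifold_weights(distances, scheme="exp"):
    """Turn manifold-space distances between the target and its candidates
    into normalised fusion weights (closer candidates get larger weights)."""
    d = np.asarray(distances, dtype=float)
    if scheme == "exp":
        w = np.exp(-d / d.std())   # exp scheme: e^(-D_j / std(D))
    elif scheme == "1/D":
        w = 1.0 / d                # inverse-distance scheme
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()             # weights sum to 1

d = np.array([0.5, 1.0, 2.0, 4.0])
print(manifold_weights(d, "exp"))
print(manifold_weights(d, "1/D"))
```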
Compared to other state-of-the-art MAS propagation and fusion strategies, our method is computationally more efficient. While other top-performing methods require at least as many time-consuming nonlinear registrations as there are atlases, our method requires only two registrations at runtime.

In a runtime comparison for the MICCAI SATA challenge (15 atlases, 20 targets) our method required 2 nonlinear registrations with SPM's DARTEL per target at runtime. All other top-10 methods of the SATA challenge required pairwise registration of every atlas and target, i.e. 300 in total. In Chapter 4 we will also compare our results to methods from the literature. In our approach 3 deformation fields are composed: between an individual atlas and the population atlas, between the population atlas and the TST, and between the TST and the target. In the work by Sjöberg, Johansson and Ahnesjö [188], a decrease in segmentation accuracy was observed that was linear in the number of composed deformation fields. However, in their work they used a population template to find correspondences between atlases, with the given atlases serving as intermediate images. A deformable image registration method [124] with a coarse B-spline grid spacing of 8mm in all directions was used for the alignment. In our approach we improve the quality of the composed deformation fields by using our TST, with its large local similarity to the target, as intermediate image instead of arbitrary atlas images, and by using a registration method with millions of degrees of freedom (3 x #voxels) instead of a coarse grid of control points. Although, in general, the propagation of labels via composed deformation fields can be less accurate than via direct deformation fields, our results show that the selection and propagation of similar candidates with our TST as intermediate step can alleviate these drawbacks while increasing computational efficiency. In this chapter we have briefly introduced fusion methods, which will be discussed in more detail in the next chapter.

Chapter 4

Label Fusion

One of the essential elements of a MAS algorithm is the combination, i.e. fusion, of the propagated atlas labels (candidate labels). Based on the two approaches, single- and multi-atlas segmentation, described in the previous sections, the propagated atlas labels are fused either in average template space or in target space, respectively. In general, we can split combination strategies into global and local methods. Global methods assign the same weight to the labels of a candidate. One global method is majority voting, where the most frequently occurring label is assigned to a voxel. It has provided robust results in multiple studies [98, 99, 168]. Another popular method, initially developed for the fusion of manual segmentations and the evaluation of the human raters, is STAPLE [229]. In contrast to majority voting, weighted voting makes it possible to assign a different weight to the labels of each candidate [12, 107]. The weight is usually calculated from the similarity between each candidate image and the target globally. Global combination strategies cannot guarantee that the selected candidate and its associated weights are the best choice for every ROI. Locally weighted fusion is able to assign weights region-wise or voxel-wise. These include methods such as STEPS [47], joint label fusion [221] and patch-based label fusion [64]. However, the degree of locality is dependent upon

the local contrast of the ROI [11]. Voxel-wise weighting is highly sensitive to noise in low-contrast neighbouring ROIs, whereas global weighting is more robust. Conversely, in high-contrast ROIs local weights are superior.

4.1 Majority voting

The simplest label fusion strategy is majority voting, which counts the votes for each of the L possible labels at every voxel x [11, 119]:

$$f_i(x) = \sum_{k=1}^{K} w_{k,i}(x) \quad \text{for } i = 1, \dots, L$$

and assigns the most frequently occurring label to it:

$$E_{MV}(x) = \max \left[ f_1(x), \dots, f_L(x) \right] \tag{4.1}$$

where

$$w_{k,i}(x) = \begin{cases} 1, & \text{if } i = e_k(x) \\ 0, & \text{otherwise} \end{cases} \tag{4.2}$$

Although very simple, it is a powerful and robust fusion method [169]. However, it does not incorporate a measure of similarity between target and atlas images.

4.2 Weighted voting

In contrast to majority voting, where each candidate's vote has the same power, weighted voting [12] makes it possible to assign different weights to the candidates. Consequently, candidates which are expected to yield a higher segmentation accuracy get more power in the decision making. Label weights can be calculated locally or globally and are usually based on a chosen similarity metric. The global weights for expression 4.1 can be calculated as [11, 119]

$$w_{k,i}(x) = \begin{cases} m^p, & \text{if } i = e_k(x) \\ 0, & \text{otherwise} \end{cases} \tag{4.3}$$

where m is a measure of similarity and p a gain exponent to adjust the impact of the weight.
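The two voting rules of Eqs. 4.1-4.3 can be sketched in a few lines of NumPy; `majority_vote` and `weighted_vote` are illustrative names, and the global similarity weights are assumed to be precomputed:

```python
import numpy as np

def majority_vote(candidate_labels):
    """candidate_labels: (K, ...) integer array of K propagated label maps.
    Returns the most frequent label at every voxel (Eqs. 4.1/4.2)."""
    labels = np.unique(candidate_labels)
    # f_i(x): number of votes for each label i at every voxel
    votes = np.stack([(candidate_labels == l).sum(axis=0) for l in labels])
    return labels[np.argmax(votes, axis=0)]

def weighted_vote(candidate_labels, weights, p=1.0):
    """Globally weighted voting (Eq. 4.3): candidate k contributes m_k**p
    to its label's score instead of a unit vote."""
    labels = np.unique(candidate_labels)
    w = np.asarray(weights, dtype=float) ** p
    votes = np.stack([
        np.tensordot(w, (candidate_labels == l).astype(float), axes=1)
        for l in labels
    ])
    return labels[np.argmax(votes, axis=0)]

# Three candidates over a 3-voxel image
cands = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [2, 2, 2]])
print(majority_vote(cands))                   # → [1 2 2]
print(weighted_vote(cands, [0.1, 0.1, 0.9]))  # → [2 2 2]
```

Ties in `argmax` resolve to the smallest label; a full implementation would need an explicit tie-breaking rule.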

Local weights can be calculated as

$$w_{k,i}(x) = \begin{cases} [m(s, r)]^p, & \text{if } i = e_k(x) \\ 0, & \text{otherwise} \end{cases} \tag{4.4}$$

with the local similarity measure m(s, r), where s is the shape of the patch and r is its radius.

4.3 Simultaneous truth and performance level estimation (STAPLE)

One of the main underlying problems of a MAS algorithm is the inter- and intra-operator variability introduced by manual segmentation, which is, however, still considered the gold standard. Due to the propagation and fusion of these labels, one of the main error sources in MAS is inaccurate manual segmentation. STAPLE [229], initially designed to combine and rate expert manual segmentations and later applied in MAS, addresses this issue by computing a probabilistic estimate of the true segmentation and providing a measure of the performance level of each of the candidates. In the context of MAS a candidate is considered to be an individual atlas. In line with the weighted voting algorithm, candidates with a higher performance should be assigned higher weights. However, the weights in STAPLE do not take the similarity between intensity images into account. Instead, the fusion algorithm calculates a confusion matrix. It uses an expectation-maximisation algorithm to calculate weights as a function of the estimated sensitivity $p_i$ and specificity $q_i$ characteristic for each candidate: $w_i = f(p_i, q_i)$, where the performance of a candidate can be summarised as $\theta_i = (p_i, q_i)^T$. Here we consider the binary case, which can be extended to multiple labels. Optimal weights are determined based on the estimated performance level $\theta = [\theta_1, \theta_2, \dots, \theta_R]$ of each of the R candidates, an a priori model for the spatial distribution of correlated ROIs and spatial homogeneity constraints

enforced by a Markov random field model. The goal is to maximise the log-likelihood cost function of the complete data (D, T), where T is the true hidden segmentation and D holds the candidate segmentations, and can be described as

$$(\hat{p}, \hat{q}) = \arg\max_{p,q} \log f(D, T \mid p, q) \tag{4.5}$$

The true segmentation T is unknown. The core algorithm iterates through two main expectation-maximisation (EM) steps. First, in the E-step, the conditional probability $w_i^k$ of the true segmentation being 1 at voxel $i$ at iteration $k$ is estimated by

$$w_i^k \equiv f(T_i = 1 \mid D_i, p^k, q^k) = \frac{a_i^k}{a_i^k + b_i^k} \tag{4.6}$$

with

$$a_i^k \equiv f(T_i = 1) \prod_j f(D_{ij} \mid T_i = 1, p_j^k, q_j^k) \tag{4.7}$$

$$b_i^k \equiv f(T_i = 0) \prod_j f(D_{ij} \mid T_i = 0, p_j^k, q_j^k) \tag{4.8}$$

Second, in the M-step, sensitivity and specificity (p, q) are estimated by

$$p_j^{k+1} = \frac{\sum_i w_i^k D_{ij}}{\sum_i w_i^k} \tag{4.9}$$

$$q_j^{k+1} = \frac{\sum_i (1 - w_i^k)(1 - D_{ij})}{\sum_i (1 - w_i^k)} \tag{4.10}$$

STAPLE was mainly designed to combine small numbers of candidate labels and has been shown to produce mixed results depending on the ROIs and dataset, for which it received criticism [11, 66]. Another limitation of STAPLE is its purely probabilistic approach, which does not take local intensity similarity measurements into account. In an improved version of the method, called non-local STAPLE [19], the initial statistical fusion algorithm was merged with a non-local approach that incorporates patch-wise weight estimation (for more details on patch-based fusion see Section 4.8).
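A minimal sketch of the binary STAPLE EM loop (Eqs. 4.5-4.10), assuming a fixed global prior f(T_i = 1) and omitting the MRF spatial regularisation; `staple_binary` is an illustrative name, not the original implementation:

```python
import numpy as np

def staple_binary(D, n_iter=30, prior=None):
    """Binary STAPLE without the MRF prior.
    D: (R, V) array of R candidate binary segmentations over V voxels.
    Returns (w, p, q): soft estimate of the true segmentation,
    per-candidate sensitivities and specificities."""
    R, V = D.shape
    ft1 = D.mean() if prior is None else prior  # fixed global prior f(T_i = 1)
    p = np.full(R, 0.9)  # initial sensitivities
    q = np.full(R, 0.9)  # initial specificities
    for _ in range(n_iter):
        # E-step (Eqs. 4.6-4.8): f(D_ij|T=1) = p_j if D_ij=1 else 1-p_j,
        #                        f(D_ij|T=0) = q_j if D_ij=0 else 1-q_j
        a = ft1 * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - ft1) * np.prod(np.where(D == 1, 1 - q[:, None], q[:, None]), axis=0)
        w = a / (a + b + 1e-12)
        # M-step (Eqs. 4.9-4.10)
        p = (D * w).sum(axis=1) / (w.sum() + 1e-12)
        q = ((1 - D) * (1 - w)).sum(axis=1) / ((1 - w).sum() + 1e-12)
    return w, p, q

# Three raters over six voxels; rater 3 disagrees at two voxels
D = np.array([[1, 1, 0, 0, 1, 0],
              [1, 1, 0, 0, 1, 0],
              [0, 1, 1, 0, 1, 0]])
w, p, q = staple_binary(D)
print((w > 0.5).astype(int))  # estimated true segmentation
```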

4.4 Selective and iterative method for performance level estimation (SIMPLE)

SIMPLE [129] aims to combine the advantages of atlas selection, where only a subset of the most promising candidates is used for fusion, and the truth and performance level estimation presented by STAPLE. SIMPLE's core algorithm, just like STAPLE's, iterates through estimating the performance of each rater and estimating the ground truth segmentation but, in contrast to STAPLE, candidates with a low performance are not considered in future iterations. The exclusion of bad candidates limits their influence on the final segmentation. The algorithm starts by estimating the performance of each candidate image as its NMI similarity to the target image. Then all candidate labels are fused into a ground truth estimate $L_{est}$ with majority voting. These initial performance and ground truth estimates are refined by iterating through 3 main steps. First, the performance of each candidate label at iteration k is estimated by measuring the overlap of each candidate label $L_i$ and $L_{est}$ as $\Phi_i = f(L_i, L_{est}^k)$. Second, candidate labels are only included in further iterations, $L_i \in L^{k+1}$, if $\Phi_i > \Theta$. Third, the new segmentation $L_{est}^{k+1}$ is calculated with majority voting. SIMPLE has been shown to outperform STAPLE and locally weighted methods based on NMI, but two main drawbacks can be identified. Due to its strong dependence on the initial ground truth estimate, the method could fail if the candidate labels are poorly aligned. And the method works globally, which might lead to the exclusion of useful local information.

4.5 Similarity and truth estimation for propagated segmentations (STEPS)

Another method which evolved from STAPLE is STEPS [47]. In contrast to the global approach in STAPLE, STEPS selects and fuses the locally

best ranked templates based on locally normalised cross-correlation (LNCC). For this, a new hidden parameter was introduced which equals 1 if the candidate is top-ranked for a particular location. The concept of sensitivity and specificity was extended to multiple classes by introducing a confusion matrix which provides a measure of agreement or disagreement between the candidate segmentations and the fused segmentation.

4.6 MALP

A different approach was taken by Ledig et al. [131], in which the spatial information provided by MAS is only used as a spatial prior for an intensity model solved by EM optimisation. Assuming that the voxel intensities belonging to one of the K labels are normally distributed, the goal can be formulated as finding the unknown segmentation $z_i$ based on the image intensity distribution. The conditional probability of finding an intensity $y_i$ at a voxel $i$ can be computed as

$$f(y_i \mid \phi) = \sum_k f(y_i \mid z_i = e_k, \phi) f(z_i = e_k) \tag{4.11}$$

where $f(z_i = e_k)$ is the probability of voxel $i$ belonging to label $k$ and $f(y_i \mid z_i = e_k, \phi)$ is the Gaussian distribution of intensity values belonging to label $k$, described by mean and standard deviation $\phi = \{(\mu_1, \sigma_1), \dots, (\mu_K, \sigma_K)\}$. The EM algorithm iterates through calculating the label probability and maximising $f(y \mid \phi)$. The model was further refined by incorporating a priori spatial information from MAS using Markov random fields (MRF).

4.7 Joint label fusion

Most weighted atlas fusion strategies calculate the weight for each candidate label independently as the similarity between the candidate image and the target. Consequently, the same label errors, produced by different candidates, are difficult to detect and will contribute to the final segmentation.

The aim of joint label fusion [221] is to reduce these correlated segmentation errors by finding the optimal weights to reduce the expected total error. The joint error distribution of two candidate labels is estimated based on the similarity between their respective candidate images and the target. The voting weights for n candidates are obtained by

$$w_x = \frac{M_x^{-1} \mathbf{1}_n}{\mathbf{1}_n^t M_x^{-1} \mathbf{1}_n} \tag{4.12}$$

where $\mathbf{1}_n = [1; 1; \dots; 1]$ and $M_x$ is the pairwise dependency matrix, which is estimated from the intensity similarities of each candidate pair $(F_i, F_j)$ and the target image $F_T$:

$$M_x(i, j) = p\left(\delta^i(x)\delta^j(x) = 1 \mid F_T(y), F_i(y), F_j(y), y \in N(x)\right) \approx \left[ \sum_{y \in N(x)} \left| F_T(y) - F_i(y) \right| \left| F_T(y) - F_j(y) \right| \right]^{\beta} \tag{4.13}$$

where N(x) is the neighbourhood for comparison around a voxel x. The approach was further extended [222] by a method to correct systematic errors [220] introduced by the segmentation model or optimisation algorithm. In the case of one label, one AdaBoost classifier is trained on the atlas image set to discriminate between correctly and incorrectly labelled voxels. After the candidates are fused, the classifier checks the label of each voxel and potentially corrects it.

4.8 Patch-based label fusion

The joint label fusion algorithm requires the estimation of a dependency matrix, which is created by measuring the similarity of the local neighbourhood around each voxel in the candidates and the target. This simple local similarity measurement method is generally referred to as patch-based comparison. The patch-based algorithm is based on Buades et al.'s [41] non-local means

algorithm, initially designed for image denoising, which was later adapted for use as a label fusion method [64]. The method compares the patch $P(x_i)$ around each voxel $x_i$ in the target image to the patches $P(x_{s,j})$ within a search volume $V_i$ around the corresponding voxel $x_{s,j}$ in each of the N atlas images. Based on image intensity similarities between patches, weights $w(x_i, x_{s,j})$ are assigned to the corresponding atlas labels $y_{s,j}$, resulting in the probability

$$v(x_i) = \frac{\sum_{s=1}^{N} \sum_{j \in V_i} w(x_i, x_{s,j}) \, y_{s,j}}{\sum_{s=1}^{N} \sum_{j \in V_i} w(x_i, x_{s,j})} \tag{4.14}$$

with the weights

$$w(x_i, x_{s,j}) = \begin{cases} \exp\left( - \frac{\left\| P(x_i) - P(x_{s,j}) \right\|_2^2}{h} \right), & \text{if } ss > th \\ 0, & \text{else} \end{cases} \tag{4.15}$$

Atlas patches are pre-selected based on their structural similarity $ss = \frac{2 \mu_i \mu_{s,j}}{\mu_i^2 + \mu_{s,j}^2} \cdot \frac{2 \sigma_i \sigma_{s,j}}{\sigma_i^2 + \sigma_{s,j}^2}$ [227] to the target patch, discarding potentially unsuitable patches when it is less than a pre-defined threshold th. Additionally, a pre-selection step can be applied to discard atlases as a whole, and the ROI for patch comparison can be reduced to the specific anatomy to improve efficiency. A main advantage of the method is the use of redundant information from an increased number of samples, taken from different atlases, to make the decision more robust. Consequently, the impact of registration errors is reduced due to the patch comparisons within a local neighbourhood. The method was initially proposed with only linearly aligned atlases, without requiring a time-consuming nonlinear registration. A more efficient approach has been proposed by Rousseau et al. [174], where, instead of comparing patches around every voxel, only a subset of patches is considered and, rather than determining a label only for the centre voxel of each patch, a label estimate can be provided for every voxel within the patch. The basic patch-based approach has been further extended by modelling a

patch as a sparse linear superposition of patches [237, 243], which can be solved with linear regression. The resulting reconstruction coefficients can then be used as weights for the label fusion. A probabilistic patch-based model was proposed by Bai et al. [25], which incorporated label information into the registration process to improve both segmentation and registration accuracy.

4.9 Our approach to propagation, fusion and binarisation

Equipped with a TST, we can now non-linearly register the affinely transformed target image T to it to get the deformation field $D_{T \to TST}$, using SPM's DARTEL in a non-iterative fashion for maximum efficiency. Since the TST is very similar to T, this non-linear registration should be very accurate. For each label r we then warp all individual label maps $\{L_j^r\}_j$ onto the target by using the composed transformation $D_{T \to TST} \circ D_{TST \to I} \circ D_{\tilde{I}_j \to I}^{-1}$ and trilinear interpolation to get N candidate segmentations $\{\tilde{L}_j^r\}_{j=1 \dots N}$. We then use a local fusion strategy and compute for each label the (1) unweighted (UW), where each candidate segmentation receives the same weight, and (2) weighted (W) sum of the candidate segmentations with the same weights we used for TST construction: $\bar{L}^r = \sum_{j=1}^{N} \omega_j^r \tilde{L}_j^r$ with $\sum_{j=1}^{N} \omega_j^r = 1$. Note that the unweighted method uses the probabilistic values for fusion and ranked candidates, in contrast to majority voting, which uses binarised labels and counts the occurrences. The weights for the weighted scheme were calculated either as $\omega_j^r = \frac{1}{D_j}$ (1/D) or $\omega_j^r = e^{-D_j / \mathrm{std}(D)}$ (exp) for comparison, where D is the distance matrix and $D_j$ is the distance between the target instance and the j-th candidate in manifold space. We project those back onto the space of the input target image by applying the inverse of the previously estimated affine transformation using trilinear interpolation. Finally, we threshold the probabilistic values so they can be compared

against ground truth delineations. Four thresholding strategies were implemented and tested:

1. A fixed threshold (FT) θ is used to binarise the probabilistic values:

$$Y(x) = \begin{cases} 1, & \text{if } \bar{L}^r(x) > \theta \\ 0, & \text{else} \end{cases} \tag{4.16}$$

2. We use the probabilistic values and also consider the intensity values of the target to further refine the segmentation (IT). We check, similarly to Doshi et al.'s approach [74], whether the intensity of a voxel falls within the intensity distribution, given by the mean and a multiple c of the standard deviation, estimated from those areas of the target image with a probability of at least δ% of the maximum probability: $Y = \{x \mid \bar{L}^r(x) > \delta \max(\bar{L}^r)\}$. Empirical tests showed that this refinement did not systematically improve performance for non-cortical regions, which is consistent with Ledig et al.'s findings [131] and is probably due to the fact that tissue contrast around cortical regions is much stronger than around subcortical structures. We further improved the method by checking whether the probability of the identified voxels exceeds the threshold defined by $\epsilon^r(x) = \mathrm{std}(\bar{L}^r)$.

3. A local threshold (LT) $\delta^r(x) = \mathrm{std}_{y \in N(x)}(\bar{L}^r(y))$ is calculated from the probabilistic label maps, where N(x) is a neighbourhood around voxel x whose size is dictated by that of the label, with larger labels yielding larger neighbourhoods (see Section ).

4. In [13] we presented an efficient hierarchical method based on our approach with a TST. After regional weighted candidate fusion with a fixed threshold we further refine regions below a predefined threshold, since these are usually more likely to be misclassified. In these regions we recalculate the weights locally with a patch-based approach (Section 4.8), providing a second set of locally refined classification probabilities.
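The first two thresholding strategies can be sketched as follows; this is a simplified illustration under stated assumptions: the helper names and the toy volume are ours, and the exact handling of the search region in the thesis differs in detail:

```python
import numpy as np

def fixed_threshold(L_fused, theta=0.33):
    """FT (Eq. 4.16): binarise the fused probabilistic label map."""
    return (L_fused > theta).astype(np.uint8)

def intensity_threshold(L_fused, target, delta=0.1, c=3.0):
    """IT sketch: estimate the target intensity distribution from voxels
    with probability > delta * max probability, then keep voxels whose
    probability exceeds std(L_fused) and whose intensity lies within
    mean +/- c * std of that distribution."""
    high_prob = L_fused > delta * L_fused.max()
    mu, sigma = target[high_prob].mean(), target[high_prob].std()
    in_range = np.abs(target - mu) <= c * sigma
    eps = L_fused.std()  # probability floor, epsilon^r in the text
    return ((L_fused > eps) & in_range).astype(np.uint8)

# Toy example: a bright structure with a smooth probability map around it
rng = np.random.default_rng(0)
target = rng.normal(50, 5, (8, 8, 8)); target[2:6, 2:6, 2:6] += 60
L = np.zeros((8, 8, 8)); L[2:6, 2:6, 2:6] = 0.9; L[1:7, 1:7, 1:7] += 0.05
print(fixed_threshold(L, 0.33).sum())        # → 64 voxels in the core cube
print(intensity_threshold(L, target).sum())  # core voxels passing both tests
```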

4.10 Experiments

We tested several fusion methods with the approach described in Section 3.4. For some of them we provide the segmentation accuracy achieved without a TST as a reference. The parameter values used for the fusion and thresholding methods will be provided for each experiment.

Evaluation of fusion and thresholding strategies on the LONI dataset

Various fusion strategies were tested in a cross-validation experiment with four subsets of 30 atlas and 10 target images each. One subset was used to empirically determine the parameters for intensity thresholding by varying the values for the coefficient c and the threshold δ. The intensity distribution was sampled from those areas in the target with the highest likelihood (determined by δ) of being part of the label, as given by the fused candidate labels. Then two variants of the method were tested. First, each target voxel within the search region, defined by probability values larger than zero, was compared to the distribution to make a final decision. Second, the search region was further limited to voxels with a probability of at least the standard deviation of the fused label probabilities. Once the best parameters were found, the fusion and thresholding strategies MV, WFT-exp, WFT-1/sim, WIT-exp and WIT-1/sim were compared with a fixed threshold θ = 0.33 for FT methods, and δ = 0.4 and c = 4 for IT methods. Our variant of using a refined search region (wt) showed improvement over the basic variant of using just the intensity distribution (Fig. 4.1). A steep incline can be noticed from c = 1 to c = 2 for both variants and all choices of δ, which flattens faster for higher values of δ. The basic variant showed a decline with increasing coefficients, which is not the case with the refined variant, which is also governed by the label probability. Consequently, for larger values of c and δ the intensity distribution has less impact. The best result was obtained with c = 4 and δ = 0.4.
In contrast to other datasets, for which we consistently used c = 3 and δ = 0.1, these parameters have large values, which is due to the wide intensity distribution of the LONI labels. Some of these labels contain multiple tissue types such as GM and WM.

Figure 4.1: Empirical evaluation of the intensity thresholding parameters. The coefficient c is systematically increased on the x-axis with varying values for δ. Both intensity thresholding variants, with (wt) and without refinement of the search region, were tested.

A segmentation accuracy of ±1.24% was achieved with MV without a TST and is provided as a reference (Fig. 4.2). The use of a TST did not improve accuracy when using MV. MV might not be sensitive enough to detect changes in the probability because the probabilistic values are binarised before the label frequency count is performed. However, weighted methods such as WFT and WIT, which use the probabilistic values, were capable of detecting these changes. Significant improvements in segmentation accuracy were achieved with WFT-exp over MV (p = ), WFT-1/sim over WFT-exp (p = ), WIT-exp over WIT-1/sim

Figure 4.2: Comparison of fusion and weighting strategies on the LONI dataset with a TST. The overlap achieved with MV without a TST is given as a reference. Error bars represent the standard deviation.

Figure 4.3: The histogram outlines the number of anatomical ROIs of the LONI dataset that fall within a certain range of accuracy. For each bin and for each fusion method the number of ROIs is provided.

(p = ), UIT over WIT-exp (p = ) and WIT-1/sim over UIT (p = ). The best result of ± 1.17% was achieved with

WIT-1/sim weighted voting and intensity thresholding. Although a statistically significant improvement was found between the three weighted and unweighted intensity thresholding methods, UIT, at ± 1.18%, still showed a high level of accuracy. Since intensity thresholding also takes the intensity distribution into account, it can alleviate the impact of registration errors. The cortical labels of the LONI dataset contain white and grey matter tissue, which might not allow a clear classification of voxels due to the large intensity distribution in high-probability ROIs. Another reason that might explain the inferior result of WIT-exp compared to UIT is the quality of the registrations. In Klein et al.'s [122] evaluation of registration methods, DARTEL was ranked sixth for the LONI dataset, while performing excellently on two other datasets. Consequently, the deformation fields might not be precise enough to derive reliable weights with WIT-exp. The histogram in Fig. 4.3 shows the accuracy improvement for the LONI ROIs achieved by using fixed and intensity thresholding methods. For MV and fixed thresholding methods a large proportion of the 53 ROIs reached an accuracy of less than or equal to 78%. With the use of intensity-based methods a shift towards higher accuracy is noticeable, with WIT-1/sim and UIT having the highest number of ROIs above 86% compared to the other fusion methods. Also noticeable is the large number of ROIs for MV in the 84-86% bins, which matches or exceeds the counts of the highest-ranked methods. This compares favourably against the results reported by Wu et al. [237], who implemented three patch-based methods, their own (SCPBL) as well as those by Coupé et al. [64] (PBL) and by Rousseau et al. [174] (SPBL), and tested all three on the LONI dataset. The reported Dice scores were ± 2.35% for PBL, ± 1.96% for SPBL and ± 1.34% for SCPBL, all inferior to ours.
This also outperforms the recently proposed method by Zikic et al. [245], who reached ± 4.53%, at a comparable computation time. Note that even our simpler weighted voting method with fixed thresholding achieved results (80.07% ± 1.1%) comparable to Zikic's method. Our

method was slightly inferior to the one presented by Wu et al. [236], which achieved ± 2.25% but used only one set of test images for evaluation, while we performed cross-validation experiments. Better results were also achieved by Ma et al. [136], with 82.56% ± 4.22%, but at great computational cost, since they require pairwise registration between each atlas and target, and the comparison of local patches. Considering one subset of 30 atlases and 10 targets, their method would take 300 nonlinear registrations at runtime, compared to only 20 nonlinear registrations with our method.

Evaluation of fusion and thresholding strategies on the ADNI-HarP dataset

The dataset was divided into 5 subsets of 26 images each and a TST was constructed for each target. Four methods (UFT, WFT-exp, UIT, WIT-exp) were tested with 1, 3, 5, 10, 15 and 40 candidates on one of the subsets. The fixed threshold θ was set to 0.35 for the methods with FT. For methods with IT, δ was set to 0.1 and the coefficient c to 3. With the same settings, we performed a cross-validation experiment on all subsets for 5, 10, 15 and 40 candidates. It should be noted that even though each candidate was assigned the same weight in the unweighted methods, the ranking and selection of the candidates was performed with manifold learning. The overall best result of ± 2.89% for both ROIs on one subset was achieved with 15 candidates and unweighted intensity thresholding (UIT). Both the unweighted (UIT) and weighted intensity thresholding (WIT-exp) methods showed a significant improvement in segmentation accuracy over unweighted (UFT) and weighted fixed thresholding (WFT-exp) for all candidate choices (p<.001). The use of all 40 candidates led to a slight decrease in accuracy for both unweighted methods but an increase for WIT-exp.
The observed decrease is consistent with the findings in the literature and can be explained by the introduction of unsuitable atlases and, in turn, segmentation errors. The increase might be caused by the weighting scheme, which is

Figure 4.4: Comparison of fusion strategies and weighting schemes with varying numbers of candidates on one subset of the ADNI dataset with a TST.

capable of identifying unsuitable atlases and assigning them a lower weight. When using only 5 candidates and UIT, we still reached a Dice overlap over 84%. When further reducing the number to 3 candidates, intensity-based methods were still superior (83.5%) compared to the best overall results achieved with fixed thresholding (83.27%). This shows that intensity thresholding is efficient, robust to the number of candidates and robust to low-contrast ROIs such as the hippocampus. Surprisingly, the unweighted strategies (UFT, UIT) achieved similar or even slightly better results than the weighted strategies (WFT-exp, WIT-exp) for 15, 10 and 5 candidates. One reason might be that the empirical tests for determining the intensity thresholding parameters were performed with 40 candidates, where the weighted strategies slightly outperformed the unweighted strategies. Another reason for the similar results achieved with unweighted and weighted methods might be that the ranking and selection of the most similar atlases in manifold space was performed for both strategies. The large number of atlases in the training set increases the chance of finding similar candidates to the target, which would lead to evenly distributed

Figure 4.5: Evaluation of fusion methods for each ROI (left and right hippocampus) with (a) 40, (b) 15, (c) 10 and (d) 5 candidates on the ADNI dataset. Error bars represent the standard deviation.

weights for the weighted strategies. Consequently, the use of weighted methods would have a similar effect as the use of unweighted methods, where all candidates are assigned the same weight. The ADNI-HarP dataset was also used in Platero and Tobar's study [157] for the evaluation of their registration- and patch-based method. They presented the results achieved with FreeSurfer [80] and with their own approach. FreeSurfer reached an overall Dice coefficient of 78.2 ± 4.1%, which is inferior to our maximum of ± 2.89%. To obtain a similar level of accuracy to FreeSurfer, our method would require only 1 candidate with a fixed thresholding method. However, it should be noted that FreeSurfer uses a training dataset with different anatomical definitions. Platero's own method achieved ± 4.5%, which is slightly better than our best results. However, it requires the computationally expensive direct nonlinear registration between each selected atlas and the target and, additionally, patch refinement. Patch-based refinement as a post-processing step could also be added to our processing pipeline, as outlined in Section 4.9 and tested in Section . Benkarim et al. [31, 30] used a set of ADNI images of similar size to ours and tested MV, STAPLE, STEPS, joint label fusion and their own learning-based method SCMWF2 with different registration methods. In their best configuration they reached 76.7 ± 4.9%, 76.8 ± 5.8%, 79.9 ± 4.3%, 86.0 ± 3.7% and 86.6 ± 2.6%, respectively. Our results are superior to all except joint label fusion and SCMWF2. However, in SCMWF2 the correspondence between each of the atlases and targets was achieved via the creation of an average population atlas and the composition of the corresponding deformation fields. This population atlas was constructed from both atlas images and target images.
Consequently, to find the correspondence between a new target and each atlas, an average population template would have to be constructed for each target at runtime, leading to enormous computational costs. The only alternative would be to directly register each atlas and target at runtime. In both articles SCMWF2 was only tested on datasets with sub-cortical ROIs.
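The comparisons above are all stated as Dice overlaps between fused, binarised labels and a manual ground truth. A minimal NumPy sketch of weighted-voting fusion, fixed thresholding and Dice scoring — with toy one-dimensional masks and illustrative weights, not the actual pipeline — could look as follows:

```python
import numpy as np

def weighted_fusion(candidates, weights):
    """Fuse binary candidate segmentations into a probabilistic label map."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise the candidate weights
    return sum(wi * c.astype(float) for wi, c in zip(w, candidates))

def dice(a, b):
    """Dice overlap between two binary masks, in percent."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 100.0 if denom == 0 else 200.0 * (a & b).sum() / denom

# Three toy candidate segmentations of a 1D "scan" and a ground-truth mask.
c1 = np.array([1, 1, 1, 0, 0])
c2 = np.array([1, 1, 0, 0, 0])
c3 = np.array([1, 0, 0, 1, 0])
truth = np.array([1, 1, 1, 0, 0])

prob = weighted_fusion([c1, c2, c3], weights=[0.5, 0.3, 0.2])
label = prob >= 0.35  # fixed thresholding with theta = 0.35
print(dice(label, truth))  # 100.0
```

In the weighted variants (WFT-exp, WIT-exp) the weights would be derived from the similarity between each warped atlas and the target rather than fixed as here, and in the unweighted variants all candidates receive the same weight.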

Figure 4.6: Comparison of fusion and weighting strategies (UIT, WIT-exp, UFT, WFT-exp, WLT-exp; Dice overlap in %) on one subset of the IBSR dataset with a TST. Error bars represent the standard deviation.

Evaluation of fusion and thresholding strategies on the IBSR dataset

The IBSR dataset was divided into 6 subsets of 3 images each. UIT, UFT, WIT, WFT and weighted local thresholding (WLT) were applied to all subsets after constructing a TST for each target. The fixed threshold θ was set to 0.35 and the intensity thresholding parameters to δ = 0.1 and c = 3. The CSF label was excluded as it has been shown to introduce a bias [215]. Both weighted methods (WFT-exp, WIT-exp) outperformed their unweighted variants (UFT, UIT) (Fig. 4.6), but no statistically significant difference was found (p > .05). Both fixed thresholding methods showed significantly higher segmentation accuracy than the intensity thresholding method (p < .001). This could be caused by the dataset-specific labels, which contain multiple tissue types within the same label (e.g. the GM label also contains parts of CSF, WM and background), and by poor contrast between ROIs. This also questions the viability of the dataset for the measurement of segmentation accuracy. The best results of ± 1.35% were obtained with weighted local thresholding (WLT), which outperforms Zikic et al.'s method [245], which achieved 83.5 ± 4.2%. We also outperformed the method by Rousseau et al. [174], which reached an overall Dice score of 83.5%. The performance of our method is, however, slightly inferior to that recently reported by Doshi et al. [74], ranked first in the MICCAI 2013 segmentation challenge, which reached ± 1.3% with similarity ranking and boundary modulation. However, their approach requires two registrations between each of the selected atlas images and the target at runtime, which makes it computationally very expensive. Considering one target image and 15 atlases, it would require them 30 nonlinear registrations at runtime, compared to only 2 nonlinear registrations with our method. They also reported their results when using MV and global similarity ranking without the boundary modulation, for which they achieved ± 1.36% and ± 1.3%, respectively. Both are inferior to our results, at much higher computational costs.

Evaluation of fusion and thresholding strategies on the NIREP-NA0 dataset

The dataset was divided into 6 subsets of 3 images each. In a cross-validation experiment a TST was constructed for each target and two thresholding methods (WFT-exp, WIT-exp) were tested with all candidates. The fixed threshold θ was set to 0.35 for FT methods. For methods using IT, δ was set to 0.1 and the coefficient c to 3. In addition, we tested a larger δ of 0.26 and evaluated the impact of patch-refinement for regions with a probability below 0.4. Figure 4.7 shows a slice through a target scan with the manual ground truth label in yellow (left) and the fused candidate segmentations in green (middle). Based on the pre-defined threshold we divided the fused label into high and

low probability regions, as indicated in blue and red respectively (right). For this particular ROI the number of patch-comparisons could be reduced by a half, from approximately to . The improvement achieved by the patch-refinement is illustrated in Figure 4.8. After the refinement the label provided a more detailed and crisper outline of the anatomical structure.

Figure 4.7: Slice through one of the target images with one ground truth label (left), the fused candidate segmentations (middle), and high (blue) and low (red) probability regions derived thereof (right).

Figure 4.8: Enlarged ROI with the probabilistic label created by weighted fusion of the candidate segmentations before the patch-refinement (left), the absolute change in probabilities achieved by the patch-refinement (middle), and the final segmentation result after the patch-refinement (right). The white arrows indicate regions where substantial improvement was achieved.

A significant improvement (p<.001) was observed between weighted fixed


Object-Based Classification & ecognition. Zutao Ouyang 11/17/2015 Object-Based Classification & ecognition Zutao Ouyang 11/17/2015 What is Object-Based Classification The object based image analysis approach delineates segments of homogeneous image areas (i.e., objects)

More information

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 26, NO. 10, OCTOBER

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 26, NO. 10, OCTOBER IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 26, NO. 10, OCTOBER 2017 4753 Detecting Anatomical Landmarks From Limited Medical Imaging Data Using Two-Stage Task-Oriented Deep Neural Networks Jun Zhang,

More information

Automatic Optimization of Segmentation Algorithms Through Simultaneous Truth and Performance Level Estimation (STAPLE)

Automatic Optimization of Segmentation Algorithms Through Simultaneous Truth and Performance Level Estimation (STAPLE) Automatic Optimization of Segmentation Algorithms Through Simultaneous Truth and Performance Level Estimation (STAPLE) Mahnaz Maddah, Kelly H. Zou, William M. Wells, Ron Kikinis, and Simon K. Warfield

More information

Texture Based Image Segmentation and analysis of medical image

Texture Based Image Segmentation and analysis of medical image Texture Based Image Segmentation and analysis of medical image 1. The Image Segmentation Problem Dealing with information extracted from a natural image, a medical scan, satellite data or a frame in a

More information

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight Three-Dimensional Object Reconstruction from Layered Spatial Data Michael Dangl and Robert Sablatnig Vienna University of Technology, Institute of Computer Aided Automation, Pattern Recognition and Image

More information

Statistical Shape Analysis of Anatomical Structures. Polina Golland

Statistical Shape Analysis of Anatomical Structures. Polina Golland Statistical Shape Analysis of Anatomical Structures by Polina Golland B.A., Technion, Israel (1993) M.Sc., Technion, Israel (1995) Submitted to the Department of Electrical Engineering and Computer Science

More information

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Sea Chen Department of Biomedical Engineering Advisors: Dr. Charles A. Bouman and Dr. Mark J. Lowe S. Chen Final Exam October

More information

A Registration-Based Atlas Propagation Framework for Automatic Whole Heart Segmentation

A Registration-Based Atlas Propagation Framework for Automatic Whole Heart Segmentation A Registration-Based Atlas Propagation Framework for Automatic Whole Heart Segmentation Xiahai Zhuang (PhD) Centre for Medical Image Computing University College London Fields-MITACS Conference on Mathematics

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Multiple Sclerosis Brain MRI Segmentation Workflow deployment on the EGEE grid

Multiple Sclerosis Brain MRI Segmentation Workflow deployment on the EGEE grid Multiple Sclerosis Brain MRI Segmentation Workflow deployment on the EGEE grid Erik Pernod 1, Jean-Christophe Souplet 1, Javier Rojas Balderrama 2, Diane Lingrand 2, Xavier Pennec 1 Speaker: Grégoire Malandain

More information

Functional MRI data preprocessing. Cyril Pernet, PhD

Functional MRI data preprocessing. Cyril Pernet, PhD Functional MRI data preprocessing Cyril Pernet, PhD Data have been acquired, what s s next? time No matter the design, multiple volumes (made from multiple slices) have been acquired in time. Before getting

More information

Multi-Sectional Views Textural Based SVM for MS Lesion Segmentation in Multi-Channels MRIs

Multi-Sectional Views Textural Based SVM for MS Lesion Segmentation in Multi-Channels MRIs 56 The Open Biomedical Engineering Journal, 2012, 6, 56-72 Open Access Multi-Sectional Views Textural Based SVM for MS Lesion Segmentation in Multi-Channels MRIs Bassem A. Abdullah*, Akmal A. Younis and

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Visual object classification by sparse convolutional neural networks

Visual object classification by sparse convolutional neural networks Visual object classification by sparse convolutional neural networks Alexander Gepperth 1 1- Ruhr-Universität Bochum - Institute for Neural Dynamics Universitätsstraße 150, 44801 Bochum - Germany Abstract.

More information

Multi-Atlas Segmentation of the Cardiac MR Right Ventricle

Multi-Atlas Segmentation of the Cardiac MR Right Ventricle Multi-Atlas Segmentation of the Cardiac MR Right Ventricle Yangming Ou, Jimit Doshi, Guray Erus, and Christos Davatzikos Section of Biomedical Image Analysis (SBIA) Department of Radiology, University

More information

Basic principles of MR image analysis. Basic principles of MR image analysis. Basic principles of MR image analysis

Basic principles of MR image analysis. Basic principles of MR image analysis. Basic principles of MR image analysis Basic principles of MR image analysis Basic principles of MR image analysis Julien Milles Leiden University Medical Center Terminology of fmri Brain extraction Registration Linear registration Non-linear

More information

Segmentation of Brain MR Images via Sparse Patch Representation

Segmentation of Brain MR Images via Sparse Patch Representation Segmentation of Brain MR Images via Sparse Patch Representation Tong Tong 1, Robin Wolz 1, Joseph V. Hajnal 2, and Daniel Rueckert 1 1 Department of Computing, Imperial College London, London, UK 2 MRC

More information