A hybrid unsupervised and supervised clustering applied to microarray data

Size: px

Start display at page:

Download "A hybrid unsupervised and supervised clustering applied to microarray data"

Dwain Curtis
6 years ago
Views:

1 A hybrid usupervised ad supervised clusterig applied to microarray data Raul Măluţa, Pedro Gómez Vilda, Moica Borda Abstract Clusterig methods have bee ofte applied to large data with the mai purpose of reducig the dimesio, time computatio ad idetifyig clusters with similar behavior. This work presets a state-of-the-art i usupervised clusterig ad cluster validatio. It proposes a method for hybrid bi-clusterig of microarray data combied with a supervised validatio for determiig the optimal amout of clusters of gees. Keywords microarray, clusterig, iteral validatio, exteral validatio, gee shavig. I. INTRODUCTION The microarray data processig challeges owadays cosists of how to make it more reliable, easy to use ad efficiet. I this task other fields of kowledge as Sigal ad Image Processig, Patter Recogitio, or Statistical Data Aalysis [1] may help i yieldig their eormous potetial i solvig problems as microarray image ehacemet, segmetatio, correctio, griddig, data aalysis, reliable expressio estimatio i relatio with hybridizatio dyamics, etc. Others have to see with data iterpretatio, dimesioality reductio, cluster aalysis, etc. Clusterig techiques play a importat role i aalyzig high dimesioal data such as microarray data. A aalysis of microarray data is a search for gees that have similar, correlated patters of expressio. This idicates that some of the data might cotai redudat iformatio. For example, if a group of experimets were more closely related tha it was expected, some of the redudat experimets ca be igored, or use some average of the iformatio without loss of iformatio. Data aalysis methods [2] ca be grouped i two categories: supervised ad usupervised. I the Mauscript received Ortober 28, 2012, revised March 04, This work was supported by the project "Developmet ad support of multidiscipliary postdoctoral programmes i major techical areas of atioal strategy of Research - Developmet - Iovatio" 4D-POSTDOC, cotract o. POSDRU/89/1.5/S/52603, project co-fuded by the Europea Social Fud through Sectoral Operatioal Programme Huma Resources Developmet R. Maluta is with the Commuicatios Departmet, Techical Uiversity of Cluj-Napoca, George Baritiu St., Cluj-Napoca, Romaia, (phoe: ; fax: ; raul.maluta@com.utcluj.ro). P. Gómez Vilda is with Departameto de Arquitectura y Tecología de Sistemas Iformáticos (DATSI), Uiversidad Politécica de Madrid, Campus de Motegacedo, s/, 28660, Boadilla del Mote, Madrid, Spai ( pedro@pio.datsi.fi.upm.es). M. Borda is with the Commuicatios Departmet, Techical Uiversity of Cluj-Napoca, George Baritiu St., Cluj-Napoca, Romaia ( Moica.Borda@com.utcluj.ro). doi: /ijates.v2i3.21 usupervised approach, also kow as clusterig, data is orgaized without a priori iformatio. This paper describes several usupervised algorithms used mostly for microarray data, like k-meas, Partitioig Aroud Medoids (PAM) ad Expectatio-Maximizatio (EM). These algorithms have prove to be a useful whe the umber of clusters is kow or ca be estimated. Two of them, k-meas ad PAM, are based o miimizig the mea squared error, while the third, EM, o mixture modelig. The algorithms were ru o several data sets [3], observig that the quality of the obtaied clusters is depedet o the umber of clusters specified. To assess the effectiveess of these algorithms, but also to estimate the actual umber of clusters, several clusterig validatio method were implemeted. These methods cosist i calculatig iteral ad exteral idices used to estimate the optimal umber of clusters for each algorithm. The microarray data has a particularity compared with other type of data. A two dimesio data is i fact a gee expressio matrix which usually has the rows correspodig to gees from a experimet ad the colums correspodig to differet experimets. If oe fids that two rows are similar, it ca be assumed that the gees correspodig to the rows are co-regulated ad fuctioally related, ad by comparig two colums it ca foud which gees are differetially expressed i each experimet. To perform a compariso a similarity measure betwee the objects, gees or experimets uder compariso, has to be used. Mostly the same algorithm is used for aalyzig both the gees ad the experimets, i.e. bi-clusterig of microarray data. Cosiderig the dimesioality of the data, large amout of gees ad small umber of experimets, we propose i this paper a hybrid bi-clusterig, where we combie usupervised methods with the purpose of obtaiig the optimal combiatio of clusters. Besides, as a supplemetary validatio, a supervised method was used o the same data. Supervised classificatio represets the issue of idetifyig the subset to which ew observatios belog, where the idetity of the subset is ukow, o the basis of a traiig set of data cotaiig observatios whose subset is kow. Therefore the classificatio will display a variable behavior which ca be aalyzed by statistics. It is required for ew sample items to be placed ito the respective groups based o quatitative iformatio o oe or more measuremets, attributes or features ad based o the traiig set i which previously decided groupigs are already established. 99

2 II. CLUSTER ANALYSIS A. Usupervised algorithms The usupervised clusterig methods are divided i two categories: hierarchical ad o-hierarchical. Hierarchical methods group the objects i a iterative way, geeratig a hierarchical tree structure, also kow as dedogram. O the other had, the o-hierarchical clusterig, or partitioig methods, does the partitioig of a data set ito a predefied umber of differet clusters, without a hierarchical structure. Beside, these algorithms produce a iteger umber of partitios, ad also they optimize a certai criterio fuctio. The partitioig methods classify the data i k clusters which must fulfill the followig coditios: each group should cotai at least oe elemet; each object must belog to oe group oly. For the microarray data the most suitable clusterig methods are usupervised oes, because we caot observe the (real) umber of clusters i the data [4]. From the usupervised algorithms used i microarray data aalysis the k-meas, PAM (Partitioig Aroud Medoids) ad EM (Expectatio Maximizatio) are the most frequetly used. The first two from the list are usig the miimizatio of the root mea square error ad the third oe is usig the mixig models methods. The PAM algorithm [2] is based o the search for k represetative objects or medoids amog the objects of the dataset. These objects should represet the structure of the data. After fidig a set of k medoids, k clusters are costructed by assigig each object to the earest medoid. The goal is to fid k represetative objects which miimize the sum of the dissimilarities of the objects to their closest represetative object. The algorithm first looks for a good iitial set of medoids. The it fids a local miimum for the objective fuctio, that is, a solutio such that there is o sigle switch of a object with a medoid that will decrease the objective. The k-meas algorithm [5], a usupervised learig algorithm, has bee used to form clusters of gees i gee expressio data aalysis. The algorithm takes the umber of clusters (k) to be calculated as a iput. The umber of clusters is usually chose by the user. The procedure for k- meas clusterig is as follows: 1. First, the user tries to estimate the umber of clusters. 2. Radomly choose N poits ito k clusters. 3. Calculate the cetroid for each cluster. 4. For each poit, move it to the closest cluster. 5. Repeat steps 3 ad 4 util o further poits are moved to differet clusters. The Expectatio-Maximizatio (EM) algorithm [6] is a method for fidig maximum likelihood estimates of parameters i statistical models, where the model depeds o uobserved latet variables. This is a geeral method for optimizig likelihood fuctios ad is useful i situatios where data might be missig or simpler optimizatio methods fail. B. Usupervised clusterig validatio Clusterig validatio is a techique to fid a set of clusters that best fits atural partitios, i.e. umber of clusters, without ay class iformatio. There are two types of clusterig techiques [7], [8]: exteral validatio, based o previous kowledge about data ad iteral validatio, based o the iformatio itrisic to the data aloe. There are three exteral idexes which were used i our previous study [3], Rad idex, Jaccard coefficiet, ad Fowlkes ad Mallows idex, ad they are goig to be used also i this paper. I cotrast to exteral validatio, iteral validatio evaluates the clusterig without ay a priori iformatio. The obtaied values are kow as iteral idexes as they are computed from the data used for clusterig. I this paper we evaluate for the microarray data the followig iteral idexes: silhouette, Caliski-Harabasz idex, Krzaowski- Lai idex, Hartiga idex ad Davies-Bouldi idex. Silhouette idex calculates the silhouette width for each sample, average silhouette width for each cluster ad overall average silhouette width for a total data set: S i i ai a i,bi b, (1) max where a(i) is the average dissimilarity of i-object to all other objects i the same cluster; b(i) is the miimum of average dissimilarity of i-object to all objects i other cluster. The largest overall average silhouette idicates the best clusterig. Caliski-Harabasz idex is defied by: CH k B W k / k / k 1, (2) k where k deotes the umber of clusters, ad B(k) ad W(k) deote the betwee ad withi cluster sums of squares of the partitio, respectively. A optimal umber of clusters is the defied as a value of k that maximizes CH(k). Krzaowski-Lai idex is give by the equatio: k k 1 DIFF KL k, (3) DIFF 2 / p 2 / p where DIFF k ( k 1) Wk 1 ( k ) Wk ad p deotes the umber of features i the data set. A value of k is optimal if it maximizes KL(k). Davies-Bouldi idex is a fuctio of the ratio of the sum of withi-cluster scatter to betwee-cluster separatio: 1 DB S max i1 i j Qi SQ j SQ, Q i j, (4) where is the umber of clusters, S is the average distace of all objects from the cluster to their cluster cetre, S(Q i,q j ) is the distace betwee clusters cetres. Cosequetly, Davies-Bouldi idex will have a small value for a good clusterig. For the review, the mauscript is submitted to the IJATES 2 Editorial Board oly electroically i PDF format. Do ot chage ay formattig; otherwise you udergo the risk to be directly rejected without review process. C. Gee Shavig The method of Gee Shavig is desiged to extract coheret ad typically small clusters of gees that vary as much as 100

3 possible across the samples. Accordig to [9], the algorithm cosists of the followig steps: 1. Start with the etire expressio matrix X, each row cetered to have zero mea 2. Compute the leadig pricipal compoet of the rows of X 3. Shave off the proportio (typically 10 %) of the gees havig smallest absolute ier-product with the leadig pricipal compoet 4. Repeat steps 2 ad 3 util oly oe gee remais 5. This produces a ested sequece of gee clusters, where S k deotes a cluster of k gees. Estimate the optimal cluster size usig the gap statistic [10]. 6. Orthogoalize each row of X with respect to, the average gee i 7. Repeat steps 1-5 above with the orthogoalized data, to fid the secod optimal cluster. This process is cotiued util a maximum of M clusters are foud, where M is chose a priori. Whe implemetig this method we made some chagigs compared with the steps form [9]. But I have made some chages i the algorithm. First of all, we shaved off ot α % gees, but 1 gee each time. This is because we will lose the precisio of the algorithm if usig α%. For example, supposig we remai with S k clusters with k beig 135, 122,..., 53, 47, etc. gees, ad accordig to the gap statistic step, the algorithm decides to have a cluster with 52 or 47 gees, which is ot correct. This is the reaso why we have decided to have clusters with..., 53, 52, 51, 50, 49, 48,... gees, i this way hopig to obtai a maximum Gap(k) closer to ideal 50. Also whe computig the gap statistics we have made some chages. If we cosider all possible permutatios ad after that fidig each D(k), the, i case of 150 gees with 4 characteristics of each gee, it is required to have all possible permutatios of the matrix by permutig the elemets withi each row. This meas that each row has from 4 parameters a umber of 24 possible permutatios, this implies that we have aother sets of matrices, which is too much (we caot take for example 3 rd permutatio from gee 1 with 3 rd permutatio for gee 2, 3 rd permutatio for gee 3, ad so o, because we obtai the same D(k)). We have tried to cosider radom matrices, a umber of 5000, but the problem i this case is that the result is varyig at every other aalysis ad they are also differet at each simulatio. Fially we have decided to use just the first iput matrix ad get its D(k) ad Gap(k), without ay permutatios. A. Usupervised clusterig III. DATA CLUSTERING Three usupervised algorithms were used to cluster microarray data. We used for our study two differet datasets from public Affymetrix databases. The first set was the Chowdary database [11] the authors compared pairs of sap-froze ad RNA later preservative-suspeded tissue from 62 lymph ode-egative breast tumors ad 42 colo tumors, with purpose of separatig them. The secod set [12] cotais 24 acute lymphoblastic leukemia (ALL), 28 acute myelogeous leukemia (AML) ad 20 mixed-lieage leukemia (MLL) samples. For the microarray data the clusterig was doe by a twoway clusterig or bi-clusterig [13] i which both the samples ad the gees are clustered i the same time usig the portioig method. Regardig the coclusios from [3] a useful classificatio was obtaied for microarray data whe EM clustered the gees ad k-meas the samples. I this study we will use the k-meas algorithm to cluster the samples ad the PAM ad respectively the EM algorithm to cluster the gees. Before combiig the algorithms we applied the clusterig validatio methods, both exteral ad iteral idexes. Also, based o these results we combied the usupervised method with a supervised oe. So we clustered the samples by k-meas algorithm ad we classify the gees usig the gee shavig method. I Table I, the umbers of clusters obtaied after usig the iteral ad exteral idexes are idicated. For the k- meas algorithm the umber refers to sample partitioig, while for the PAM ad EM algorithms the umbers refers to gee clusterig. I the EM clusterig validatio oly the exteral idexes were computed. TABLE I THE NUMBER OF CLUSTERS OBTAINED WITH THE CLUSTERING VALIDATION METHODS Chowdary database Leukemia database Idex k- PAM EM k- PAM EM meas meas Rad Jaccard Fowlkes- Mallows Silhouette Caliski- Harabasz Krzaowski -Lai Davies- Bouldi After the validatio was doe we applied the combied clusterig for the datasets. Thereby, for the Chowdary dataset the gees were clustered ito 2 groups, with the PAM method, Fig 2.a, ad the with the EM algorithm, Fig.2.b. The samples were clustered with the k-meas algorithm. The obtaied values were compared with the give values from the microarray databases ad a similarity betwee these values was observed. Fig. 1.a ad 1.b shows all the computed idexes for the Chowdary database, iteral ad exteral, i the case of PAM algorithm. For the exteral idexes the highest values obtaied gave the optimal umber of clusters. I the case of iteral idexes the optimal value was marked by a square i Fig. 1.b. Accordig with the optimal umber of clusters idicated by the validatio idexes, i the case of the leukemia dataset the clusterig was doe i 3 clusters. Thus the samples were group ito three sets by the k-meas algorithm, while the gees formed also three groups oce with the PAM method 101

clusterig, EM clusterig of the gees for the Chowdary dataset.

4 Fig. 1. Values of the exteral idexes, iteral idexes determied for PAM clusterig of gees for Chowdary database Fig. 2. Microarray data clusterig by the usupervised algorithms: PAM clusterig, EM clusterig of the gees for the Chowdary dataset. Fig. 3.a Microarray data clusterig by the usupervised algorithms: PAM clusterig, EM clusterig of the gees for the leukemia dataset. 102

5 classify the Chowdary dataset i 2 clusters. The umber of gee i the first cluster was determied by usig the gap statistic method. The value of 89 gees was give by the maximum of the graph from Fig. 4. I a similar maer we computed the gap statistics for the leukemia dataset ad we were able to classify the data ito 3 clusters: the first oe with 37 gees, the secod oe with 21 gees ad the third oe with the remaiig 14 gees from a total of 72 gees. The values of the Gap fuctio for this dataset are show i Fig 5 ad 6. Fig. 4 The values of the Gap(k) fuctio for the Chowdary dataset. The maximum was obtaied for a umber of 89 gees. IV. CONCLUSION I this paper some usupervised ad supervised clusterig methods were applied to microarray dataset i order to obtai a bi-clusterig of the data with differet algorithms. The selectio of the PAM algorithm for clusterig the gees, despite of the k-meas method was doe because of the robustess of the first oe. The EM algorithm showed some disadvatages whe workig with large datasets that is way the data was previously filtered. Gee shavig, a supervised method was also combied with k-meas i order to classify the gees accordig with the umber of clusters give by the validatio methods. For cluster validatio we used both iteral ad exteral methods by computig some idexes which gave us the optimal umber of clusters for each clusterig method. As future work we take ito cosideratio the combiatio of these methods with other data miig algorithms like Idepedet Compoet Aalysis, which ca be used for large datasets. Fig. 5 The values of the Gap(k) fuctio for the leukemia dataset. For the first cluster the maximum was obtaied for a umber of 37 gees Fig. 6 The values of the Gap(k) fuctio for the leukemia dataset. For the secod cluster the maximum was obtaied for a umber of 21 gees ad the with the EM algorithm, as it ca be see i Fig. 3.a ad Fig. 3.b. For the gee shavig algorithm we used the same umber of clusters as the oe give by the validatio methods. I the case of Chowdary dataset the gees were classified i 2 sets, while i the case of leukemia dataset we applied gee shavig with 3 clusters. Oce applyig the gee shavig method we were able to REFERENCES [1] S. Gozález, L. Guerra, V. Robles, JM. Peña, F. Famili, CliDaPa: A ew approach to combiig cliical data with DNA microarrays Itelliget Data Aalysis Joural, vol 14(2), pp , 2010 [2] J. Ha, M. Kamber, Data miig: Cocepts ad techiques. Morga Kaufma, 2000 [3] R. Maluta, B. Belea, P.G. Vilda, M. Borda, Two way clusterig of microarray data usig a hybrid approach i Proc. of 34th It. Cof. o Telecommuicatios ad Sigal Processig, Budapest, 2011, pp [4] G. McLachla, K-A. Do, C. Ambroise, Aalyzig Microarray Gee Expressio Data. Wiley-Itersciece, 2004 [5] M.B. Zoubi, A. Hudaib, A. Hueiti, B. Hammo, New efficiet strategy to accelerate k-meas clusterig algorithm America Joural of Applied Scieces, vol 5(9), pp , 2008 [6] G. McLachla, T. Krisha, The EM Algorithm ad Extesios. Joh Willey & Sos 2008 [7] N. Bolshakova, F. Azuaje, Cluster validatio techiques for geome expressio data Sigal Processig vol 83, pp , 2002 [8] K. Wag, B. Wag, L. Peg, CVAP: Validatio for Cluster Aalyses Data Sciece Joural vol 8, pp , 2009 [9] T. Hastie et. al., Gee shavig as a method for idetifyig distict sets of gees with similar expressio patters, Geome Biology, vol. I(2), research 0003, pp. 1-21, 2000 [10] W. L. Martiez, A. R. Martiez, Exploratory Data Aalysis with MATLAB, CRC Press LLC, 2005 [11] D. Chowdary et al, Progostic gee expressio sigatures ca be measured i tissues collected i RNAlater preservative J Mol Diag vol 8(1), pp , 2006 [12] S. Armstrog et al, MLL traslocatios specify a distict gee expressio profile that distiguishes a uique leukemia Nature Geetics vol 30, pp , 2001 [13] S. C. Madeira, A. L. Oliveira, Biclusterig algorithms for biological data aalysis: A survey IEEE/ACM Trasactios o Computatioal Biology ad Bioiformatics vol 1 (1), pp ,

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history: