International Journal of Advanced Research in IT and Engineering, ISSN: 2278-6244, Impact Factor: 4.54

HADOOP: A NEW APPROACH FOR DOCUMENT CLUSTERING

Y.K. Patil*
Prof. V.S. Nandedkar**

Abstract: Document clustering is one of the important areas in data mining. Hadoop is used by companies such as Yahoo, Google, Facebook and Twitter for implementing real-time applications. Emails, social media blogs, movie review comments and books are typical inputs for document clustering. This paper focuses on document clustering using Hadoop, a technology for parallel processing of documents. The computing time for document clustering in Hadoop is lower than that of plain Java-based implementations. In this paper, the authors propose the design and implementation of the Tf-Idf, K-means and Hierarchical clustering algorithms on Hadoop.

Key words: Hadoop, Tf-Idf, K-means, Hierarchical clustering.

*M.E. student, PVPIT Pune
**HOD, IT dept., PVPIT Pune

Vol. 3 No. 7 July 2014 www.garph.co.uk IJARIE
I. INTRODUCTION

A large amount of text is available on the internet, and huge corpora of text need to be processed within a short period. Document clustering can be implemented in a general-purpose programming language with parallel execution, but this approach raises issues such as fault tolerance, inter-processor communication and task scheduling. Research on distributed document clustering based on MapReduce has been done [1]. Storing huge data and retrieving the text of a document in a distributed system is an alternative method [2]. In a distributed environment the task is divided and scheduled to an appropriate system. In this paper we describe document clustering based on Hadoop, which uses the MapReduce procedure to implement K-means and Hierarchical clustering.

Hadoop provides a distributed environment through the Hadoop Distributed File System (HDFS) and the MapReduce function. HDFS stores large data on a single node or multiple nodes, where one node is called the master node and all others are data (worker) nodes. The master node manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree, and this information is stored persistently on the local disk in the form of two files, the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the datanodes when the system starts. MapReduce is a framework in which documents are processed with Map and Reduce functions to achieve parallelism on a huge data set; it works like a divide-and-merge strategy. In the proposed system the input is a set of documents and the output is a set of partition clusters and tree-based (hierarchical) clusters. In the document pre-processing phase, Tf-Idf is determined on MapReduce; this Tf-Idf weight is important for identifying a document within the corpus.
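The divide-and-merge idea behind MapReduce can be sketched in plain Java without any Hadoop dependency. This is an illustrative word-count example, not the authors' implementation: each document is mapped independently to (word, 1) pairs (the parallel "divide" step), and the pairs are then merged per key (the "reduce" step). The class and method names are hypothetical.

```java
import java.util.*;
import java.util.stream.*;

// Minimal, Hadoop-free sketch of the Map/Reduce "divide and merge" strategy.
public class MapReduceSketch {

    // Map step: one document -> list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String document) {
        return Arrays.stream(document.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce step: merge all (word, 1) pairs into word -> total count.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    // Driver: map every document (this loop is what Hadoop parallelizes),
    // then reduce the intermediate pairs.
    static Map<String, Integer> wordCount(List<String> docs) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String d : docs) intermediate.addAll(map(d));  // "divide"
        return reduce(intermediate);                        // "merge"
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of(
                "data mining is knowledge discovery",
                "document clustering is data mining")));
    }
}
```

In a real Hadoop job the map calls run on different worker nodes and the framework groups the intermediate pairs by key before the reduce phase; the sequential loop above only stands in for that distribution.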
In this paper, the system architecture of the proposed system is first briefly explained. Then the algorithms for the implementation of K-means and hierarchical clustering are presented. Finally, it is concluded how well the proposed system suits document clustering.

II. RELATED WORK

Document clustering is the task of grouping a set of documents in such a way that documents in the same group (cluster) are more similar to each other than to those in other groups. Basically there are two main approaches to document clustering: Partition
clustering and Hierarchical clustering [3][4]. The authors Jiawei Han and Micheline Kamber have presented various clustering techniques for data mining, including how to use the K-means clustering algorithm to cluster a large data set into disjoint clusters [5]. The authors Benjamin C.M. Fung, Ke Wang and Martin Ester presented Hierarchical Document Clustering Using Frequent Itemsets, focusing on how to manage documents in hierarchical clusters [6]. The author Tian Xia presented An Improvement to TF-IDF: Term Distribution based Term Weight Algorithm, which discusses how to find Tf-Idf weights for document clustering [7]. Tom White's Hadoop: The Definitive Guide discusses the working of the Map and Reduce methods in the Hadoop system [8]. The Hadoop Distributed File System paper covers Hadoop implementation details for both single-node and multi-node deployments [9]. Our approach effectively combines the K-means and Hierarchical clustering algorithms.

III. SYSTEM ARCHITECTURE

Figure 1 shows the proposed system architecture. The input to the system is PDF documents. Hadoop uses the MapReduce function for parallel computing over the documents, and this parallel processing reduces the time complexity. In this architecture the documents are processed in parallel to find Tf-Idf. Before giving the input to our system we need to perform pre-processing and text processing of the collected documents, and we need to find the Tf-Idf weights as well as the cosine similarity.

A. Preprocessing Phase:

A.1 Parser Phase:

1. Parse the PDF file and store the PDF stream in a random access file created using the COSDocument constructor. The PDFParser class is used to parse the PDF documents; it takes a FileInputStream over a File as a parameter to parse a .pdf file.

PDFParser parser = new PDFParser(new FileInputStream(new File()));
COSDocument cosDoc = parser.getDocument();

2.
Extract the text from the random access file. In this process the PDFTextStripper class is used.

PDFTextStripper Stripper = new PDFTextStripper();
PDDocument Doc = new PDDocument(cosDoc);
String parsedText = Stripper.getText(Doc);

Figure 1: MapReduce programming model for the proposed system

3. Write the parsed text into a newly created text file.

A.2 Text Processing:

1. Read the input string from the text file and tokenize it. For example, given the string "Data mining is the process of knowledge discovery", we need to find the tokens. The tokens are Data, mining, is, the, process, of, knowledge, discovery; but is, the and of are stop words, so these words need to be removed. Some of the words are also converted to their root form, e.g. mining becomes mine.

2. Read the tokenized words one by one and remove punctuation such as commas and dots from the input line, e.g.
Input token: discovery, Output token: discovery
Input token: patterns. Output token: patterns
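The text-processing steps above (tokenize, drop stop words, strip punctuation, reduce words to a root form) can be sketched as follows. This is an illustrative sketch, not the authors' code: the stop-word list and the suffix-stripping rules are simplified assumptions standing in for a real stemmer.

```java
import java.util.*;

// Illustrative tokenizer + stop-word filter + crude stemmer for the
// "Data mining is the process of knowledge discovery" example above.
public class TextProcessing {

    // Assumed stop-word list; a real system would use a much larger one.
    static final Set<String> STOP_WORDS =
            Set.of("is", "the", "of", "a", "an", "in", "to", "and");

    // Very naive stemming rules, e.g. "mining" -> "mine", "patterns" -> "pattern".
    static String stem(String w) {
        if (w.endsWith("ning") && w.length() > 5)
            return w.substring(0, w.length() - 4) + "ne";
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3)
            return w.substring(0, w.length() - 1);
        return w;
    }

    static List<String> process(String line) {
        List<String> tokens = new ArrayList<>();
        for (String raw : line.toLowerCase().split("\\s+")) {
            String t = raw.replaceAll("\\p{Punct}", ""); // remove , . etc.
            if (t.isEmpty() || STOP_WORDS.contains(t)) continue; // drop stop words
            tokens.add(stem(t));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(process("Data mining is the process of knowledge discovery,"));
        // [data, mine, process, knowledge, discovery]
    }
}
```

A production pipeline would typically swap the `stem` method for an established algorithm such as the Porter stemmer; the rules above only exist to make the example self-contained.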
B. Determining Tf-Idf and Cosine Similarity:

The Tf-Idf weight and the cosine similarity are calculated as given in eq. (1) and eq. (2). Tf-Idf (term frequency-inverse document frequency) is a mechanism for weighting down terms that occur very frequently across the corpus. The cosine similarity is used to find how closely similar two documents are for the purpose of clustering. The idea of determining Tf-Idf and cosine similarity is given in [9].

    Tf-Idf = log(N / df_t)    (1)

Where:
N = number of documents in the corpus
df_t = number of documents in the corpus containing term t

    cos(x, y) = ( sum_{i=1..n} x_i * y_i ) / ( sqrt(sum_{i=1..n} x_i^2) * sqrt(sum_{i=1..n} y_i^2) )    (2)

Where:
x_i = the Tf-Idf weight of the i-th term in the first document
y_i = the Tf-Idf weight of the i-th term in the second document

IV. K-MEANS AND HIERARCHICAL AGGLOMERATIVE ALGORITHMS

Algorithm 1: K-Means Document Clustering Algorithm
Input: A set of PDF documents; K, the number of clusters.
Step 1: Identify the unique words in the given input documents.
Step 2: Generate the input vectors using TF-IDF weighting.
Step 3: Select a similarity measure for generating the similarity matrix; in this project cosine similarity is used.
Step 4: Specify the value of k, i.e. the number of clusters.
Step 5: Randomly select k documents and place one of the k selected documents in each cluster.
Step 6: Place the documents in clusters based on the similarity between the documents and the documents already present in the clusters.
Step 7: Compute the centroid of each cluster.
Step 8: Again using the similarity measure, find the similarity between the centroids and the input documents.
Step 9: Now place the documents in the clusters based on the similarity between the documents and the centroids of the clusters.
Step 10: After placing all the documents in the clusters, compare the previous iteration's clusters with the current iteration's clusters.
Step 11: If all the clusters contain the same documents in the previous and current iterations, terminate the algorithm; the clusters have been found.
Step 12: Else repeat from Step 7.
Output: Clusters in which each cluster contains similar documents.

Algorithm 2: Hierarchical Agglomerative Document Clustering Algorithm
Input: A number of PDF documents.
1. Create folders.
2. foreach (document) do {
       put it in an individual folder.
   } end foreach
3. foreach document of each folder do {
       Map(document)
       Calculate the TF-IDF weight value
   } end foreach
4. foreach calculated document do {
       Reduce(document)
       if (Tf-Idf matches with other documents) {
           Merge folders
       } end if
   } end foreach
5. Finally, merge all disjoint folders into a root folder.
Output: Hierarchically (agglomeratively) clustered documents.

Here we present the Hierarchical Agglomerative clustering algorithm, where the input is PDF documents. The documents are processed in parallel using the MapReduce function to determine the Tf-Idf. The initial clusters are chosen from the corpus and each cluster is kept in a separate folder. The detailed procedure for clustering is given in Algorithm 2.

V. EXPECTED RESULTS

The expected results of this project are:
1. TF-IDF values for each word of each file.
2. Partition clusters.
3. Hierarchical clusters.

VI. CONCLUSION AND FUTURE SCOPE

This paper has introduced document clustering based on the Hadoop distributed system. The contribution of this work is the design of the system architecture and the implementation of K-means and Hierarchical document clustering. We believe that our work is an example for the programming paradigm. Our work can be extended further to the clustering of email documents, Facebook
comments and movie review comments. Hadoop provides a platform on which to implement real-world problems and to reduce the computing time complexity.

VII. ACKNOWLEDGEMENT

I would like to give my sincere gratitude to our guide Prof. V.S. Nandedkar, who encouraged and guided me throughout this paper. I am especially grateful to H.O.D. Prof. Y.B. Gurav for his valuable guidance and encouragement. I also thank the PG co-ordinator, Prof. N.D. Kale, for his valuable guidance and for allowing me to use the college resources and facilities. Last but not the least, many thanks and deep regards to the Principal for his support and encouragement.

REFERENCES

[1] Jian Wan, Wenming Yu and Xianghua Xu, "Design and Implementation of Distributed Document Clustering Based on MapReduce," in Proc. 2nd ISCST, 2013, pp. 278-.
[2] Jingxuan Li, Bo Shao, Tao Li and Mitsunori Ogihara, "Hierarchical Co-Clustering: A New Way to Organize the Music Data," IEEE Trans., vol. 14, no. 2, pp. 471-481, 2012.
[3] Jingxuan Li, Bo Shao, "Hierarchical Co-Clustering: A New Way to Organize the Music Data," IEEE Trans., 2010.
[4] Eui-Hong Han and George Karypis, "Centroid-based Document Classification: Analysis and Experimental Results," 2001.
[5] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006, Elsevier Inc.
[6] Benjamin C.M. Fung, Ke Wang and Martin Ester, "Hierarchical Document Clustering Using Frequent Itemsets," SIAM, pp. 59-70, 2003.
[7] Tian Xia, "An Improvement to TF-IDF: Term Distribution based Term Weight Algorithm," Journal of Software, vol. 6, no. 3, March 2011.
[8] Tom White, Hadoop: The Definitive Guide, First Edition, O'Reilly Media, June 2009.
[9] Konstantin Shvachko, Hairong Kuang, "The Hadoop Distributed File System," IEEE, 2010.
[10] Chu, C.-T., S.K. Kim and Y.-A.
Lin, "Map-Reduce for Machine Learning on Multicore," in Proceedings of the Neural Information Processing Systems Conference (NIPS), 2007.
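As an illustrative appendix, eq. (1) and eq. (2) from Section III.B can be checked with a short self-contained sketch. The values are toy data, not results from the paper's corpus, and the class and method names are hypothetical.

```java
// Illustrative check of eq. (1), idf = log(N / df_t), and eq. (2),
// the cosine similarity of two Tf-Idf weight vectors.
public class TfIdfCosine {

    // Eq. (1): inverse document frequency of a term.
    // N = documents in the corpus, df = documents containing the term.
    static double idf(int totalDocs, int docsContainingTerm) {
        return Math.log((double) totalDocs / docsContainingTerm);
    }

    // Eq. (2): cosine similarity of two equal-length weight vectors.
    static double cosine(double[] x, double[] y) {
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot   += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
        // A term in every document carries no weight: log(10/10) = 0,
        // while a term in 1 of 10 documents gets a positive idf.
        System.out.println(idf(10, 10)); // 0.0
        System.out.println(idf(10, 1));

        // Vectors pointing the same way have similarity 1; orthogonal
        // vectors (no shared terms) have similarity 0.
        System.out.println(cosine(new double[]{1, 2, 0}, new double[]{2, 4, 0}));
        System.out.println(cosine(new double[]{1, 0}, new double[]{0, 1}));
    }
}
```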