A New Approach to Web Data Mining Based on Cloud Computing

Regular Paper
Journal of Computing Science and Engineering, Vol. 8, No. 4, December 2014, pp. 181-186
http://dx.doi.org/10.5626/jcse.2014.8.4.181

Wenzheng Zhu* and Changhoon Lee
School of Computer Science, Konkuk University, Seoul, Korea
wenzheng@live.co.kr, chlee@konkuk.ac.kr

Abstract

Web data mining aims at discovering useful knowledge from various Web resources. There is a growing trend among companies, organizations, and individuals alike to gather information through Web data mining and to use that information in their best interest. Cloud computing is, broadly speaking, a synonym for distributed computing over a network: it relies on the sharing of resources to achieve coherence and economies of scale, similar to a utility over a network, and it makes it possible to run a program or application on many connected computers at the same time. In this paper, we propose a new system framework based on the Hadoop platform to collect useful information from Web resources. The system framework is based on the Map/Reduce programming model of cloud computing, and we propose a new data mining algorithm to be used in this framework. Finally, we demonstrate the feasibility of this approach by a simulation experiment.

Category: Smart and intelligent computing

Keywords: Web data mining; Cloud computing; Hadoop; Map/Reduce programming model

Received 12 May 2014; Revised 17 August 2014; Accepted 18 August 2014
*Corresponding Author

Copyright 2014 The Korean Institute of Information Scientists and Engineers (pISSN: 1976-4677, eISSN: 2093-8020). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

I. INTRODUCTION

We live and operate in a world of computing and computers. The Internet has drastically changed the computing landscape, from parallel computing to distributed computing, to grid computing, and now to cloud computing [1]. With the rapid development of Internet technology, the data on the Internet is growing exponentially, so finding and mining valuable information has become a hot area of research.

Web data mining [2] aims to discover useful information or knowledge from Web hyperlinks, page contents, and usage logs. Based on the primary kinds of data used in the mining process, Web data mining tasks can be categorized into three main types: Web structure mining, Web content mining, and Web usage mining. Web structure mining discovers knowledge from hyperlinks, which represent the structure of the Web. Web content mining extracts useful information or knowledge from Web page content. Web usage mining mines user access patterns from usage logs, which record the clicks made by every user.

Web mining is basically built on data mining techniques, but there are some differences. In traditional data mining, the data is often already collected and stored in a data warehouse. For Web data mining, data collection can be a substantial task, especially for Web structure and content mining, which involve crawling a large number of target Web pages. In this sense, Web data mining is an extended version of data mining.

As we observed, the Internet has now moved computing toward cloud computing. Map/Reduce is a powerful programming model in cloud computing [3] that was introduced by Google, and it is well suited to the execution of large distributed jobs in a cloud infrastructure. In brief, a Map/Reduce [4] computation executes as follows. Each map task is given one or more chunks from a distributed file system and turns its chunk into a sequence of key-value pairs; these pairs are written to local disk as intermediate files, partitioned into R (the number of reduce tasks) regions by the partitioning function. The locations of these regions are passed back to the master, which is responsible for forwarding them to the reduce tasks. Each of the R reduce tasks applies the reduction to one of these regions, so all key-value pairs with the same key wind up at the same reduce task. A reduce task works on one key at a time and combines all the values associated with that key in a user-defined way.

In this paper, we propose a Map/Reduce approach to realize Web data mining.

II. WEB DATA MINING

Web data mining techniques are the result of a long process of research and product development. Web data mining builds on knowledge from the Web; it aims to discover useful information or knowledge from the Web hyperlink structure, page contents, and usage data [5]. Although Web data mining uses many data mining techniques, it is not purely an application of traditional data mining, due to the heterogeneity and the semi-structured or unstructured nature of Web data. Many new mining tasks and algorithms have been invented in the past decade. Based on the primary kinds of data used in the mining process, Web data mining tasks can be categorized into three types, as shown in Fig. 1.

Fig. 1. Classification of Web data mining.

Web structure mining can be defined graphically: Web pages are represented as nodes, and hyperlinks are represented as edges, so the graph shows the relationship between the user and the Web. The motive of Web structure mining is to generate structured summaries about the information on Web pages; the summaries show the links from one Web page to another. Traditional data mining does not perform such tasks, because there is usually no link structure in a relational table.

Web content mining is the process of accessing and extracting information from the content of Web pages. Many pages on the Web are open to information access; these pages, along with search pages, constitute the content of the Web, and the accurate results returned to the user are the result pages of content mining. These tasks are similar to those in traditional data mining. However, we can also discover patterns in Web pages in order to extract useful data for many purposes, such as descriptions of products or postings in forums, and we can mine customer reviews and forum postings to discover consumer sentiment. These are not traditional data mining tasks.

Web usage mining is the discovery of meaningful patterns from data generated by client-server transactions on one or more Web sites, where a Web site is a collection of inter-related files on one or more Web servers. The data is generated automatically and stored in server access logs, referrer logs, agent logs, client-side cookies, user profiles, metadata, page attributes, page content, and site structure. One of the key issues in Web usage mining is the preprocessing of clickstream data in usage logs in order to produce the right data for mining.
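As an illustration of this preprocessing step, the short sketch below turns raw access-log lines into (host, page, timestamp) click records. It is a minimal example under stated assumptions: the Apache common log format and the helper name parse_line are ours, not part of the paper.

import re
from datetime import datetime

# Minimal clickstream preprocessor; the Apache common log format is assumed.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" (?P<status>\d{3}) \S+'
)

def parse_line(line):
    """Turn one raw log line into a (host, page, timestamp) click record."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # malformed lines are dropped during cleaning
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("host"), m.group("page"), ts

sample = '127.0.0.1 - - [10/Oct/2014:13:55:36 +0900] "GET /index.html HTTP/1.1" 200 2326'
print(parse_line(sample))  # ('127.0.0.1', '/index.html', datetime(2014, 10, 10, 13, 55, 36, ...))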
III. CLOUD COMPUTING AND THE MAP/REDUCE MODEL

Hadoop is an open-source distributed computing framework [6] that is used for the distributed processing of large data sets and is designed to scale from a single server to clusters of thousands of servers. Hadoop has been the most widely used cloud computing platform in recent years and has been adopted by major Internet companies and research institutions [7]. A Hadoop cluster is composed of two parts: the Hadoop distributed file system and Map/Reduce. Hadoop is therefore the optimal choice for realizing our approach. The Hadoop and Map/Reduce communities have developed a powerful framework for performing predictive analytics against complex distributed information sources [8], so the simulation experiment in this paper is designed on top of it.

A. Cloud Computing

Cloud computing is a new term for the long-held dream of computing as a utility [9], which has recently emerged as a commercial reality. Cloud computing refers both to applications delivered as services over the Internet and to the hardware and system software in the data centers that provide those services.

The main objective of cloud computing is to make better use of distributed resources and to solve large-scale computation problems [10]. For example, cloud computing can focus the power of thousands of computers on one problem, enabling researchers to do their work faster than ever. For Web data mining, we use this capability of cloud computing to handle mass data.

B. Map/Reduce Architecture

Map/Reduce is a programming model for processing large data sets [11] that was originally proposed by Google [12]. The framework is designed to orchestrate work across distributed nodes and to run various computational tasks in parallel while providing redundancy and fault tolerance. Distributed and parallelized computation is the key mechanism that makes the Map/Reduce framework attractive in a wide range of application areas, including data mining, bioinformatics, and business intelligence, and it is becoming increasingly popular in cloud computing.

The Map/Reduce programming model is used for the parallel and distributed processing of large data sets on clusters [13]. There are two basic procedures in Map/Reduce: Map and Reduce. Fig. 2 shows an execution overview. Typically, the input and output are both in the form of key-value pairs. After the input data is partitioned into splits of appropriate size, the map procedure takes a series of key-value pairs and generates processed key-value pairs, which are passed to a particular reducer by a certain partition function; later, after data sorting and shuffling, the reduce procedure iterates through the values associated with a specific key and produces zero or more outputs.

Fig. 2. Execution overview of Map/Reduce.
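To make the key-value flow concrete, the following single-process sketch mimics the map, partition, shuffle/sort, and reduce phases described above. It is only a stand-in for the distributed execution: the word-count job, the hash partitioner, and R = 2 reduce tasks are illustrative assumptions, not part of the paper's system.

from collections import defaultdict

R = 2  # number of reduce tasks (illustrative)

def map_task(chunk):
    """Map: turn a chunk of text into (word, 1) key-value pairs."""
    for word in chunk.split():
        yield word.lower(), 1

def partition(key):
    """Partition function: route a key to one of R regions."""
    return hash(key) % R

def reduce_task(key, values):
    """Reduce: combine all values associated with one key."""
    return key, sum(values)

def run_job(chunks):
    # Map phase: each chunk's pairs are written into R partitioned regions.
    regions = [defaultdict(list) for _ in range(R)]
    for chunk in chunks:
        for key, value in map_task(chunk):
            regions[partition(key)][key].append(value)
    # Reduce phase: every key lands in exactly one region and is
    # processed one key at a time, after sorting (the shuffle stand-in).
    results = {}
    for region in regions:
        for key in sorted(region):
            k, v = reduce_task(key, region[key])
            results[k] = v
    return results

print(run_job(["web data mining", "cloud computing and web mining"]))
# e.g. {'and': 1, 'web': 2, 'mining': 2, ...}; 'web' and 'mining' each appear twice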
IV. SYSTEM FRAMEWORK

In this paper, we build a new system framework to implement Web data mining. First, after the data is collected from the Web, the mass data must be filtered [14], cleaned, transformed, and combined into XML files, and the files are then saved on the distributed data nodes. Each file is divided into small fixed-size blocks, which are copied and stored on different cluster node disks for backup. In this work, we specify that each file is copied twice and that the two copies are stored on two different cluster nodes. This system framework addresses the general problems of data loss during storage capacity expansion and of server failure under Web data mining workloads [15]. Second, the master is responsible for controlling the entire workflow and creating the attached tasks on idle data nodes; each data node reports its status and results to the master, and the master then combines all the results for the client.

A. System Structure Overview

In our system framework there are five types of nodes: client, master, name node, algorithm node, and data node, as shown in Fig. 3.

Fig. 3. System structure.

The client, as a user, submits a task and receives a result. The master controls the whole workflow, creates the attached tasks, invokes the Map/Reduce scheduler to assign tasks to the data nodes, and regulates client access to the data. The name node divides each XML file into 64 MB fixed-size blocks assigned to idle data nodes, sends the IP addresses of those data nodes to the master, copies and stores the blocks onto other data nodes, and maintains a mapping table (the information of data blocks mapped onto data nodes) in order to process write and read requests from the client. In this sense, the name node stores the metadata of each XML file; the actual data is not stored on the name node, which holds just the addresses of the XML file blocks, the information about file copies, and so on. The algorithm node stores the algorithms, serves requests from the master, and sends the appointed algorithm to the data nodes. The data node stores the data blocks of XML files on its local disk and executes instructions.
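A small sketch of the name node's bookkeeping may help here. It builds the mapping table for one file under the framework's two rules (64 MB fixed-size blocks, two copies on two different nodes); the round-robin placement and all names (place_blocks, dn1, pages.xml) are illustrative assumptions, not the paper's implementation.

import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size blocks
REPLICAS = 2                   # each block is stored on two different nodes

def place_blocks(file_name, file_size, data_nodes):
    """Build a name-node-style mapping table: block id -> hosting data nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    node_cycle = itertools.cycle(data_nodes)
    mapping = {}
    for i in range(n_blocks):
        # Consecutive draws from the cycle are distinct nodes whenever
        # the cluster has more than one data node.
        replicas = [next(node_cycle) for _ in range(REPLICAS)]
        mapping[f"{file_name}#blk{i}"] = replicas
    return mapping

table = place_blocks("pages.xml", 200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in table.items():
    print(block, "->", nodes)  # pages.xml#blk0 -> ['dn1', 'dn2'], ...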

A data node periodically reports its status (idle, in progress, or completed) through a heartbeat message and asks the name node for instructions. The heartbeat also helps the name node detect connectivity with its data nodes: if the name node does not receive a heartbeat from a data node within the configured period of time, it marks the node as down, considers the data blocks stored on that node lost, and automatically replicates those blocks onto other data nodes.

In this system framework, the data processing is executed on the data nodes where the data is stored, rather than transferring the data to the master; the master just receives the results after the Reduce stage. This means that mass data flow is not transferred over the network, and as a result a lot of time can be saved.

B. A New Data Mining Algorithm

An ordinary association rule mining algorithm has two steps: generate all frequent item sets, and then generate all confident association rules from the frequent item sets. The crucial part is computing the frequent item sets. The standard method proceeds level by level: first generate the frequent 1-item set L_1, then the frequent 2-item set L_2, and continue until some value of k makes L_k the empty set. To obtain L_k, the candidate set C_k is computed from L_(k-1); then, by checking every item of C_k, we keep the items that satisfy the client-defined minimum support threshold, and these form L_k. But since Web data is mass data, determining C_k takes a great deal of time and space. So in this section, we present a new algorithm that computes all frequent item sets with high efficiency.

We use the processing method presented in [7] to obtain the processed document d_i. The document is the content input, and the set of blocks is <client, doc_id1>. By counting the frequency of each feature in the document, the output can be written as <t, <n, f>>. After the first Map/Reduce pass, the key-value pair becomes (key_1, value_1), with key_1 = term_1 and value_1 = <n_1, f_1><n_2, f_2>..., where the key is the critical value and the value is the local frequent item set. This set is then fed to Map/Reduce again as the data input. In the second pass, using the local frequent item sets computed in the first step, we compute the second result (key_2, value_2), with key_2 = term_2 and value_2 = <n'_1, f'_1><n'_2, f'_2>..., which yields an improved local frequent item set. Then, after the third pass, we can compute the global frequent item set. This algorithm improves the efficiency of data mining. In this algorithm, we use the following formula:

    T_ij = F_ij * e^(log(n/t))

where T_ij is the threshold of the present data mining pass, F_ij is the threshold of the past data mining pass, n is the number of features extracted in the document, t is the frequency of feature extraction, and e^(log(n/t)) is the ratio of the last frequent item set.
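The update is easy to evaluate numerically. A small sketch follows; note that if log denotes the natural logarithm (an assumption, since the paper does not state the base), the factor e^(log(n/t)) reduces to n/t, so the previous threshold is simply rescaled by that ratio.

import math

def next_threshold(f_ij, n, t):
    """Threshold update T_ij = F_ij * e^(log(n/t)), natural log assumed."""
    return f_ij * math.exp(math.log(n / t))  # equals f_ij * (n / t)

# Worked example: previous threshold F_ij = 0.3, n = 40 extracted features,
# feature frequency t = 8, so n/t = 5 and T_ij = 0.3 * 5 = 1.5.
print(next_threshold(0.3, 40, 8))  # ~1.5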
C. Algorithm Implementation

The implementation of the algorithm we presented can be described as follows.

First pass, Map:
  for all term1 ∈ client.doc do
    Frequency(t) = Frequency(t) + 1
    Output(term1, record<client, doc_id1, Frequency(t)>)

First pass, Reduce:
  Input term1, record(<n_1, f_1><n_2, f_2>...)
  Build list R
  for all <n, f> ∈ record(<n_1, f_1><n_2, f_2>...) do
    Append(R, <n, f>)
  Find(R)
  Output(term1, record R)

Second pass, Map:
  for all term2 ∈ term1 do
    for <client_n, Frequency(t)> ∈ record R do
      h = h + 1
      T(t, n) = Frequency(t) * e^(log(n/t))
      Append(term2, T(<n_1, t_1><n_2, t_2>...))

Second pass, Reduce:
  for <client.doc1, T(t)> ∈ T(W) do
    if T(W) > Y then
      Append(T_new<n, T(t, n)>)
  Find(T_new)

Third pass, Map:
  for all term3 ∈ term2 do
    for <client, Frequency(t)> ∈ record R do
      h = h + 1
      T(t, n) = Frequency(t) * e^(log(n/t)) * e^(log(n/t))
      Append(term3, T(<n_1, t_1><n_2, t_2>...))

Third pass, Reduce:
  for <client.doc1, T(t)> ∈ T(W) do
    if T(W) = Y then
      Find(0)

Through these three applications of Map/Reduce, we reach the minimum support threshold. Finally, the association rules can be generated using the threshold, and the result is delivered to the client.
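The pseudocode leaves several details implicit (the list R, the cutoff Y, the Find step, and the counter h). The sketch below is therefore only our reading of the three-pass flow on a single machine, not the authors' implementation: pass 1 builds local term frequencies per document, and the later passes rescale them with the e^(log(n/t)) factor (natural log assumed) and filter against an illustrative cutoff Y. All names (first_pass, threshold_pass, Y) are ours.

import math
from collections import Counter

Y = 4.0  # client-defined cutoff used in the reduce steps (illustrative)

def first_pass(docs):
    """Pass 1: local item sets, term -> list of (doc_id, t, n), where t is
    the term's frequency and n the number of features in the document."""
    table = {}
    for doc_id, text in docs.items():
        features = text.split()
        for term, t in Counter(features).items():
            table.setdefault(term, []).append((doc_id, t, len(features)))
    return table

def threshold_pass(table, prev):
    """Passes 2 and 3: update each term's threshold with
    T = F * e^(log(n/t)) and keep terms whose threshold exceeds Y."""
    thresholds = {}
    for term, postings in table.items():
        f = prev.get(term, 1.0)  # past threshold, seeded to 1.0
        for _doc_id, t, n in postings:
            f = f * math.exp(math.log(n / t))  # = f * n / t
        if f > Y:
            thresholds[term] = f
    return thresholds

docs = {"doc1": "web data mining on the web", "doc2": "cloud data mining"}
local = first_pass(docs)                               # first pass
second = threshold_pass(local, {})                     # second pass: local thresholds
third = threshold_pass(                                # third pass: global item set
    {t: p for t, p in local.items() if t in second}, second)
print(sorted(third))  # e.g. ['data', 'mining', 'on', 'the']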

V. SIMULATION EXPERIMENT

In this section, we design a simulation experiment to demonstrate the feasibility of the Web data mining approach we have presented. The testing environment is a local area network; the platform is Hadoop [9], and it consists of seven computers. Each computer is configured with an Intel Core Duo 2.7 GHz CPU, 2 GB of DDR3 memory, and the Linux operating system. One of the computers is assigned to be the master, one the name node, one the algorithm node, and the others data nodes. The name node divides the data into 10 sub-files and copies and stores the files onto idle data nodes. The algorithm node stores the algorithms and sends the appointed algorithm to the data nodes. The master controls the whole workflow. The function of each node was presented in Section IV above.

First test: the experiment implements traditional data mining, and we record the execution time.

Second test: the experiment implements data mining using the three applications of Map/Reduce that we have presented, and we again record the execution time.

Finally, we compare the resource consumption and execution time of the two tests. The experimental result is shown in Fig. 4.

Fig. 4. Experimental result.

VI. CONCLUSION

Comparing the two tests, the experimental result shows that this new approach can improve execution efficiency and reduce execution time. The new algorithm works well, and no association rules are lost, so it can be put to good use in business. In future work, we aim to find an even more accurate and faster approach to Web data mining, also based on cloud computing, and we will keep improving the algorithm we have presented.

ACKNOWLEDGMENTS

This research would not have been possible if we had not been provided with the opportunity to use the computer lab at Konkuk University, Seoul. We would therefore like to sincerely thank all the staff and lecturers who helped conduct this research and who made this paper possible.

REFERENCES

1. M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the clouds: a Berkeley view of cloud computing," Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep. UCB/EECS-2009-28, 2009.
2. C. H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan, "A survey of Web information extraction systems," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1411-1428, 2006.
3. Wikipedia, "Cloud computing," http://en.wikipedia.org/wiki/Cloud_computing.
4. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating System Design and Implementation, San Francisco, CA, 2004, pp. 137-150.
5. R. Cooley, B. Mobasher, and J. Srivastava, "Web mining: information and pattern discovery on the World Wide Web," in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, 1997, pp. 558-567.
6. Hadoop, http://hadoop.apache.org.
7. Y. Tao, W. Lin, and X. Xiao, "Minimal MapReduce algorithms," in Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, 2013, pp. 529-540.
8. M. J. Fischer, X. Su, and Y. Yin, "Assigning tasks for efficiency in Hadoop: extended abstract," in Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, Santorini, Greece, 2010, pp. 30-39.
9. W. W. Lin, "An improved data placement strategy for Hadoop," Journal of South China University of Technology: Natural Science, vol. 40, no. 1, pp. 152-158, 2012.
10. C. Gong, J. Liu, Q. Zhang, H. Chen, and Z. Gong, "The characteristics of cloud computing," in Proceedings of the 39th International Conference on Parallel Processing, San Diego, CA, 2010, pp. 275-279.
11. D. Jiang, B. C. Ooi, L. Shi, and S. Wu, "The performance of MapReduce: an in-depth study," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 472-483, 2010.
12. X. L. Lu and J. M. He, "Study on cloud storage model of Map/Reduce-based index data," Journal of Ningbo University, vol. 24, no. 3, pp. 29-33, 2011.
13. R. Lämmel, "Google's MapReduce programming model - revisited," Science of Computer Programming, vol. 70, no. 1, pp. 1-30, 2008.
14. M. S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, 1996.
15. Z. Bar-Yossef and S. Rajagopalan, "Template detection via data mining and its applications," in Proceedings of the 11th International Conference on World Wide Web, Honolulu, HI, 2002, pp. 580-591.

Wenzheng Zhu

Wenzheng Zhu received his B.S. degree from the Department of Computer Science of Wonkwang University, Korea, and his M.S. degree from the Department of Computer Science of Konkuk University, Korea, where he is currently working towards his Ph.D. degree. His research interests include cloud computing, OS, and information security.

Changhoon Lee

Changhoon Lee is a professor in the Department of Computer Science at Konkuk University. He received his B.S. degree from the Department of Mathematics of Yonsei University, Korea, and his M.S. and Ph.D. degrees from the Department of Computer Science of KAIST, Korea. His research interests are in the areas of AI, OS, and information security.