Near Duplicate URL Detection for Removing Dust Unique Key

Volume-7, Issue-5, September-October 2017
International Journal of Engineering and Management Research
Page Number:

R. Vijaya Santhi
Research Scholar, Department of Computer Science, Tamil University College, Thanjavur, Tamil Nadu, INDIA

ABSTRACT
Conventional parallel algorithms for mining frequent itemsets aim to balance load by partitioning data equally among a group of computing nodes. However, existing parallel frequent itemset mining algorithms have serious performance issues. In big data environments, they suffer from high communication and mining overhead induced by redundant data transmitted among computing nodes. We address this problem by developing a data partitioning approach using the MapReduce programming model. The aim of this paper is to enhance the performance of parallel frequent itemset mining on Hadoop clusters. The proposed model incorporates a similarity metric and the Locality-Sensitive Hashing technique, and applies a VUK (Valid Unique Key) DUST-removal technique to data mined with LDA-CRATS. The approach derives quality rules by taking advantage of a multi-sequence alignment strategy. It demonstrates that a full multi-sequence alignment of URLs with duplicated content, performed before the rules are generated, can lead to the deployment of very effective rules. In our evaluation, the method achieved larger reductions in the number of duplicate URLs than our best baseline, with gains of 85 to percent in two different web collections.

Keywords-- MapReduce, URL duplicated, multisequence, Hadoop

I. INTRODUCTION
The DUST problem is this: the web is abundant with DUST, that is, Different URLs with Similar Text. For example, the URLs and return similar content. A single web server often has multiple DNS names, and any of them can be typed in the URL. Many are artifacts of a particular web server implementation.
For example, URLs of dynamically generated pages often include parameters; which parameters impact the page's content is up to the software that generates the pages. Some sites use their own conventions; for example, a forum site we studied allows accessing story number num both via the URL and via num. Our study of the CNN web site has discovered that URLs of the form get redirected to. Universal rules, such as adding or removing a trailing slash, are used in order to obtain some level of canonization. By knowing dust rules, one can dramatically reduce the overhead of this process. But how can one learn about site-specific dust rules?

Detecting dust from a URL list. Most of our work therefore focuses on substring substitution rules, which are similar to the replace function in many editors. Dust Buster uses three heuristics, which together are very effective at detecting likely dust rules and distinguishing them from false rules. The first heuristic is based on the observation that if a rule α → β is common in a web site, then we can expect to find in the URL list multiple examples of pages accessed both ways. For example, in the site where story?

Copyright Vandana Publications. All Rights Reserved.

II. LITERATURE SURVEY

2.1 URL NORMALIZATION FOR DE-DUPLICATION OF WEB PAGES
The presence of duplicate documents in the World Wide Web adversely affects crawling, indexing, and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these learnt rules for de-duplication using just URL strings, without fetching the content explicitly. Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract specific rules from URLs belonging to each cluster. Preserving each mined rule for de-duplication is not efficient due to the large number of specific rules.
We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site-specific URL conventions. We demonstrate the effectiveness of our techniques through experimental evaluation.

2.2 DO NOT CRAWL IN THE DUST: DIFFERENT URLS WITH SIMILAR TEXT
We focus on URLs with similar contents rather than identical ones, since different versions of the same document are not always identical; they tend to differ in insignificant ways, e.g., counters, dates, and advertisements. Likewise, some URL parameters impact only the way pages are displayed (fonts, image sizes, etc.) without altering their contents.

Detecting DUST from a URL list. Contrary to initial intuition, we show that it is possible to discover likely dust rules without fetching a single web page. We present an algorithm, Dust Buster, which discovers such likely rules from a list of URLs. Such a URL list can be obtained from many sources, including a previous crawl or web server logs. The rules are then verified (or refuted) by sampling a small number of actual web pages. At first glance, it is not clear that a URL list can provide reliable information regarding dust, as it does not include actual page contents.

2.3 THE TREC 2006 TERABYTE TRACK
The primary goal of the Terabyte Track is to develop an evaluation methodology for terabyte-scale document collections. In addition, we are interested in efficiency and scalability issues, which can be studied more easily in the context of a larger collection. TREC 2006 is the third year for the track. The track was introduced as part of TREC 2004, with a single adhoc retrieval task. For TREC 2005, the track was expanded with two optional tasks: a named page finding task and an efficiency task. These three tasks were continued in 2006, with 20 groups submitting runs to the adhoc retrieval task, 11 groups submitting runs to the named page finding task, and 8 groups submitting runs to the efficiency task. This report provides an overview of each task, summarizes the results, and outlines directions for the future. Further background information on the development of the track can be found in the 2004 and 2005 track reports.
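The substring substitution rules and the first heuristic described for Dust Buster (a rule α → β is plausible when the URL list contains many pages reachable both ways) can be illustrated with a small sketch. This is not the Dust Buster algorithm itself: the function name `rule_instances`, the example URLs on `example.com`, and the rule `story?id=` → `story_` are hypothetical, and the verification step that samples real pages is omitted.

```python
def rule_instances(urls, alpha, beta):
    """Return the pairs (u, v) from the URL list that support the rule alpha -> beta.

    A pair supports the rule when v is obtained from u by replacing one
    occurrence of the substring alpha with beta, and both u and v appear
    in the URL list. No page content is fetched.
    """
    url_set = set(urls)
    instances = set()
    for u in url_set:
        start = u.find(alpha)
        while start != -1:
            # Apply the substitution at this occurrence only.
            v = u[:start] + beta + u[start + len(alpha):]
            if v != u and v in url_set:
                instances.add((u, v))
            start = u.find(alpha, start + 1)
    return instances


if __name__ == "__main__":
    # Hypothetical URL list, e.g. taken from a previous crawl or a server log.
    urls = [
        "http://example.com/story?id=1",
        "http://example.com/story_1",
        "http://example.com/story?id=2",
        "http://example.com/story_2",
        "http://example.com/about",
    ]
    for pair in sorted(rule_instances(urls, "story?id=", "story_")):
        print(pair)
```

A rule with many such instances is only a likely dust rule; as the text notes, it still has to be verified or refuted by sampling a small number of actual pages.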
2.4 THE RESEARCH OF WEB PAGE DE-DUPLICATION BASED ON WEB PAGES RESHIPMENT STATEMENT
The web page de-duplication module is an important part of a search engine system. It improves the system's performance and quality by filtering the web pages downloaded by the search engine's crawler and eliminating duplicated web pages. With the development of the Internet, search engines are playing an increasingly important role, while netizens place more requirements on search techniques. However, the reshipment of important news and other information among different websites causes a large number of duplicated web pages in retrieval results.

2.5 MODELS AND ALGORITHMS FOR DUPLICATE DOCUMENT DETECTION
As information management and networking technologies continue to proliferate, document image databases are growing rapidly in size and importance. A key problem facing such systems is determining whether duplicates already exist in the database when a new document arrives. This is challenging both because of the various ways a document can become degraded and because of the many possible interpretations of what it means to be a duplicate.

2.6 FINDING NEAR REPLICAS OF WEB PAGES BASED ON FOURIER TRANSFORM
Removing duplicated web pages can improve search accuracy and reduce data storage space. In this paper, each character is mapped to a semantic value by a Karhunen-Loeve (K-L) transform of the relationship matrix, and each document is then transformed into a series of discrete values. By applying a Fourier transform to this series, each web page is expressed as several Fourier coefficients, and the similarity between two web pages is calculated from these coefficients. Experimental results show that this method can find similar web pages efficiently.

III. SYSTEM ANALYSIS

3.1 PROBLEM DEFINITION
We view URLs as strings over an alphabet Σ of tokens. Tokens are either alphanumeric strings or non-alphanumeric characters.
In addition, we require every URL to start with the special token ^ and to end with the special token $ (^ and $ are not included in Σ). For example, the URL is represented by the following sequence of 15 tokens: ^, http, :, /, /, www, ., site, ., com, /, index, ., html, $. We denote by U the space of all possible URLs. A URL u is valid if its domain name resolves to a valid IP address and its contents can be fetched by accessing the corresponding web server (the HTTP return code is not in the 4xx or 5xx series). If u is valid, we denote by doc(u) the returned document.

DUST: Two valid URLs u1, u2 are called dust if their corresponding documents, doc(u1) and doc(u2), are similar.

DUST RULES: In this thesis, we seek general rules for detecting when two URLs are dust. A dust rule φ is a relation over the space of URLs. φ may be a many-to-many relation. Every pair of URLs belonging to φ is called an instance of φ. The support of φ, denoted support(φ), is the collection of all its instances.

3.2 PROPOSED SYSTEM
The proposed technique applies a novel URL deduplication technique to URL data and also prioritizes user results based on their geo-social data. URL deduplication is performed through a MapReduce algorithm that removes duplicates by eliminating entries of the same type and the same set of data within a particular group of URL data. Geo-social data are mined through the LDA-CRATS technique to find the type of data the user is searching for. This makes the results more accurate than those of previously suggested works, because previous works are based on user click data and frequent mining techniques, which do not perform well on Hadoop architectures due to their inability to process large datasets.
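Returning to the token representation defined in Section 3.1, the following minimal sketch splits a URL into alphanumeric strings and single non-alphanumeric characters and adds the ^ and $ delimiters. The function name `tokenize_url` and the regular expression are assumptions made for illustration; the example URL string is simply the concatenation of the 15 tokens listed in the text.

```python
import re

# An alphanumeric run is one token; any other single character is its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")


def tokenize_url(url):
    """Return the token sequence of a URL, delimited by the special tokens ^ and $."""
    return ["^"] + TOKEN_RE.findall(url) + ["$"]


if __name__ == "__main__":
    tokens = tokenize_url("http://www.site.com/index.html")
    print(len(tokens), tokens)
```

With this representation, a substring substitution rule operates on token sequences rather than raw characters, which keeps a rule from matching in the middle of a word.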

MERITS:
Web-crawler performance increases, since MapReduce is used to remove duplicate URLs before mining.
Response time improves, so users get accurate results for their queries sooner.
Multiple reduction of alignment reduces the mapping processes between nodes with high impact and edge weight of the arbitrary density point, which gives better results for users.

IV. METHODOLOGY

ARCHITECTURE DIAGRAM

4.1 URL WEB DATA CATEGORIZING
URL web data categorizing is the process of categorizing URL data according to the type of data a site provides and the type of website it is. For example, Facebook, Twitter, and Orkut are social media websites and are categorized as such, while Snapdeal, Amazon, and Flipkart are online business websites and are categorized as business websites. This categorization makes data in the URL clusters easier to access, which simplifies and speeds up searching URLs in a large cluster of data.

4.2 APPLYING MAP REDUCE
MapReduce is a popular data processing paradigm for efficient and fault-tolerant workload distribution in large clusters. A MapReduce computation has two phases, namely the Map phase and the Reduce phase. The Map phase splits the input data into a large number of fragments, which are evenly distributed to Map tasks across a cluster of nodes for processing. Each Map task takes in a key-value pair and then generates a set of intermediate key-value pairs. After the MapReduce runtime system groups and sorts all the intermediate values associated with the same intermediate key, the runtime system delivers the intermediate values to Reduce tasks. Each Reduce task takes in all intermediate pairs associated with a particular key and emits a final set of key-value pairs. MapReduce applies the main idea of moving computation towards data, scheduling Map tasks on the closest nodes where the input data is stored in order to maximize data locality. Hadoop is one of the most popular MapReduce implementations.
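The Map and Reduce phases described above can be imitated in a single process to show how duplicate URLs end up under one intermediate key. This is a sketch, not Hadoop code and not the paper's implementation: the normalization used as the key (lower-cased host, trailing slash removed), the choice of the lexicographically smallest URL as the survivor, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_phase(url):
    """Map task: emit (intermediate key, URL). The key is a hypothetical normalized form."""
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    yield key, url


def shuffle(pairs):
    """Runtime step: group intermediate values by key and sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, urls):
    """Reduce task: keep a single surviving URL per key."""
    yield key, min(urls)


def run_job(urls):
    """Run map, shuffle, and reduce in sequence over an in-memory URL list."""
    intermediate = [pair for url in urls for pair in map_phase(url)]
    return [out for key, values in shuffle(intermediate)
            for out in reduce_phase(key, values)]


if __name__ == "__main__":
    # Hypothetical input: the first two URLs differ only by a trailing slash.
    for key, survivor in run_job([
        "http://example.com/a/",
        "http://example.com/a",
        "http://example.com/b",
    ]):
        print(key, "->", survivor)
```

In a real cluster the map and reduce functions would run as distributed tasks and the grouping would be done by the MapReduce runtime; only the data flow is reproduced here.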
Both input and output pairs of a MapReduce application are managed by an underlying Hadoop distributed file system (HDFS). At the heart of HDFS is a single NameNode, a master server that manages the file system namespace and regulates file accesses. The Hadoop runtime system establishes two processes called JobTracker and TaskTracker. The JobTracker is responsible for assigning and scheduling tasks; each TaskTracker handles the mappers or reducers assigned by the JobTracker.

The MapReduce algorithm used here has three important methods: Group, Sort, and Reduce.
Group: Groups are created over the retrieved results. Each group holds one category of the retrieved data, and the categories are kept in separate groups for further processing.
Sort: The sort process decides which group of data is displayed on top.
Reduce: This process removes the dust, i.e., the duplicate URLs present in the data groups.

4.3 APPLYING LDA CRATS DATA
CRATS is a model that jointly mines the latent Communities, Regions, Activities, Topics, and Sentiments based on the important dependencies among these latent variables. We apply the mined data to our results to produce more accurate and relevant results from the retrieved URL data.

4.4 REMOVING DUST USING GEO-SOCIAL DATA
Here we remove unwanted results based on the geo-social data obtained from the CRATS data above. For example, if a user searches from one particular geo-location, the system looks for results appropriate to that location and prioritizes that set of URLs so that they are displayed at the top of the results. Other valid keys, such as the time and the type of data the user needs, also influence the resulting dataset.

4.5 ALGORITHM
Input: URLs for the keyword query
Output: Surviving URL sets with the user-attribute desired key
Step 1: Group the URL data based on their types.
Step 2: Sort those groups based on the query results.
Step 3: Remove duplicate or repeated URLs based on their output page results.
Step 4: Get user data.
Step 5: Get geo-social data of the user's location.
Step 6: Apply the geo-social data to the resultant data.
Step 7: Prioritize URLs that match the user's geo-social data.
Step 8: Finalize the results.
Step 9: Return the surviving URL sets.

V. RESULT AND DISCUSSIONS
We use two document collections in our experiments: GOV2 and WBR10. The GOV2 dataset consists of a snapshot of 25,205,179 individual documents fetched from US government domains. According to the TREC track information, some duplicate documents have already been removed from GOV2. The GOV2 TREC dataset contains about 3.42 million duplicate URLs divided into about 1.43 million dup-clusters. These documents were grouped by creating a small fingerprint of their content and hashing the URLs with identical fingerprints into the same clusters.

WBR10 is a collection of over 150 million web pages crawled from the Brazilian domain using an actual Brazilian crawling system. This crawl was performed from September to October, 2014, with no restrictions regarding content duplication or quality. To identify groups of duplicate URLs in WBR10, we adopted the same approach used by the authors in [11]. Thus, we scanned the collection to find the web sites that explicitly indicate the canonical URLs in their pages. By doing this, we identified about 3.95 million duplicate documents in WBR10, for a total of about 1.14 million dup-clusters. Although WBR10 is six times larger than GOV2, it has only 15 percent more DUST identified. This was expected, since webmasters are not obliged to identify canonical URLs.
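The fingerprint-based grouping used to form the GOV2 dup-clusters can be sketched as follows. The source only says that a small fingerprint of the content is created and that URLs with identical fingerprints are hashed into the same cluster; the use of SHA-1 over the full page text, the function names, and the sample pages are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def fingerprint(content):
    """Return a small fingerprint of a page's content (SHA-1 is an assumed choice)."""
    return hashlib.sha1(content.encode("utf-8")).hexdigest()


def dup_clusters(pages):
    """Group URLs whose content fingerprints are identical.

    pages maps each URL to its fetched content. Only clusters holding
    more than one URL are returned, since a single URL is not a duplicate.
    """
    clusters = defaultdict(list)
    for url, content in pages.items():
        clusters[fingerprint(content)].append(url)
    return [sorted(urls) for urls in clusters.values() if len(urls) > 1]


if __name__ == "__main__":
    # Hypothetical pages: the first two URLs return identical content.
    pages = {
        "http://example.gov/report": "Annual report",
        "http://example.gov/report/index.html": "Annual report",
        "http://example.gov/contact": "Contact us",
    }
    clusters = dup_clusters(pages)
    print(len(clusters), "dup-cluster(s),",
          sum(len(c) for c in clusters), "duplicate URLs")
```

An exact hash only catches identical content; grouping near-duplicates that differ in counters, dates, or advertisements would need a similarity-tolerant fingerprint instead.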
5.1 EXISTING METHOD
5.2 OUR METHOD
[Tables: columns Data Set, Method, Candidates, Valid Rate; rows for GOV2 with the methods R(Fanout-10), R(tree), and Duster]

5.3 GRAPH REPRESENTATION
[Graphs: EXISTING METHOD, OUR METHOD]

These two methods were chosen due to their performance in previous experiments, which indicates that they represent the best options found in the literature for deduplicating URLs.

VI. CONCLUSION
The DUST-removing modules presented in this paper address the problems that existed in previous systems. The system keeps a log of websites and then pairs the websites. Users can therefore use the websites freely and be confident that their credentials are protected. The system also gives users the opportunity to change their websites and to view the original websites. The system is simple and user-friendly, and users can avail themselves of its services easily.

VII. FUTURE WORK

In this paper, we discussed storing log details on the server without any duplication. However, the server can take considerable time to perform each database operation, so in future work we plan to reduce loading times and make the server database more efficient. We also plan to maintain the server log country-wise: because the log is maintained on a global server, the same algorithm would be applied on each country's server.

REFERENCES
[1] S. Abiteboul, M. Preda, and G. Cobena. Adaptive online page importance computation. In WWW '03: Proceedings of the 12th international conference on World Wide Web, pages 280-290, May
[2] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the dust: different urls with similar text. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 111-120, May
[3] T. Berners-Lee, L. Masinter, and M. McCahill. Uniform resource locators (url),
[4] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: different URLs with similar text. Technical Report CCIT Report #601, Dept. Electrical Engineering, Technion,
[5] K. Bharat and A. Z. Broder. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content. Computer Networks, ,
[6] Daniel P. Lopresti et al., Models and Algorithms for Duplicate Document Detection, Fifth International Conference on Document Analysis and Recognition, Bangalore, India, September
[7] Chen Jin-yan et al., Finding near replicas of Web pages based on Fourier transform, Computer Applications (28:4), 2008, p
[8] A. Agarwal, H. S. Koppula, K. P. Leela, K. P. Chitrapura, S. Garg, P. Kumar GM, C. Haty, A. Roy, and A. Sasturkar, Url normalization for de-duplication of web pages, in Proc. 18th ACM Conf. Inf. Knowl. Manage., 2009, pp
[9] B. S. Alsulami, M. F. Abulkhair, and F. E. Eassa, Near duplicate document detection survey, Int. J. Comput. Sci. Commun. Netw., vol. 2, no. 2, pp ,
[10] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, Do not crawl in the dust: Different urls with similar text, ACM Trans. Web, vol. 3, no. 1, pp. 3:1-3:31, Jan
[11] G. Blackshields, F. Sievers, W. Shi, A. Wilm, and D. G. Higgins. (2010). Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol. Biol. [Online]. 5, p. 21. Available: ide/ html
[12] C. L. A. Clarke, N. Craswell, and I. Soboroff, Overview of the TREC 2004 terabyte track, in Proc. 13th Text Retrieval Conf., 2004, pp
[13] A. Dasgupta, R. Kumar, and A. Sasturkar, Deduping urls via rewrite rules, in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp
[14] D. F. Feng and R. F. Doolittle. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. [Online]. 25(4), pp Available: gov/pubmed/


More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information
