Near Duplicate URL Detection for Removing Dust Unique Key
Volume-7, Issue-5, September-October 2017
International Journal of Engineering and Management Research

Near Duplicate URL Detection for Removing Dust Unique Key

R. Vijaya Santhi
Research Scholar, Department of Computer Science, Tamil University College, Thanjavur, Tamil Nadu, INDIA

ABSTRACT

Regular parallel algorithms for mining frequent item sets intend to balance load by partitioning data equally among a group of computing nodes, but the existing parallel frequent item set mining algorithms have serious performance issues: in a big data environment they suffer high communication and mining overhead induced by redundant data transmitted among the computing nodes. We explore this problem by developing a data partitioning approach using the MapReduce programming model. The aim of this paper is to enhance the performance of parallel frequent item set mining on Hadoop clusters. Incorporating a similarity metric and the Locality-Sensitive Hashing technique, the proposed VUK (Valid Unique Key) DUST-removing technique uses LDA-CRATS-mined data to run this approach. The approach derives quality rules by taking advantage of a multi-sequence alignment strategy. It demonstrates that a full multi-sequence alignment of URLs with duplicated content, performed before the generation of the rules, can lead to the deployment of very effective rules. Evaluating this method, we observed larger reductions in the number of duplicate URLs than our best baseline, with gains of 85 to percent in two different web collections.

Keywords-- MapReduce, URL duplication, multi-sequence alignment, Hadoop

I. INTRODUCTION

The DUST problem is the following: the web is abundant with DUST, Different URLs with Similar Text. For example, the URLs and return similar content. A single web server often has multiple DNS names, and any of them can be typed in the URL. Many DUST instances are artifacts of a particular web server implementation.
For example, URLs of dynamically generated pages often include parameters; which parameters impact the page's content is up to the software that generates the pages. Some sites use their own conventions; for example, a forum site we studied allows accessing story number num both via the URL and via num. Our study of the CNN web site discovered that URLs of the form get redirected to. Universal rules, such as adding or removing a trailing slash, are used in order to obtain some level of canonization. By knowing DUST rules, one can dramatically reduce the overhead of this process. But how can one learn site-specific DUST rules? Detecting DUST from a URL list: most of our work therefore focuses on substring substitution rules, which are similar to the "replace" function in many editors. DustBuster uses three heuristics, which together are very effective at detecting likely DUST rules and distinguishing them from false rules. The first heuristic is based on the observation that if a rule α → β is common in a web site, then we can expect to find in the URL list multiple examples of pages accessed both ways. For example, in the site where story?

Copyright Vandana Publications. All Rights Reserved.

II. LITERATURE SURVEY

2.1 URL NORMALIZATION FOR DE-DUPLICATION OF WEB PAGES

The presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and utilize these learnt rules for de-duplication using just the URL strings, without fetching the content explicitly. Our technique is composed of mining the crawl logs and utilizing clusters of similar pages to extract specific rules from the URLs belonging to each cluster. Preserving each mined rule for de-duplication is not efficient, due to the large number of specific rules.
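The first heuristic lends itself to a short sketch: every pair of URLs in the list votes for the substring substitution α → β that turns one URL into the other, once the common prefix and suffix are stripped, and substitutions with high support become likely DUST rules. The following is only an illustration of that idea, not DustBuster itself; the URL list and the quadratic pairwise scan are our own assumptions.

```python
from collections import Counter

def candidate_rules(urls):
    """Count candidate substring-substitution rules (alpha -> beta).

    For every pair of URLs, strip the longest common prefix and suffix;
    the differing middles (alpha, beta) form a candidate DUST rule, and
    the number of URL pairs exhibiting it is the rule's support."""
    rules = Counter()
    for i, u in enumerate(urls):
        for v in urls[i + 1:]:
            # strip the common prefix
            p = 0
            while p < min(len(u), len(v)) and u[p] == v[p]:
                p += 1
            # strip the common suffix (without overlapping the prefix)
            s = 0
            while s < min(len(u), len(v)) - p and u[-1 - s] == v[-1 - s]:
                s += 1
            a, b = u[p:len(u) - s], v[p:len(v) - s]
            if a != b:
                rules[(a, b)] += 1
    return rules

# illustrative URL list in the spirit of the story-number example
urls = [
    "http://site.com/story?id=42",
    "http://site.com/news/42",
    "http://site.com/story?id=7",
    "http://site.com/news/7",
]
print(candidate_rules(urls).most_common(2))
```

On this list the rule ("story?id=", "news/") gains support 2, i.e., two pairs of pages are accessible both ways, which is exactly the signal the heuristic looks for.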
We present a machine learning technique to generalize the set of rules, which reduces the resource footprint enough to be usable at web scale. The rule extraction technique is robust against web-site-specific URL conventions. We demonstrate the effectiveness of our techniques through experimental evaluation.

2.2 DO NOT CRAWL IN THE DUST: DIFFERENT URLS WITH SIMILAR TEXT

We focus on URLs with similar contents rather than identical ones, since different versions of the same
document are not always identical; they tend to differ in insignificant ways, e.g., counters, dates, and advertisements. Likewise, some URL parameters impact only the way pages are displayed (fonts, image sizes, etc.) without altering their contents.

Detecting DUST from a URL list: contrary to initial intuition, we show that it is possible to discover likely DUST rules without fetching a single web page. We present an algorithm, DustBuster, which discovers such likely rules from a list of URLs. Such a URL list can be obtained from many sources, including a previous crawl or web server logs. The rules are then verified (or refuted) by sampling a small number of actual web pages. At first glance, it is not clear that a URL list can provide reliable information regarding DUST, as it does not include actual page contents.

2.3 THE TREC 2006 TERABYTE TRACK

The primary goal of the Terabyte Track is to develop an evaluation methodology for terabyte-scale document collections. In addition, we are interested in efficiency and scalability issues, which can be studied more easily in the context of a larger collection. TREC 2006 is the third year for the track. The track was introduced as part of TREC 2004, with a single ad hoc retrieval task. For TREC 2005, the track was expanded with two optional tasks: a named page finding task and an efficiency task. These three tasks were continued in 2006, with 20 groups submitting runs to the ad hoc retrieval task, 11 groups submitting runs to the named page finding task, and 8 groups submitting runs to the efficiency task. This report provides an overview of each task, summarizes the results, and outlines directions for the future. Further background information on the development of the track can be found in the 2004 and 2005 track reports.
2.4 THE RESEARCH OF WEB PAGE DE-DUPLICATION BASED ON WEB PAGES RESHIPMENT STATEMENT

The web page de-duplication module is an important part of a search engine system; it can improve the system's performance and quality by filtering the web pages downloaded by the crawler and eliminating the duplicated ones. Along with the development of the Internet, search engines play an increasingly important role, while netizens have ever more requirements for search techniques. However, reshipment of important news and other information among different websites causes a large number of duplicated web pages in retrieval results.

2.5 MODELS AND ALGORITHMS FOR DUPLICATE DOCUMENT DETECTION

As information management and networking technologies continue to proliferate, document image databases are growing rapidly in size and importance. A key problem facing such systems is determining whether duplicates already exist in the database when a new document arrives. This is challenging both because of the various ways a document can become degraded and because of the many possible interpretations of what it means to be a duplicate.

2.6 FINDING NEAR REPLICAS OF WEB PAGES BASED ON FOURIER TRANSFORM

Removing duplicated web pages can improve searching accuracy and reduce data storage space. In this paper, each character was mapped to a semantic value by a Karhunen-Loeve (K-L) transform of the relationship matrix, so that each document was transformed into a series of discrete values. By a Fourier transform of the series, each web page was expressed as several Fourier coefficients, and the similarity between two web pages was then calculated from those coefficients. Experimental results show that this method can find similar web pages efficiently.

III. SYSTEM ANALYSIS

3.1 PROBLEM DEFINITION

We view URLs as strings over an alphabet Σ of tokens. Tokens are either alphanumeric strings or non-alphanumeric characters.
In addition, we require every URL to start with the special token ^ and to end with the special token $ (^ and $ are not included in Σ). For example, the URL is represented by the following sequence of 15 tokens: ^, http, :, /, /, www, ., site, ., com, /, index, ., html, $. We denote by U the space of all possible URLs. A URL u is valid if its domain name resolves to a valid IP address and its contents can be fetched by accessing the corresponding web server (the HTTP return code is not in the 4xx or 5xx series). If u is valid, we denote by doc(u) the returned document.

DUST: two valid URLs u1, u2 are called DUST if their corresponding documents, doc(u1) and doc(u2), are similar.

DUST RULES: in this thesis, we seek general rules for detecting when two URLs are DUST. A DUST rule φ is a relation over the space of URLs; φ may be a many-to-many relation. Every pair of URLs belonging to φ is called an instance of φ. The support of φ, denoted support(φ), is the collection of all its instances.

3.2 PROPOSED SYSTEM

The proposed technique applies a novel URL de-duplication technique to the URL data and also prioritizes user results based upon their geo-social data. URL de-duplication is performed through a MapReduce algorithm that removes duplicate data by eliminating the same type and same set of data within a particular group of URL data. Geo-social data are mined through the LDA-CRATS technique to find the type of data the user is searching for. This makes the results more accurate than previously suggested works, because all previous works are based on user clicks and frequent-mining techniques, which do not perform well on Hadoop architectures due to their inability to process large datasets.
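The tokenization above is straightforward to realize in a few lines; a minimal sketch (the regular expression is our own illustration, not code from the paper):

```python
import re

def tokenize(url):
    """Break a URL into tokens: maximal alphanumeric runs and single
    non-alphanumeric characters, wrapped in the special ^ and $ markers."""
    return ["^"] + re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url) + ["$"]

# the 15-token example from the text
print(tokenize("http://www.site.com/index.html"))
# ['^', 'http', ':', '/', '/', 'www', '.', 'site', '.', 'com',
#  '/', 'index', '.', 'html', '$']
```

The alternation tries the alphanumeric-run branch first, so "http" comes out as one token while each ":" and "/" becomes its own token, matching the 15-token example.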
MERITS:
- Web-crawler performance is increased, since MapReduce is used to remove duplicate URLs before mining.
- Response time is more sufficient for users to get accurate results for their queries.
- Multiple alignment reduces the mapping processes between the nodes; the high impact and edge weight of the arbitrary density point give better results for the users.

IV. METHODOLOGY

ARCHITECTURE DIAGRAM

4.1 URL WEB DATA CATEGORIZING

URL web data categorizing is the process of categorizing URL data based upon which type of data a site provides and which type of website it is. For example, Facebook, Twitter and Orkut are social media websites and are categorized as social websites, while Snapdeal, Amazon and Flipkart are online business websites and are categorized as business websites. This type of categorization is performed for easy access to the data in the URL clusters, so that searching a large cluster of URL data becomes simpler and faster.

4.2 APPLYING MAPREDUCE

MapReduce is a popular data processing paradigm for efficient and fault-tolerant workload distribution in large clusters. A MapReduce computation has two phases, namely the Map phase and the Reduce phase. The Map phase splits the input data into a large number of fragments, which are evenly distributed to Map tasks across a cluster of nodes to process. Each Map task takes in a key-value pair and then generates a set of intermediate key-value pairs. After the MapReduce runtime system groups and sorts all the intermediate values associated with the same intermediate key, the runtime system delivers the intermediate values to Reduce tasks. Each Reduce task takes in all intermediate pairs associated with a particular key and emits a final set of key-value pairs. MapReduce applies the main idea of moving computation towards data, scheduling Map tasks onto the nodes closest to where the input data is stored in order to maximize data locality. Hadoop is one of the most popular MapReduce implementations.
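The two phases can be illustrated with a tiny in-memory simulation. This is a sketch of the programming model, not Hadoop itself; the fingerprint-based DUST-removal job, the record layout, and all names below are our own assumptions.

```python
from itertools import groupby
from operator import itemgetter
import hashlib

def run_mapreduce(records, mapper, reducer):
    """In-memory sketch of the two MapReduce phases: map every input
    record to intermediate key-value pairs, group and sort them by key
    (the shuffle performed by the runtime), then reduce each key group."""
    intermediate = [kv for rec in records for kv in mapper(rec)]
    intermediate.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in g])
               for k, g in groupby(intermediate, key=itemgetter(0))]
    return [reducer(k, vs) for k, vs in grouped]

# DUST-removal job: map each (url, page content) record to a
# (content fingerprint, url) pair; reduce keeps one URL per fingerprint.
def mapper(record):
    url, content = record
    return [(hashlib.md5(content.encode()).hexdigest(), url)]

def reducer(fingerprint, urls):
    return sorted(urls)[0]  # one survivor per duplicate cluster

pages = [
    ("http://site.com/story?id=42", "breaking news"),
    ("http://site.com/news/42", "breaking news"),
    ("http://site.com/about", "about us"),
]
print(run_mapreduce(pages, mapper, reducer))
```

The two URLs with identical content hash to the same intermediate key and end up in the same Reduce group, so only one of them survives; this is the same group-by-fingerprint idea applied to GOV2 in the results section.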
Both the input and output pairs of a MapReduce application are managed by the underlying Hadoop Distributed File System (HDFS). At the heart of HDFS is a single NameNode, a master server that manages the file system namespace and regulates file access. The Hadoop runtime system establishes two kinds of processes, called JobTracker and TaskTracker. The JobTracker is responsible for assigning and scheduling tasks; each TaskTracker handles the mappers or reducers assigned by the JobTracker.

The MapReduce algorithm here has three important methods: Group, Sort, and Reduce.
Group: we create groups over the retrieved results; each group contains one category of the retrieved data, and the categories are kept in separate groups.
Sort: the sort process decides which group of data is to be displayed on top.
Reduce: this process removes the DUST, i.e., the duplicate URLs present in the data groups.

4.3 APPLYING LDA-CRATS DATA

CRATS jointly mines the latent Communities, Regions, Activities, Topics, and Sentiments based on the important dependencies among these latent variables. We apply the mined data to our results to produce more accurate and relevant results on the retrieved URL data.

4.4 REMOVING DUST USING GEO-SOCIAL DATA

Here we remove unwanted results based upon the geo-social data obtained from the CRATS data above. For example, if a user searches from one particular geo-location, the output is restricted to results appropriate to that location, and that set of URLs is prioritized to the top order of the results; other valid keys, such as the time and the type of data the user needs, also influence the resultant dataset.

4.5 ALGORITHM

Input: URL for keywords query
Output: Survived URL sets with the user's desired attribute key
Step 1: Group the URL data based upon their types.
Step 2: Sort those groups based upon the query results.
Step 3: Remove duplicate or repeated URLs based upon their output page results.
Step 4: Get the user data.
Step 5: Get the geo-social data of the user's location.
Step 6: Apply the geo-social data to the resultant data.
Step 7: Prioritize the URLs that match the user's geo-social data.
Step 8: Finalize the results.
Step 9: Return the survived URL sets.

V. RESULT AND DISCUSSIONS

We use two document collections in our experiments. The GOV2 dataset consists of a snapshot of 25,205,179 individual documents fetched from US government domains. According to the TREC track information, some duplicate documents have already been removed from GOV2. The GOV2 TREC dataset contains about 3.42 million duplicate URLs divided into about 1.43 million dup-clusters; these documents were grouped by creating a small fingerprint of their content and hashing the URLs with identical fingerprints into the same clusters. WBR10 is a collection of over 150 million webpages crawled from the Brazilian domain using an actual Brazilian crawling system. This crawling was performed from September to October 2014, with no restrictions regarding content duplication or quality. To identify groups of duplicate URLs in WBR10, we adopted the same approach used by the authors in [11]: we scanned the collection to find the web sites which explicitly indicate the canonical URLs in their pages. By doing this, we identified about 3.95 million duplicate documents, for a total of about 1.14 million dup-clusters. Although WBR10 is six times larger than GOV2, it has only 15 percent more DUST identified. This was expected, since webmasters are not obliged to identify canonical URLs.
5.1 EXISTING METHOD

Results table for the existing methods: for each data set (GOV2) and method (R(Fanout-10), R(tree), Duster), the number of candidates and the valid rate (%).

5.2 OUR METHOD

Results table for our method: for each data set and method, the number of candidates and the valid rate (%).

5.3 GRAPH REPRESENTATION

Graph comparing the existing method with our method.

These two methods were chosen due to their performance in previous experiments, which indicates that they represent the best options found in the literature for de-duplicating URLs.

VI. CONCLUSION

Thus, this work on DUST-removing modules has solved the problems that existed in previous systems. Since the system keeps a log of the websites and then pairs the websites, the user feels free to use the websites and can be sure that his credentials are protected, as the system gives him the opportunity to change his websites and to view the original websites. The system is simple and user-friendly, and its services can be availed easily.

VII. FUTURE WORK
In this paper, we discussed storing the log details on the server without any duplication. However, the server can take more time for every operation on the database, so in future work we have to reduce the loading time and make the server database more efficient. We also have to maintain the server log country-wise: because this is a global server log, we should follow this algorithm on every country's server.

REFERENCES

[1] S. Abiteboul, M. Preda, and G. Cobena, Adaptive on-line page importance computation, in Proc. 12th Int. Conf. World Wide Web (WWW '03), pp. 280-290, May 2003.
[2] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, Do not crawl in the DUST: different URLs with similar text, in Proc. 16th Int. Conf. World Wide Web (WWW '07), pp. 111-120, May 2007.
[3] T. Berners-Lee, L. Masinter, and M. McCahill, Uniform resource locators (URL).
[4] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, Do not crawl in the DUST: different URLs with similar text, Technical Report CCIT Report #601, Dept. Electrical Engineering, Technion.
[5] K. Bharat and A. Z. Broder, Mirror, mirror on the web: a study of host pairs with replicated content, Computer Networks.
[6] D. P. Lopresti et al., Models and algorithms for duplicate document detection, in Proc. Fifth Int. Conf. Document Analysis and Recognition, Bangalore, India, September 1999.
[7] Chen Jin-yan et al., Finding near replicas of web pages based on Fourier transform, Computer Applications, 28(4), 2008.
[8] A. Agarwal, H. S. Koppula, K. P. Leela, K. P. Chitrapura, S. Garg, P. Kumar GM, C. Haty, A. Roy, and A. Sasturkar, URL normalization for de-duplication of web pages, in Proc. 18th ACM Conf. Information and Knowledge Management, 2009.
[9] B. S. Alsulami, M. F. Abulkhair, and F. E. Eassa, Near duplicate document detection survey, Int. J. Comput. Sci. Commun. Netw., vol. 2, no. 2.
[10] Z. Bar-Yossef, I. Keidar, and U. Schonfeld, Do not crawl in the DUST: different URLs with similar text, ACM Trans. Web, vol. 3, no. 1, pp. 3:1-3:31, Jan. 2009.
[11] G. Blackshields, F. Sievers, W. Shi, A. Wilm, and D. G. Higgins, Sequence embedding for fast construction of guide trees for multiple sequence alignment, Algorithms Mol. Biol., 5, p. 21, 2010.
[12] C. L. A. Clarke, N. Craswell, and I. Soboroff, Overview of the TREC 2004 terabyte track, in Proc. 13th Text REtrieval Conf., 2004.
[13] A. Dasgupta, R. Kumar, and A. Sasturkar, De-duping URLs via rewrite rules, in Proc. 14th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2008.
[14] D. F. Feng and R. F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol., 25(4), 1987.
More informationThe MapReduce Framework
The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 2, Mar Apr 2017
RESEARCH ARTICLE An Efficient Dynamic Slot Allocation Based On Fairness Consideration for MAPREDUCE Clusters T. P. Simi Smirthiga [1], P.Sowmiya [2], C.Vimala [3], Mrs P.Anantha Prabha [4] U.G Scholar
More informationDo Not Crawl in the DUST: Different URLs with Similar Text
Do Not Crawl in the DUST: Different URLs with Similar Text Ziv Bar-Yossef Idit Keidar Uri Schonfeld Abstract We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent
More informationComparison of Online Record Linkage Techniques
International Research Journal of Engineering and Technology (IRJET) e-issn: 2395-0056 Volume: 02 Issue: 09 Dec-2015 p-issn: 2395-0072 www.irjet.net Comparison of Online Record Linkage Techniques Ms. SRUTHI.
More informationIntroduction to MapReduce (cont.)
Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationI. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].
Focus: Accustom To Crawl Web-Based Forums M.Nikhil 1, Mrs. A.Phani Sheetal 2 1 Student, Department of Computer Science, GITAM University, Hyderabad. 2 Assistant Professor, Department of Computer Science,
More informationAn Overview of Projection, Partitioning and Segmentation of Big Data Using Hp Vertica
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 5, Ver. I (Sep.- Oct. 2017), PP 48-53 www.iosrjournals.org An Overview of Projection, Partitioning
More informationTemplate Extraction from Heterogeneous Web Pages
Template Extraction from Heterogeneous Web Pages 1 Mrs. Harshal H. Kulkarni, 2 Mrs. Manasi k. Kulkarni Asst. Professor, Pune University, (PESMCOE, Pune), Pune, India Abstract: Templates are used by many
More informationImplementation of Aggregation of Map and Reduce Function for Performance Improvisation
2016 IJSRSET Volume 2 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Implementation of Aggregation of Map and Reduce Function for Performance Improvisation
More informationCHAPTER 4 ROUND ROBIN PARTITIONING
79 CHAPTER 4 ROUND ROBIN PARTITIONING 4.1 INTRODUCTION The Hadoop Distributed File System (HDFS) is constructed to store immensely colossal data sets accurately and to send those data sets at huge bandwidth
More informationCloud Computing. Hwajung Lee. Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University
Cloud Computing Hwajung Lee Key Reference: Prof. Jong-Moon Chung s Lecture Notes at Yonsei University Cloud Computing Cloud Introduction Cloud Service Model Big Data Hadoop MapReduce HDFS (Hadoop Distributed
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS
INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 ADAPTIVE HANDLING OF 3V S OF BIG DATA TO IMPROVE EFFICIENCY USING HETEROGENEOUS CLUSTERS Radhakrishnan R 1, Karthik
More informationResearch Article Mobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 DOI:10.19026/ajfst.5.3106 ISSN: 2042-4868; e-issn: 2042-4876 2013 Maxwell Scientific Publication Corp. Submitted: May 29, 2013 Accepted:
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationRelevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search
Algoritmi per IR Web Search Goal of a Search Engine Retrieve docs that are relevant for the user query Doc: file word or pdf, web page, email, blog, e-book,... Query: paradigm bag of words Relevant?!?
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationEaSync: A Transparent File Synchronization Service across Multiple Machines
EaSync: A Transparent File Synchronization Service across Multiple Machines Huajian Mao 1,2, Hang Zhang 1,2, Xianqiang Bao 1,2, Nong Xiao 1,2, Weisong Shi 3, and Yutong Lu 1,2 1 State Key Laboratory of
More informationEvolving To The Big Data Warehouse
Evolving To The Big Data Warehouse Kevin Lancaster 1 Copyright Director, 2012, Oracle and/or its Engineered affiliates. All rights Insert Systems, Information Protection Policy Oracle Classification from
More informationA Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods
A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering
More informationMounica B, Aditya Srivastava, Md. Faisal Alam
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 3 ISSN : 2456-3307 Clustering of large datasets using Hadoop Ecosystem
More informationBatch Inherence of Map Reduce Framework
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287
More informationOpen Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments
Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More informationA Micro Partitioning Technique in MapReduce for Massive Data Analysis
A Micro Partitioning Technique in MapReduce for Massive Data Analysis Nandhini.C, Premadevi.P PG Scholar, Dept. of CSE, Angel College of Engg and Tech, Tiruppur, Tamil Nadu Assistant Professor, Dept. of
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection
More informationMapReduce for Data Intensive Scientific Analyses
apreduce for Data Intensive Scientific Analyses Jaliya Ekanayake Shrideep Pallickara Geoffrey Fox Department of Computer Science Indiana University Bloomington, IN, 47405 5/11/2009 Jaliya Ekanayake 1 Presentation
More informationDepartment of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang
Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';
More informationSurvey on MapReduce Scheduling Algorithms
Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationBig Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela
More informationData Matching and Deduplication Over Big Data Using Hadoop Framework
Data Matching and Deduplication Over Big Data Using Hadoop Framework Pablo Adrián Albanese, Juan M. Ale palbanese@fi.uba.ar ale@acm.org Facultad de Ingeniería, UBA Abstract. Entity Resolution is the process
More informationImproving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,
More informationClean Living: Eliminating Near-Duplicates in Lifetime Personal Storage
Clean Living: Eliminating Near-Duplicates in Lifetime Personal Storage Zhe Wang Princeton University Jim Gemmell Microsoft Research September 2005 Technical Report MSR-TR-2006-30 Microsoft Research Microsoft
More informationDATA DEDUPLCATION AND MIGRATION USING LOAD REBALANCING APPROACH IN HDFS Pritee Patil 1, Nitin Pise 2,Sarika Bobde 3 1
DATA DEDUPLCATION AND MIGRATION USING LOAD REBALANCING APPROACH IN HDFS Pritee Patil 1, Nitin Pise 2,Sarika Bobde 3 1 Department of Computer Engineering 2 Department of Computer Engineering Maharashtra
More informationComprehensive and Progressive Duplicate Entities Detection
Comprehensive and Progressive Duplicate Entities Detection Veerisetty Ravi Kumar Dept of CSE, Benaiah Institute of Technology and Science. Nagaraju Medida Assistant Professor, Benaiah Institute of Technology
More informationGlobal Journal of Engineering Science and Research Management
A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV
More informationAnnouncements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm
Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before
More informationModelling Structures in Data Mining Techniques
Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationObtaining Rough Set Approximation using MapReduce Technique in Data Mining
Obtaining Rough Set Approximation using MapReduce Technique in Data Mining Varda Dhande 1, Dr. B. K. Sarkar 2 1 M.E II yr student, Dept of Computer Engg, P.V.P.I.T Collage of Engineering Pune, Maharashtra,
More informationCS 61C: Great Ideas in Computer Architecture. MapReduce
CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing
More informationIteration Reduction K Means Clustering Algorithm
Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More information