Clone Analysis in the Web Era: an Approach to Identify Cloned Web Pages

Giuseppe Antonio Di Lucca, Massimiliano Di Penta (*), Anna Rita Fasolino, Pasquale Granato
dilucca@unina.it, dipenta@unisannio.it, fasolino@unina.it, quale@libero.it

Dipartimento di Informatica e Sistemistica, Università di Napoli Federico II, Via Claudio 21, 80125 Napoli, Italy
(*) Università del Sannio, Facoltà di Ingegneria, Piazza Roma, I-82100 Benevento, Italy

Abstract

The diffusion of the Internet and of the World Wide Web is producing a substantial increase in the demand for web sites and web applications. The very short time-to-market of a web application, and the lack of a method for developing it, promote an incremental style of development in which new pages are usually obtained by reusing (i.e., cloning) pieces of existing pages, without adequate documentation of these code duplications and redundancies. The presence of clones increases system complexity and the effort needed to test, maintain and evolve web systems; identifying clones may therefore reduce the effort devoted to these activities and facilitate the migration to different architectures. This paper proposes an approach for detecting clones in web sites and web applications, obtained by tailoring the existing methods for detecting clones in traditional software systems. The approach has been assessed by analyzing several web sites and web applications.

1. Introduction

The rapid diffusion of the Internet and of the World Wide Web infrastructure has recently produced a considerable increase in the demand for new web sites and web applications (WAs). The lack of a method for developing these applications, together with the very short time-to-market due to pressing demand, very often results in disordered and chaotic architectures, and in inadequate, incorrect, and incomplete development documentation.
Indeed, the development of a WA is generally performed in an incremental fashion, where additional pages are usually obtained by reusing the code of existing pages or page components, but without explicitly documenting these code duplications and redundancies. This in turn may increase code complexity and the effort required to test, maintain and evolve these applications. Moreover, if WAs are maintained and evolved with the same approach, further duplications and redundancies are likely to be added, and increasing disorder may affect the code structure and worsen its maintainability.

This situation is similar to the one that occurred in the past in the development and maintenance of large systems where, especially as a consequence of poor design and of the maintenance interventions performed, large portions of duplicated code were produced. These portions of duplicated code are generally called clones, and clone analysis is the research field that investigates methods and techniques for automatically detecting duplicated portions of code in software artifacts. The approaches to clone analysis proposed in the literature are suitable for analyzing traditional software systems with a procedural or object-oriented implementation. In particular, methods based on the matching of Abstract Syntax Trees (AST), on the comparison of arrays of specific software metrics, or on the matching of the character strings composing the code have been presented and experimented with.

In the Internet era, web applications are good candidates for clone proliferation, because of the lack of suitable reuse and delegation mechanisms in the languages generally used for implementing them. Moreover, this trend is reinforced by the hurried and unstructured approaches typically used for developing and maintaining web software. In this paper, we propose an approach for detecting clones in web sites or WAs.
The approach has been obtained by tailoring the existing clone analysis methods in order to take into account the specific features of a WA. The approach addresses the detection of clones of static pages implemented in HTML: two HTML pages will be considered clones if they have the same predefined

structural components (or properties), such as the components defining the final rendering of the page in a browser, or the components defining the processing of the application (scripts, applets, modules, etc.). Moreover, two pages can also be considered clones if they are characterized by the same values of predefined metrics.

In order to efficiently address the detection of cloned pages, the technique we propose takes into account only a limited set of components implementing relevant structural features of a page; this limitation, however, does not affect the effectiveness of the approach. These elements are involved in the computation of a distance measure between web pages that can be used to determine the degree of similarity of the pages.

The validity of the proposed technique has been assessed by means of experiments involving several web sites and WAs. The experimental results showed that the approach adequately detects cloned pages. In order to carry out the experiments, a prototype tool has been developed that automatically obtains the distance between pages.

The remaining part of the paper is structured as follows: Section 2 provides a short background on clone analysis, while Section 3 presents our approach to clone analysis. The experiment carried out to assess the approach is described in Section 4, and concluding remarks are given in Section 5.

2. Background

Clone analysis is the research area that investigates methods and techniques for automatically detecting duplicated portions of code, or portions of similar code, in software artifacts. These portions of code are usually called clones. The research interest in this area arose during the 1980s [Ber84] [Hor90] [Jan88] [Gri81] and focused on the definition of methods and techniques for identifying replicated code portions in procedural software systems.
Clone detection can be performed to support different activities, such as recovering the reusable functional abstractions implemented by the clones in order to reengineer the system with more generic components, or correcting software bugs in each cloned fragment. A clone, usually produced by copying and possibly modifying a piece of code implementing a well-defined concept, a data structure, or a processing item, can be generated for several reasons, such as: lack of a good modular design, which prevents effective reuse of a piece of code implementing a common service; use of programming languages that do not provide suitable reuse mechanisms; pressing performance requirements that do not allow the use of delegation and function call mechanisms; undisciplined maintenance interventions that replicate already existing code.

The methods and techniques for clone analysis described in the literature focus either on the identification of clones that consist of exactly matching code portions (exact match) [Bak95, Bak93, Bak95b], or on the identification of clones that consist of code portions that coincide provided that the names of the involved variables and constants are systematically substituted (p-match or parameterized match). The approach to clone detection proposed in [Bal00] and [Bal99] exploits the Dynamic Pattern Matching algorithm [Kon96] [Kon95], which computes the Levenshtein distance between fragments of code: each fragment is represented by a sequence of tokens, and two fragments are considered clones if their Levenshtein distance is under a given threshold. The approach described in [Bax98] exploits the concept of near-miss clone, that is, a fragment of code that partially coincides with another one. Ducasse and Rieger propose an approach to clone detection that is independent of the coding language used for implementing the subject systems [Duc99].
Further approaches, such as the ones proposed in [Kon97] [Lag97] [May96] [Pat99], exploit software metrics concerning the code control flow or data flow.

In the Internet era, web sites and web applications are good candidates for clone proliferation, because of the lack of suitable reuse and delegation mechanisms in the languages generally used for implementing them (1). At the moment, a considerable growth in the size of web sites and WAs can be observed, and the need to maintain these applications effectively is spreading fast [Ric00] [War99]. Therefore, the effectiveness of traditional clone analysis techniques in the context of WAs should be assessed, and suitable approaches for tailoring these techniques to the new context should be investigated.

(1) In general, a web site may be thought of as a static site that may sometimes display dynamic information. In contrast, a WA provides the Web user with a means to modify the site status (e.g., by adding or updating information on the site).

Clones can be looked for in web software with different aims, such as gathering information suitable to support its maintenance or its migration towards a dynamic architecture, and clustering similar or identical structures, which facilitates the process of separating the content from the user interface (which may be a PC browser, a PDA, a WAP phone, etc.). One of the difficulties in analyzing clones in web software derives from the wide set of technologies available for implementing web sites and WAs, which makes it harder to choose the replicated software components to be

looked for. Web sites and WAs include both static pages (i.e., HTML pages saved in a file and always offering the same information and layout to a client system) and dynamic pages (i.e., pages whose content and layout are not permanently saved in a file, but are dynamically generated). Therefore, the concept of clone may involve either the static pages or the dynamic ones. Web pages include a control component (i.e., the set of items determining the page layout, business rule processing, and event management) and a data component (i.e., the information to be read from or displayed to a user). Therefore, the clones to be detected may involve either the control or the data component of a page.

Since the control and the data components of a dynamic page depend on the sequence of events that occurred at runtime, searching for clones in these pages should involve dynamic analysis techniques. Conversely, the structure of a static page is predefined in the file that implements it, and clone detection can be carried out by statically analyzing the file. In this paper, we focus on techniques for detecting clones among static web pages. In particular, a clone will be thought of as an HTML page that includes the same set of tags as another page, since tags are the means used for determining the control component in a static page.

Among the various approaches proposed in the literature for clone analysis, this paper analyzes the technique based on the Levenshtein distance. Moreover, a frequency based approach will be proposed, and the validity and effectiveness of both approaches will be discussed.

3. An approach to clone analysis for web systems

3.1 The Levenshtein distance

The comparison of strings is used with similar aims in several fields, such as molecular biology, speech recognition, and coding theory. One of the most important models for string comparison is the edit distance model, based on the notion of edit operation proposed in 1972 [Ula72].
An edit operation consists of a set of rules that transform a character of a source string into a new character of a target string. The alignment of two strings is a sequence of edit operations that transforms the former string into the latter one. A cost function can be used to associate each edit operation with a cost, and the cost of an alignment is the sum of the costs of the edit operations it includes. The concepts of optimum alignment and longest common subsequence are also related to the definition of the Levenshtein distance.

The edit distance can be defined as the minimum cost required to align two strings; an alignment is optimum if its cost coincides with the minimum cost, that is, the edit distance. If we consider a unitary cost function (i.e., a cost function that associates each edit operation with unitary cost), the edit distance is called the unit edit distance. The unit edit distance is also called the Levenshtein distance: the Levenshtein distance D(x, y) of two strings x and y is the minimum number of insert, replacement or delete operations required to transform x into y.

Moreover, a subsequence of a string is any string obtainable by deleting zero or more characters from the string (unlike a substring, its characters need not be contiguous in the original string). A common subsequence of two strings is a subsequence that is contained in both strings, while the longest common subsequence of two strings is the longest such subsequence. As an example, given the strings informatics and systematics, the longest common subsequence is the string matics. The distance between the two strings is 10 when only insert and delete operations are counted, as in the alignment below; it is 5 if each replacement of one of the five leading characters is counted as a single unit-cost operation.

    i n f o r - - - - - m a t i c s
    - - - - - s y s t e m a t i c s

3.2 Detecting cloned pages by the Levenshtein distance

The computation of the Levenshtein distance requires that an alphabet of distinct symbols be preliminarily defined. In order to define this alphabet, the items implementing the relevant features of a static web page must be identified.
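The definitions of Section 3.1 can be illustrated with a minimal Python sketch. It is not the prototype tool used in the experiments; the function names and the `replace_cost` parameter are assumptions introduced here only to show how the unit-cost distance and the insert/delete-only distance differ on the informatics/systematics example.

```python
def levenshtein(x, y, replace_cost=1):
    """Minimum total cost of the inserts and deletes (cost 1 each) and
    replacements (cost replace_cost) needed to transform x into y.

    x and y may be strings or any sequences of comparable symbols.
    replace_cost is a hypothetical knob: 1 gives the unit edit distance,
    2 makes a replacement as expensive as one delete plus one insert.
    """
    prev = list(range(len(y) + 1))          # distances from "" to prefixes of y
    for i, cx in enumerate(x, start=1):
        curr = [i]                          # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            substitute = prev[j - 1] + (0 if cx == cy else replace_cost)
            curr.append(min(prev[j] + 1,      # delete cx
                            curr[j - 1] + 1,  # insert cy
                            substitute))
        prev = curr
    return prev[-1]


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    a, b = "informatics", "systematics"
    print(lcs_length(a, b))                   # length of "matics"
    print(levenshtein(a, b))                  # unit-cost replacements allowed
    print(levenshtein(a, b, replace_cost=2))  # equivalent to insert/delete only
```

With unit-cost replacements the two strings are 5 operations apart, whereas counting only inserts and deletes gives 10, which equals the sum of the two lengths minus twice the length of the longest common subsequence (11 + 11 - 2 * 6).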
Since our approach focuses on the degree of similarity of the control components of two static pages, disregarding the data components, a candidate alphabet will include the set of HTML tags implementing the control component of a page. In this way, a string composed of all the HTML tags in the page will be extracted from each web page, and the Levenshtein distance between pairs of these strings will be used to compare pairs of pages. Since the Levenshtein distance represents the minimum number of edit operations required to transform a first string into a second string, its value expresses the degree of similarity of two static pages. In particular, if the distance is 0, the pages are cloned pages, while if the distance is greater than 0 but less than a sufficiently small threshold, the pages are candidate near-miss clones.

In order to improve the effectiveness of the approach, both the risk of detecting misleading similarities between pages and the risk of not detecting meaningful similarities have to be minimized. The first type of risk, for instance, may depend on the approach used to manage the set of attributes that characterize each tag. In fact, in HTML the same sequences of attributes can refer to different tags, and their detection may produce false positives if they are not linked to the

correct tag. The second type of risk is connected both with the problem of composite tags, which are sequences of tags providing a result equivalent to another single tag, and with the categories of tags that influence only the format of the data, such as tags for text formatting, font selection, and inserting hypertext links.

These problems can be solved by refining the preliminary alphabet that includes all the HTML tags, substituting each composite tag in the alphabet with its equivalent tag: the resulting alphabet will be called Α2. The set of tags that establish the data formatting will then be eliminated, and a new refined alphabet Α* will be obtained. Α* will include the set of tag attributes too, provided that they are correctly associated with the tag they belong to.

The detection of cloned static pages will therefore be carried out according to the process described in Figure 1. In the first phase, the HTML files are parsed, their tags are extracted, and the composite tags are substituted with their equivalent ones. The resulting strings will be composed of symbols from the Α2 alphabet. These strings will be processed in order to eliminate the symbols that do not belong to the Α* alphabet. The final strings will be submitted to the computation of the Levenshtein distance: the Distance matrix will finally include the distance between each pair of analyzed strings.

Figure 1: The process of cloned page detection. [Flow: HTML files and Α2 alphabet -> Tag extraction and composite tag substitution -> Strings of Α2 symbols -> Elimination of symbols not belonging to Α* (input: Α* alphabet of tags) -> Strings of Α* symbols -> Levenshtein distance computation -> Distance matrix]
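The process of Figure 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the prototype tool: the `FORMATTING_TAGS` set is a hypothetical stand-in for the tags removed to obtain the refined alphabet (the paper does not enumerate them), only attribute names are kept as part of each symbol, and the substitution of composite tags is omitted. Parsing relies on the standard library `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: an illustrative set of tags that only affect data formatting
# (text style, fonts, hyperlinks). The paper does not list them explicitly.
FORMATTING_TAGS = {"a", "b", "i", "u", "em", "strong", "font", "br", "p"}


class TagExtractor(HTMLParser):
    """Collects the sequence of structural tag symbols of one HTML page."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        if tag in FORMATTING_TAGS:
            return
        # Attribute names stay bound to their own tag, so that the same
        # attribute list on two different tags yields two different symbols.
        names = ",".join(sorted(name for name, _ in attrs))
        self.symbols.append(f"{tag}[{names}]" if names else tag)

    def handle_endtag(self, tag):
        if tag not in FORMATTING_TAGS:
            self.symbols.append("/" + tag)


def tag_string(html):
    """Return the list of symbols extracted from the HTML source text."""
    parser = TagExtractor()
    parser.feed(html)
    parser.close()
    return parser.symbols


def edit_distance(a, b):
    """Unit-cost edit distance between two sequences of symbols."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (sa != sb)))
        prev = curr
    return prev[-1]


def distance_matrix(pages):
    """Distance between every pair of pages, given their HTML sources."""
    strings = [tag_string(page) for page in pages]
    return [[edit_distance(s, t) for t in strings] for s in strings]
```

Two pages that share the same table layout but differ in text, attribute values, or inline formatting map to the same symbol string and therefore get distance 0, whereas a page with a different structure gets a positive distance.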
3.3 Detecting cloned pages with a frequency based method

The method based on the Levenshtein distance is in general very expensive from a computational point of view: in order to determine an edit distance, all the possible alignments between the strings should be evaluated until the optimal alignment is determined. The computational complexity of the algorithm for computing the Levenshtein distance is in fact O(n^2), where n is the length of the longer string.

A frequency based method to detect clones in web systems has therefore been investigated too. The method requires that each HTML page be associated with an array whose components represent the frequencies (i.e., the number of occurrences) of each HTML tag in the page. The dimension of the array coincides with the number of HTML tags considered, and the i-th component of the array provides the number of occurrences of the i-th tag in the associated page. Given the arrays associated with each page, a distance function in a vector space can be defined, such as the linear distance or the Euclidean distance. Exact cloned pages will be represented by vectors having zero distance, since they are characterized by the same frequency of each tag, while similar pages will be represented by vectors with a small distance.

Of course this method may produce false positives, since even completely different pages may exhibit the same tag frequencies but not the same sequence of tags, especially when the pages are small or use a limited number of tags. However, the lower precision of this method is counterbalanced by its computational cost, which is lower than that of the Levenshtein distance.

4. A case study

A number of Web systems have been submitted to clone analysis using the proposed approaches, with the aim of assessing their feasibility and effectiveness. A prototype tool that parses the files, extracts the tags, produces the strings and automatically computes the distances between the pages has been developed to support the experiments.
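The frequency based method of Section 3.3 can be sketched in Python as follows. The function names are hypothetical, and reading the "linear distance" as the sum of absolute differences (Manhattan distance) is an assumption, since the paper does not define it.

```python
import math
from collections import Counter


def frequency_vector(tags, alphabet):
    """Array whose i-th component is the number of occurrences of the
    i-th tag of the alphabet in the page's tag sequence."""
    counts = Counter(tags)
    return [counts[tag] for tag in alphabet]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def linear_distance(u, v):
    # Assumption: "linear distance" is taken to be the Manhattan distance.
    return sum(abs(a - b) for a, b in zip(u, v))


if __name__ == "__main__":
    # Two toy pages with the same tags in a different order, and a third
    # page that has one extra table row.
    page_a = ["table", "tr", "td", "/td", "/tr", "/table", "form", "/form"]
    page_b = ["form", "/form", "table", "tr", "td", "/td", "/tr", "/table"]
    page_c = page_a + ["tr", "/tr"]

    alphabet = sorted(set(page_a) | set(page_b) | set(page_c))
    va, vb, vc = (frequency_vector(p, alphabet) for p in (page_a, page_b, page_c))

    print(euclidean_distance(va, vb))  # zero although the tag order differs
    print(linear_distance(va, vc))     # small distance for a similar page
```

The first comparison shows the kind of false positive discussed in Section 3.3: the two pages have identical frequency vectors even though their tag sequences differ. Each vector is built in a single pass over the page, and each comparison costs time proportional to the size of the alphabet rather than to the product of the string lengths.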
This section provides the results of a case study involving a WA implementing a juridical laboratory with the aim of supporting the job of professional lawyers. The WA includes 201 files distributed in 19 directories and

Table 1: The HTML files analyzed in the case study

ID  File name                                                   KB
1   \index.htm                                                  8.07
2   \Specialisti\MainFrame.htm                                  0.411
3   \Specialisti\Specialisti.htm                                1.75
4   \Specialisti\Text.htm                                       2.30
5   \Specialisti\Title.htm                                      0.363
6   \Novita\Brugaletta.htm                                      6.57
7   \Novita\CalendarioTarNA.htm                                 10.6
8   \Novita\CalendarioTarSA.htm                                 11.2
9   \Novita\MainFrame.htm                                       0.509
10  \Novita\Novita.htm                                          1.82
11  \Novita\RivisteConsOrdAvvSa.htm                             31.9
12  \Novita\Text.htm                                            3.30
13  \Novita\Title.htm                                           0.409
14  \Forum\Forum.htm                                            1.79
15  \Forum\MainFrame.htm                                        0.506
16  \Forum\Text.htm                                             0.237
17  \Forum\Title.htm                                            0.4
18  \Common\FrameLeftPulsanti.htm                               4.78
19  \Common\bottomFrame.htm                                     3.21
20  \ChiSiamo\ChiSiamo.htm                                      1.75
21  \ChiSiamo\MainFrame.htm                                     0.494
22  \ChiSiamo\Text.htm                                          3.24
23  \ChiSiamo\Title.htm                                         0.407
24  \Cerca\Cerca.htm                                            1.87
25  \Cerca\MainFrame.htm                                        0.501
26  \Cerca\Text.htm                                             27.3
27  \Cerca\Title.htm                                            0.4
28  \Caso\Caso.htm                                              1.92
29  \Caso\MainFrame.htm                                         0.492
30  \Caso\Text.htm                                              7.29
31  \Caso\Title.htm                                             0.401
32  \Caso\Testi\Autovelox.htm                                   13.6
33  \Caso\Testi\Corruzione_Identificazione_atto.htm             26.4
34  \Caso\Testi\Danno_biologico.htm                             25.3
35  \Caso\Testi\Mobbing.htm                                     40.9
36  \Caso\Testi\Mobbing_nel_pubblico_impiego.htm                3.75
37  \Caso\Testi\Occupazione.htm                                 32.7
38  \Caso\Testi\Oltraggio.htm                                   14.8
39  \Caso\Testi\Parentelemafiose.html                           23.2
40  \Caso\Testi\Problematica_beni_confiscati.htm                33.3
41  \Caso\Testi\Professioni_intellettuali.htm                   29
42  \Caso\Testi\Relazione_attivita_commissario.htm              0.305
43  \Caso\Testi\Responsabilita_amministrativa.htm               13.1
44  \Caso\Testi\Responsabilita_medica.htm                       45.8
45  \Caso\Testi\Responsabilita_medico.htm                       46.9
46  \Caso\Testi\Riflessioni-Omicidio_di_Peppino_Impastato.htm   37.6
47  \Caso\Testi\Societa_miste.htm                               35.2
48  \Caso\Testi\Truffa_in_attivita_lavorativa.htm               44.2
49  \Caso\Testi\Uso_beni_condominiali.htm                       30.7
50  \Caso\Testi\Misure_patrimoniali_nel_sistema.htm             20.6
51  \Archivio\Archivio.htm                                      1.87
52  \Archivio\MainFrame.htm                                     0.43
53  \Archivio\Text.htm                                          12.9
54  \Archivio\Title.htm                                         0.406

Table 2: Pairs of clones with null Levenshtein distance

(3,10)  (3,14)  (3,20)  (3,24)  (3,28)  (3,51)
(9,15)  (9,21)  (9,25)  (9,29)
(10,14) (10,20) (10,24) (10,28) (10,51)
(13,17) (13,23) (13,27) (13,31) (13,54)
(14,20) (14,24) (14,28) (14,51)
(15,21) (15,25) (15,29)
(17,23) (17,27) (17,31) (17,54)
(20,24) (20,28) (20,51)
(21,25) (21,29)
(23,27) (23,31) (23,54)
(24,28) (24,51)
(25,29)
(27,31) (27,54)
(28,51)
(31,54)

Table 3: Clusters of clones

Cluster A   3, 10, 14, 20, 24, 28, 51
Cluster B   9, 15, 21, 25, 29
Cluster C   13, 17, 23, 27, 31, 54

its overall size is 4.26 Mbytes. Its static HTML pages are implemented by 54 files with the htm extension distributed in 10 directories, while 19 files with the asp extension, contained in 4 directories, implement 19 server pages. The remaining files include data or other objects, such as images and logos, to be displayed in the pages.

The 54 HTML files have been submitted to clone analysis according to the proposed approach. The name and the size of each analyzed file are listed in Table 1. The Levenshtein distances between each pair of pages have been computed using the Α2 alphabet, and the Distance matrix has been obtained. The matrix included 46 pairs of perfect cloned pages involving 18 distinct files. The pairs of cloned pages are listed in Table 2, where each page is identified by the file ID shown in Table 1. Moreover, the Distance matrix included 25 pairs of pages with a very low distance, which made them potential near-miss clones.

The 46 pairs of perfect cloned pages have been visualized with a browser in order to validate the results of the analysis, and each pair actually implemented perfect clones. As an example, Figure 2 shows the rendered HTML pages corresponding to the pair of clones (10, 28). In a similar way, the 25 pairs of pages representing near-miss clones have been visualized with the browser, and their relatively small differences confirmed that they could not be considered perfect clones.

Figure 2: A pair of cloned pages

The 18 files implementing the 46 pairs of perfect clones were further analyzed, and they could be grouped into three different clusters of identical or very similar pages. Table 3 reports the three clusters of pages. The pages from the same cluster were actually very similar, and their differences were essentially due to the parametric components providing the information displayed in the pages. Their similarity was essentially due to the frame-based structure of the application.
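The paper does not state how the clusters of Table 3 were derived from the pairs of Table 2. One plausible way, sketched below in Python under that assumption, is to treat each zero-distance pair as an edge and to take the connected components of the resulting graph; the function name `cluster_pairs` is hypothetical.

```python
def cluster_pairs(pairs):
    """Group page IDs into clusters: two pages end up in the same cluster
    if they are linked by a chain of zero-distance pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for page in list(parent):
        groups.setdefault(find(page), []).append(page)
    return sorted(sorted(group) for group in groups.values())


# The 46 pairs with null distance, copied from Table 2.
TABLE_2 = [
    (3, 10), (3, 14), (3, 20), (3, 24), (3, 28), (3, 51),
    (9, 15), (9, 21), (9, 25), (9, 29),
    (10, 14), (10, 20), (10, 24), (10, 28), (10, 51),
    (13, 17), (13, 23), (13, 27), (13, 31), (13, 54),
    (14, 20), (14, 24), (14, 28), (14, 51),
    (15, 21), (15, 25), (15, 29),
    (17, 23), (17, 27), (17, 31), (17, 54),
    (20, 24), (20, 28), (20, 51),
    (21, 25), (21, 29),
    (23, 27), (23, 31), (23, 54),
    (24, 28), (24, 51),
    (25, 29),
    (27, 31), (27, 54),
    (28, 51),
    (31, 54),
]

if __name__ == "__main__":
    for cluster in cluster_pairs(TABLE_2):
        print(cluster)
```

Applied to the pairs of Table 2, this grouping yields the three clusters of Table 3. Because a null distance is transitive, each cluster is fully connected: clusters of 7, 5 and 6 pages account for 21 + 10 + 15 = 46 pairs and 18 files.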
In particular, the pages from cluster A represented the roots of sub-trees of the web site, all reachable from the home page of the application; all the pages of cluster B were implemented by files with the same name, MainFrame.htm, while the pages of cluster C were all implemented by files with the same name, Title.htm.

Using the frequency based method, the same set of clones was obtained and no additional clone was detected. However, this second method produced more near-miss clones than the Levenshtein method. It is worth noting that in all the other experiments we carried out, involving other web systems, the frequency based method always produced the same set of clones detected by applying the Levenshtein distance, and no additional clones (i.e., false positives) were detected.

Even though both approaches detected the same set of clones, their computational costs were significantly different. In particular, the computation of the Levenshtein distance for all pairs of pages required 2 hours and 50 minutes, while just 15 seconds were necessary for computing the frequency based distances (on a PC with a Pentium III 850 MHz processor).

In order to reduce the computational cost of the Levenshtein method and the potential inaccuracy of the frequency based one, an opportunistic approach may be proposed. This approach will use the frequency based method for preliminarily identifying potential pairs of clones, and then apply the Levenshtein method to these pairs only, in order to detect the actual clones and reject the false ones.

5. Conclusions

In this paper an approach to clone analysis in the context of web systems has been proposed. Clone detection makes it possible to highlight the reuse of patterns of HTML tags (i.e., recurrent structures among pages,

implemented by specific sequences of HTML tags), supports web software maintenance, and eases the migration to a model where the content is separated from the presentation. Moreover, identifying clones facilitates the testing process of a WA, since it is possible to partition the pages into equivalence classes and to specify a suitable number of test cases accordingly.

Two methods for clone analysis have been defined and experimented with. We considered as clones the pages having the same control components, even if they differed in their data components. During the experiment, the proposed methods detected clones among static web pages, and a manual verification confirmed the effectiveness of the methods.

The two proposed methods have produced comparable results, but with different computational costs. Since the frequency based method always produced, in all the experiments, the same set of clones obtained by applying the Levenshtein distance method, but with a very low computational cost, it could be an effective method for detecting clones among static web pages.

Future work will be devoted to further experimentation in order to better validate the proposed methods. Moreover, approaches based on the use of other suitable web software metrics to identify clones, as well as further approaches to identify clones among server pages, will be investigated.

References

[Bak93] Baker B. S., A theory of parametrized pattern matching: algorithms and applications, in Proceedings of the 25th Annual ACM Symposium on Theory of Computing, 71-80, May 1993.
[Bak95] Baker B. S., On finding duplication and near duplication in large software systems, in Proc. of the 2nd Working Conference on Reverse Engineering, IEEE Computer Society Press, 1995.
[Bak95b] Baker B. S., Parametrized pattern matching via Boyer-Moore algorithms, in Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 541-550, Jan 1995.
[Bal00] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis K., Advanced clone-analysis to support object-oriented system refactoring, in Seventh Working Conference on Reverse Engineering, 98-107, Nov 2000.
[Bal99] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis K., Measuring clone based reengineering opportunities, in International Symposium on Software Metrics, METRICS 99, IEEE Computer Society Press, Nov 1999.
[Bax98] Baxter I. D., Yahin A., Moura L., Sant'Anna M., Bier L., Clone Detection Using Abstract Syntax Trees, in Proceedings of the International Conference on Software Maintenance, 368-377, IEEE Computer Society Press, 1998.
[Ber84] Berghel H. L., Sallach D. L., Measurements of program similarity in identical task environments, SIGPLAN Notices, 9(8):65-76, Aug 1984.
[Duc99] Ducasse S., Rieger M., Demeyer S., A Language Independent Approach for Detecting Duplicated Code, in Proceedings of the International Conference on Software Maintenance, 109-118, IEEE Computer Society Press, 1999.
[Gri81] Grier S., A tool that detects plagiarism in PASCAL programs, in SIGCSE Bulletin, 13(1), 1981.
[Hor90] Horwitz S., Identifying the semantic and textual differences between two versions of a program, in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 234-245, June 1990.
[Jan88] Jankowitz H. T., Detecting plagiarism in student PASCAL programs, in Computer Journal, 31(1):1-8, 1988.
[Kon96] Kontogiannis K., DeMori R., Merlo E., Galler M., Bernstein M., Pattern Matching for clone and concept detection, in Journal of Automated Software Engineering, 3:77-108, Mar 1996.
[Kon95] Kontogiannis K., DeMori R., Bernstein M., Merlo E., Pattern Matching for Design Concept Localization, in Proc. of the 2nd Working Conference on Reverse Engineering, IEEE Computer Society Press, 1995.
[Kon97] Kontogiannis K., Evaluation Experiments on the Detection of Programming Patterns Using Software Metrics, in Proc. of the 4th Working Conference on Reverse Engineering, 44-54, 1997.
[Lag97] Lagüe B., Proulx D., Merlo E., Mayrand J., Hudepohl J., Assessing the benefits of incorporating function clone detection in a development process, in Proceedings of the International Conference on Software Maintenance 1997, 314-321, IEEE Computer Society Press, 1997.
[May96] Mayrand J., Leblanc C., Merlo E., Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics, in Proceedings of the International Conference on Software Maintenance, 244-253, IEEE Computer Society Press, 1996.
[Pat99] Patenaude J.-F., Merlo E., Dagenais M., Lagüe B., Extending software quality assessment techniques to Java systems, in Proceedings of the 7th International Workshop on Program Comprehension, IWPC 99, IEEE Computer Society Press, 1999.
[Ric00] Ricca F., Tonella P., Web Analysis: Structure and Evolution, in Proceedings of the International Workshop on Web Site Evolution, 76-86, 2000.
[Ula72] Ulam S. M., Some Combinatorial Problems Studied Experimentally on Computing Machines, in Zaremba S. K., Applications of Number Theory to Numerical Analysis, 1-3, Academic Press, 1972.
[War99] Warren P., Boldyreff C., Munro M., The evolution of websites, in Proceedings of the International Workshop on Program Comprehension, 178-185, 1999.