PatternRank: A Software-Pattern Search System Based on Mutual Reference Importance


Atsuto Kubo, Hiroyuki Nakayama, Hironori Washizaki, Yoshiaki Fukazawa
Waseda University, Department of Computer Science and Engineering
Okubo, Shinjuku, Tokyo, Japan
{a.kubo,h-nakayama}@fuka.info.waseda.ac.jp, {fukazawa,washizaki}@waseda.jp

Abstract

There are currently a large number of digitized documents on software patterns (hereafter "patterns") published on the Web. However, there are no search systems available that specialize in pattern documents, so in order to find a desired pattern, users must resort to manual searches using generalized search systems. With a generalized search system, the search result also includes documents that are not pattern documents, and if a large number of results are returned, browsing them for the desired pattern can be very inefficient. We propose a pattern search system as a solution to this problem. The proposed method operates on a set of pattern documents gathered in a repository, attaches an importance value to each of them based on the relationships between them, and uses this value together with sorting and keyword searching. We conducted search experiments using the proposed system, applied to 131 pattern documents published on the Web, in order to evaluate the effectiveness of the system. The experiment results confirmed that patterns can be studied and investigated effectively using the proposed system.

1. Introduction

The term Software Patterns (hereafter "patterns") refers to solutions or guidelines for certain types of problems that occur repeatedly in various situations in software development, taking into account the various constraints that must be considered when solving the problem [15, 20]. Patterns can be used to solve frequently occurring problems efficiently, by reducing the amount of analysis and design work required in software development.
Studying patterns can also provide a deeper understanding of the types of problems handled by each pattern and the underlying software-development techniques. For example, since the Abstract Factory pattern is used in the design of Graphical User Interfaces (GUI), studying the pattern provides insight into design policies used in GUI application frameworks as well as awareness of one of the problems arising in the implementation of GUIs using object-oriented techniques. It can also provide direction on issues like the use of abstract classes in object-oriented design.

Since the release of Design Patterns by Gamma et al. [3], the number of available patterns has increased through the activities of the pattern community, so that now there are a large number of patterns published in the literature or in electronic-document format on the Web [9]. As this number has increased, so has the need for a search system that can efficiently find, among a large number of pattern documents, a reference pattern conforming to the user's needs and intentions. A search system specialized for patterns can improve the efficiency of pattern use in software development, and of the study of patterns.

Our proposed method is built on a repository that contains only pattern documents. As an example, the Portland Pattern Repository [1] is an experimental Web site that gathers pattern documents and allows searches to be performed on them. Supposing that we are developing a system and would like a pattern for designing windows in our GUI, we could enter the word "window" as a query into the Portland Pattern Repository. Of the 32,334 documents in the repository, this query yields 844 documents that are presented in order of document name (as of March 2007).
From the perspective of pattern usefulness this name order is essentially random, and the results also include documents that are not patterns (for example, articles about patterns), so the user may have to read up to 844 documents in order to find the required pattern.

In this paper we propose a method which automatically extracts relationships between patterns from the pattern documents and computes an importance value for each pattern based on these inter-pattern relationships. We call this method the PatternRank method. Computation of an importance value in the PatternRank method is similar to the importance value computed for Web pages in the Page Rank method [11] or for software components in the Component Rank method [7]. The PatternRank method is somewhat different, however, in that it must take into consideration information particular to patterns, such as the fact that the same pattern may be expressed differently in different documents, or belong to several different catalogs.

We also propose an application for the PatternRank method: a search system that applies the pattern importance values from the PatternRank method to search results and presents them in order of importance. The proposed system is able to compute the importance values for the set of patterns returned by a search query.

2. Important patterns and the need for a search system

In this section we explain the format that is generally used to describe patterns, and discuss what sorts of important patterns are needed by users. In the discussion below, "relationship" refers to any sort of relationship that may exist among multiple elements, not limited to simple pattern relationships. We write "inter-pattern relationships" to indicate specifically a relation among multiple patterns. The term "Related" is only used to indicate the Related Patterns item in a pattern description.

2.1. Pattern characteristics

Generally, patterns are described in terms of a collection of items like pattern name, assumed conditions and requirements, problems, forces (constraints that must be considered), solutions, and related patterns [20]. A description of a common design pattern called the Singleton pattern [3] is shown in Figure 1 as an example of a pattern description.
The Pattern Name item gives the name of the pattern as Singleton, the Problem item describes the problem that the pattern solves, and the Solution item describes the solution provided by the Singleton pattern. The Related Patterns item shows that the Singleton pattern references the Abstract Factory pattern [3].

Pattern Name: Singleton
...
Related Patterns: The Singleton design pattern can be used with the Abstract Factory design pattern to guarantee at most one...

Figure 1. Excerpted pattern description of Singleton pattern

As shown in Figure 1, the Singleton pattern is described in the GoF design pattern catalog. A pattern catalog is a collection of patterns, organized and classified according to their characteristics and with their mutual relationships. Patterns are not practical individually, but form pattern systems together with other patterns. By working in cooperation with these related patterns, they form solutions to large problems by gathering together the small solutions represented by each individual pattern [14]. For this reason, patterns in the same catalog are frequently used together.

2.2. Important patterns

Problems occurring in software development can be solved efficiently by using patterns. Patterns are also useful for presenting the types of problems and solutions that occur frequently in software development and for studying software development techniques like programming and object-oriented design [12]. Several other pattern search methods have been proposed [1, 8, 15, 17]. Although some of them improve precision by searching over a repository, they do not apply an ordering to the search result, so search-result effectiveness drops if there are many candidate patterns conforming or related to the user's search query. As a result, it is necessary to automatically apply an ordering to the patterns in the repository according to some index.
It is desirable to have an ordering for the pattern collection in which the more-important patterns come before less-important patterns. From the perspective of pattern use, patterns with the following characteristics are more important:

- Easy to use in software development
- Used often in software development
- Often searched for on the Web
- Referenced often in other pattern documents

Sorting pattern collections based on how frequently they are used in software development would be extremely effective from a pattern-reuse perspective. Patterns that have achieved a high rank through use in past software development have an extremely high probability of being used in future software development as well. However, data on past pattern use must be gathered from each past software development project, so this is extremely difficult to implement in practice.

Sorting a pattern collection based on how frequently patterns are searched for on the Web could be effective from a pattern-reuse perspective. Pattern documents on the Web are likely to be searched for frequently to solve problems occurring in software development, so patterns used often on the Web are likely to be used in the future for software development. However, patterns unrelated to software development that are searched for frequently could achieve a high rank in spite of not being useful for software development. This approach would also require gathering data from various Web sites publishing patterns, making it very difficult to implement.

Sorting pattern documents based on the frequency with which they are referenced within pattern documents could also be useful from a pattern-reuse point of view. Patterns that are referenced frequently by other patterns would be general-purpose, basic patterns, and would achieve high rank. These patterns provide general solutions to many problems, and so highly ranked patterns are more likely to be used in future software development as well. This approach also has the benefit of being relatively easy to implement compared to the other methods. However, since the ordering is based on the viewpoint of the pattern creator rather than the intentions or requirements of the user, it may not reflect the user's needs well.

Considering usefulness and practicality, the PatternRank method uses the frequency of references in other pattern documents as a part of the pattern-importance calculation. A pattern-search system with the ability to sort the pattern set based on the frequency with which each pattern is referenced by other pattern documents will be a useful support tool for using patterns effectively.

When studying patterns, it is important to consider the educational effectiveness of patterns, that is, which pattern among many should be studied first for the best and most efficient understanding.
Studying a collection of patterns in order of importance from the perspective of education would help reduce the study time required and provide a deeper understanding compared to studying patterns in random order. Students can study a pattern more effectively when it refers to patterns they already know. So, when studying patterns in order to learn the development methods behind them, it may be more effective to study the more frequently referenced patterns first, allowing later patterns to be studied more efficiently.

For example, if the references between patterns are as shown in Figure 2, the reference from pattern $p_3$ to $p_1$ indicates that $p_1$ is used as part of the construction of $p_3$, so $p_1$ is likely given as a related pattern in the document for pattern $p_3$. If we are studying $p_3$, and $p_1$ has been learned previously, it will be possible to understand the explanation of $p_3$ in a shorter amount of time. However, if $p_2$ is studied first, the benefit of learning $p_1$ is not available. Thus, in cases like Figure 2, pattern $p_1$ is more important than $p_2$ in terms of learning about pattern $p_3$.

Figure 2. Interpattern relationships

3. Pattern Search based on mutual reference importance

In this section, we describe the PatternRank method, which automatically computes a pattern importance value, based on inter-pattern relationships, for each pattern selected from a collection of pattern documents. Inter-pattern relationships are classified into reference relationships, which indicate that one pattern makes reference to another; same-catalog relationships, which indicate that the patterns belong to the same pattern catalog; and project-use relationships, which indicate that the patterns were used together in the same software project. We propose a search system which presents the set of patterns from a user's search query in an order based on the importance values derived from the PatternRank method.
With the proposed system, application of the PatternRank method allows the user to use the more-important patterns effectively even when the search result contains a very large number of candidates.

3.1. Computing importance value based on inter-pattern relationships

The PatternRank method is a scheme which computes an importance value for elements based on relationships between these elements, drawing on the Page Rank method [11], which computes importance values for Web pages, and the Component Rank method [7], which does so for software components. The Page Rank method handles hyperlink references between pages and uses a rule that says Web pages referenced by important Web pages are also important. The PatternRank method specializes the Page Rank method for use on patterns. Pattern documents in the PatternRank method correspond to Web pages in the Page Rank method, and references to other patterns correspond to hyperlinks in the Page Rank method. Also, as described in section 2.2, the PatternRank method defines importance according to the following rules:

Rule 1: The more patterns that reference a pattern, the more important it is.
Rule 2: The more important the patterns referencing a pattern are, the more important the pattern itself is.

Through rule 1, the importance value of each pattern gives a group of patterns an ordering appropriate for application or for studying: the higher the importance value of a pattern, the more likely it is a general-use pattern and a base for other patterns, and the more likely it is used together with other patterns. For example, the Factory Method pattern in the GoF design-pattern catalog references the Template Method pattern in the implementation of its solution. Here, from both the application and the learning perspectives, if the Template Method pattern is learned first, the Factory Method pattern will be easier to understand. Hence, the Template Method can be considered to be more important.

Rule 2 allows the importance of the referencing patterns to be reflected in the importance of the referenced pattern. As an example, if rule 2 is not used and importance is calculated only from the simple number of references, then the difference between a pattern, $p_1$, referenced only by unimportant patterns and a pattern, $p_2$, referenced by the same number of patterns of much higher importance, would not be expressed. Applying rule 2 here correctly shows that pattern $p_2$ is more important than $p_1$.

An example of computing importance using the PatternRank method is shown in Figure 3. Pattern $p_1$ references pattern $p_2$ and pattern $p_4$, while pattern $p_3$ references $p_4$. Pattern $p_1$ has importance value 0.5, while $p_3$ has importance 0.1. In the PatternRank method, first a weighting for each reference edge is computed from the importance of the referencing pattern and the number of reference edges.
For example, the weights of reference edges $e_{12}$ and $e_{14}$ from the referencing pattern $p_1$ are both 0.25, because there are two edges and 0.5 / 2 = 0.25. Similarly, the weighting for reference edge $e_{34}$ from $p_3$ is 0.1. This distribution of the pattern importance as the weightings on the reference edges is called the distribution ratio. Then the importance of each pattern is computed as the total of the weightings of its incoming reference edges. For example, the importance for pattern $p_4$ is 0.25 + 0.1 = 0.35, and for $p_2$ it is 0.25.

Figure 3. A step of importance propagation

The propagation of importance values through the PatternRank computation is shown in Figure 4. The importance values of all patterns are initially set to total one, with this importance allocated evenly to all patterns, so in Figure 4, each pattern is initially assigned an importance of 0.25. Then, the reference edge weightings are calculated by dividing the importance of the referencing pattern by the number of reference edges. The State pattern references two other patterns, so the weights for the two edges are both 0.25 / 2 = 0.125. Then the pattern importance values are re-calculated from the reference edges by totaling the weightings of the incoming reference edges. In Figure 4, the Fixed Sized Buffer pattern has two incoming edges, so its importance is the sum of the weightings of those two edges. This process is repeated, and the importance value for each pattern converges on a fixed value, as Figure 4 shows for the State and Fixed Sized Buffer patterns.

Figure 4. Convergence process of importance

In the example in Figure 4, all of the patterns reference others. If there are patterns that do not reference other patterns, the importance weighting held by those patterns does not propagate. Weighting does propagate to these patterns from those that reference them, though, so with each iteration the total weighting decreases by the weighting of these patterns, preventing the computation from converging. We discuss how this problem is handled in detail in the section on patterns with no references.
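The propagation step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `propagate` and the dictionary layout are assumptions, and the importance values given to $p_2$ and $p_4$ are placeholders, since the text states only the values for $p_1$ and $p_3$.

```python
def propagate(weights, refs):
    """One uncorrected propagation step (hypothetical helper).

    weights: dict mapping pattern name -> current importance
    refs:    dict mapping pattern name -> list of patterns it references
    """
    new_weights = {p: 0.0 for p in weights}
    for src, targets in refs.items():
        if not targets:
            continue  # a pattern with no references passes nothing on
        share = weights[src] / len(targets)  # edge weight = importance / edge count
        for dst in targets:
            new_weights[dst] += share  # importance = sum of incoming edge weights
    return new_weights


# Values quoted for Figure 3: p1 = 0.5 references p2 and p4; p3 = 0.1 references p4.
# The 0.0 entries are placeholders; they are unused because p2 and p4
# have no outgoing edges in this example.
weights = {"p1": 0.5, "p2": 0.0, "p3": 0.1, "p4": 0.0}
refs = {"p1": ["p2", "p4"], "p2": [], "p3": ["p4"], "p4": []}

step = propagate(weights, refs)      # p2 receives 0.25, p4 receives 0.25 + 0.1
leaked = propagate(step, refs)       # p2 and p4 reference nothing, so all weight is lost
total_after_leak = sum(leaked.values())
```

The second call illustrates the convergence problem mentioned above: once all importance sits on patterns without outgoing references, nothing is propagated and the total weight drops to zero.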

3.1.1 Pattern-Specific importance correction

With the Page Rank and Component Rank methods, importance is computed based only on hyperlinks or dependency relationships between components. However, if pattern importance is computed based only on inter-pattern reference relationships, the following two problems arise.

- If the same pattern document is published on different Web sites, each copy will be treated as a different pattern.
- In cases where patterns in the same catalog are used together frequently even though there is no explicit reference relationship between them, they will be treated as though there is no relationship between them.

For the former, we take the approach of unifying pattern documents that can be considered the same before performing the importance computation. We discuss the details in the section on patterns from different sources that are the same. For the latter, concerning patterns in the same catalog, we classify inter-pattern relationships handled by the PatternRank method into two types as described below.

Pattern-Reference Relationship

A relationship between patterns arising from one pattern referencing another, as described in the pattern document by its author. For two patterns that have a pattern reference relationship, the pattern making the reference is called the referrer and the one being referred to is called the referee. When one pattern references another pattern, it could indicate that the solutions the patterns provide are similar in software structure, that one pattern is a derivative of the other (handling a more specialized problem), or that one pattern uses the other to provide its solution. In this paper, if a pattern $p_1$ makes reference to another pattern $p_2$, we say that a pattern reference relationship exists from $p_1$ to $p_2$. Note that we do not limit this to items mentioned in the Related Patterns section of the pattern format, but also consider any pattern name appearing anywhere in the text of the document to indicate a pattern reference relationship.
Same-Catalog Relationship

If two patterns, $p_1$ and $p_2$, belong to the same pattern catalog, we say they have a Same-Catalog Relationship. Pattern catalogs can be seen as collections of patterns that have mutual relationships among them. Even if patterns in the same catalog do not have an explicit relationship with each other, they are often used together. For example, there is no explicit relationship between the Composite and Proxy patterns from the GoF design-pattern catalog [2][5], but they are often used together. The PatternRank method treats patterns in the same catalog as being related for this reason.

Computation model for the PatternRank Method

Below, we provide a mathematical description of the pattern-importance calculation described above. We express a set of patterns as a directed graph, $G = (P, E)$, consisting of a set, $P$, of vertices, $p$, each representing a pattern, and a set, $E$, of directed edges, $e$, each representing a reference relationship between patterns. The importance value for each pattern is expressed as the weight, $w(p)$, of the corresponding vertex, $p$. By definition we set the total weight of all vertices in the graph to 1, so $0 < w(p) \le 1$ for all vertices at all times.

We express the weighting of edge $e_{ij}$ from vertex $p_i$ to vertex $p_j$ as

$$w'(e_{ij}) = d_{ij}\, w(p_i)$$

The distribution ratios, $d_{ij}$, satisfy

$$\sum_{j} d_{ij} = 1 \quad \text{and} \quad 0 \le d_{ij} \le 1$$

The weight of each vertex, $p_i$, is set equal to the sum of the weights of all edges $e_{ki}$ in the set $IN(p_i)$ of edges terminating at $p_i$. This can be expressed as:

$$w(p_i) = \sum_{e_{ki} \in IN(p_i)} w'(e_{ki})$$

Combining this with the definition of distribution ratios gives the weighting for each vertex as:

$$w(p_i) = \sum_{e_{ki} \in IN(p_i)} d_{ki}\, w(p_k)$$

Next, we express the weights of all vertices as a vector, $W$:

$$W = \begin{pmatrix} w(p_1) \\ w(p_2) \\ \vdots \\ w(p_n) \end{pmatrix}$$

And the matrix, $D$, of all distribution ratios as:

$$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \end{pmatrix}$$

The computation of the weightings for the entire graph can be expressed in terms of $W$ and $D$ as

$$W_{t+1} = D^{T} W_{t}$$

where $D^{T}$ is the transpose of matrix $D$, and the subscript $t$ indicates the iteration number of the computation.

The above calculation is the same as computing an eigenvector of a system of simultaneous equations in linear algebra, and can be carried out iteratively. An iterative method is used in which the initial vector, $x^{(0)}$, is set to some reasonable value, $x^{(t+1)} = A\, y^{(t)}$, and $y^{(t)} = x^{(t)} / c^{(t)}$. As $t \to \infty$, $x$ converges to the eigenvector with the largest eigenvalue, and $c$ converges to that largest eigenvalue [21]. Here, $A$ is an $N$-dimensional square matrix. From the Perron-Frobenius theorem, the absolute value of the maximum eigenvalue of a transition probability matrix is 1, so $y^{(t)} = x^{(t)}$ and $x^{(t+1)} = A\, x^{(t)}$.

In the PatternRank method, the importance value of each pattern is given by the corresponding element of vector $W$ once it has converged to a fixed value as $t \to \infty$. Since an unlimited number of iterations is not possible, the computation is terminated when

$$\frac{\lvert w_{t} - w_{t-1} \rvert}{\lvert w_{t-1} \rvert} < s$$

for a particular threshold value, $s$, where $w_t$ are the pattern vertex weights for the $t$-th iteration of the calculation. Appropriate values for the threshold value $s$ are derived experimentally.

The distribution ratios, $d_{ij}$, are computed taking into account same-catalog relationships and independent patterns that lack pattern-reference relationships.

First, the distribution ratios based only on pattern-reference relationships, $d^{S}_{ij}$, are given by:

$$d^{S}_{ij} = 1 / d_{sum}$$

Here $d_{sum}$ is the number of reference edges originating from the pattern vertex $p_i$.

Then, an overall expression for distribution ratios accounting for same-catalog relationships, $d^{C}_{ij}$, is given by:

$$d^{C}_{ij} = \begin{cases} d^{c}_{ij}, & \text{where } p_i \text{ and } p_j \text{ belong to the same pattern catalog} \\ d^{\bar{c}}_{ij}, & \text{otherwise} \end{cases}$$

$$d^{c}_{ij} = d^{S}_{ij} + d^{e_c}_{ij}, \qquad d^{e_c}_{ij} = c \cdot \sum_{k \notin C} d^{S}_{ik} \,/\, (c_{sum} - 1), \qquad d^{\bar{c}}_{ij} = (1 - c)\, d^{S}_{ij}$$

Then overall distribution ratios considering matching patterns, $d^{M}_{ij}$, are given by:

$$d^{M}_{ij} = \begin{cases} d^{mo}_{ij}, & \text{where the referencing pattern } p_i \text{ matches other patterns} \\ d^{mi}_{ij}, & \text{where the referenced pattern } p_j \text{ matches other patterns} \\ d^{m}_{ij}, & \text{otherwise} \end{cases}$$

$$d^{mo}_{ij} = \sum_{k \in \{m\}} d^{C}_{kj} \,/\, m_{sum}, \qquad d^{mi}_{ij} = \sum_{k \in \{m\}} d^{C}_{ik}, \qquad d^{m}_{ij} = d^{C}_{ij}$$

Finally, the distribution ratios, $d_{ij}$, taking into consideration patterns that are not referenced by other patterns, and patterns that do not reference other patterns, are:

$$d_{ij} = \begin{cases} d^{r}_{ij}, & \text{where } p_i \text{ has any reference to another pattern} \\ d^{\bar{r}}_{ij}, & \text{otherwise} \end{cases}$$

$$d^{r}_{ij} = (1 - r)\, d^{M}_{ij} + d^{e_r}_{ij}, \qquad d^{\bar{r}}_{ij} = (1 - r)\, d^{e_n}_{ij} + d^{e_r}_{ij}, \qquad d^{e_n}_{ij} = 1/n, \qquad d^{e_r}_{ij} = r/n$$

The terms in each equation are discussed further in the following sections.

Pattern-catalog distribution ratios

Even if there is no explicit relationship between patterns in pattern documents that belong to the same pattern catalog, these patterns are often used together. While we can conceive of such a relationship even when it is not explicit, calculations for methods like Page Rank and Component Rank, which are based on explicit relationships between elements, are not able to reflect it in the importance value.

The PatternRank method approaches this problem by assigning pseudo-edges, $e^{c}$, between patterns that belong to the same pattern catalog, and giving these pseudo-edges a weighting that is weaker than that of regular reference edges. The weightings assigned to these pseudo-edges are created by subtracting a fixed proportion of the weighting from the weightings of reference edges going from patterns inside the catalog to patterns outside the catalog (Figure 5). The proportion of weighting subtracted is defined as the pattern catalog distribution ratio. The overall distribution ratios for same-catalog relationships between patterns in a catalog are $d^{C}_{ij}$, the reference-relation distribution ratios between patterns with same-catalog relationships are $d^{c}_{ij}$, and the distribution ratios for references to patterns outside the catalog are $d^{\bar{c}}_{ij}$.
Also, starting from the distribution ratio matrix of ratios $d^{S}_{ij}$, which are based on pattern-reference relationships only, we call the matrix obtained by applying the pattern catalog distribution ratio computation the pattern catalog distribution matrix. The distribution ratios taking same-catalog relationships into account are shown below.

Figure 5. Edges and Pseudo-edges

Table 1. Calculated importance without pattern catalog distribution ratios (columns: Rank, Pattern, Catalog, Importance). Patterns in the order listed: $p_2$, $p_5$, $p_1$, $p_6$, $p_7$, $p_3$, $p_4$ (catalog and importance columns not recoverable).

The overall distribution ratios, $d^{C}_{ij}$, are:

$$d^{C}_{ij} = \begin{cases} d^{c}_{ij}, & \text{where } p_i \text{ and } p_j \text{ belong to the same pattern catalog} \\ d^{\bar{c}}_{ij}, & \text{otherwise} \end{cases}$$

The distribution ratio $d^{c}_{ij}$ for when $p_i$ and $p_j$ belong to the same pattern catalog is:

$$d^{c}_{ij} = d^{S}_{ij} + d^{e_c}_{ij}$$

The distribution ratio $d^{e_c}_{ij}$ for pseudo-edges $e^{c}$ is:

$$d^{e_c}_{ij} = c \cdot \sum_{k \notin C} d^{S}_{ik} \,/\, (c_{sum} - 1)$$

The distribution ratio, $d^{\bar{c}}_{ij}$, for points $p_i$ and $p_j$ with no same-catalog relationship is:

$$d^{\bar{c}}_{ij} = (1 - c)\, d^{S}_{ij}$$

Here $d^{S}_{ij}$ is the distribution ratio calculated based only on pattern reference relationships, $d^{e_c}_{ij}$ is the distribution ratio for pseudo-edge $e^{c}$, $c$ is the pattern catalog distribution ratio, $c_{sum}$ is the total number of patterns in the pattern catalog, and $\sum_{k \notin C} d^{S}_{ik}$ is the total of the distribution ratios to patterns that are not in the same catalog. Since the weight removed from references leaving the catalog is redistributed to the pseudo-edges inside the catalog, the condition $\sum_{j} d_{ij} = 1$ is still satisfied after the distribution ratios are allocated.

As an example of computing distribution ratios with same-catalog relationships, consider the weighting $w'(e^{c}_{14})$ of the pseudo-edge from pattern $p_1$ to pattern $p_4$ in Figure 5, with a pattern catalog distribution ratio of 0.25. The weight of a reference edge is given by $w'(e_{ij}) = w(p_i)\, d^{S}_{ij} + w'(e^{c}_{ij})$, and the weight of the pseudo-edge is $w'(e^{c}_{14}) = w(p_1)\, d^{e_c}_{14}$. With $d^{e_c}_{14} = 0.25 \times 1 / (3 - 1) = 0.125$, the weight of the pseudo-edge $e^{c}_{14}$ is $w(p_1) \times 0.125$. Considering the weight $w'(e_{43})$ of the reference edge from pattern $p_4$ to pattern $p_3$ in the same way, we have $w'(e_{43}) = w(p_4)\, d^{S}_{43} + w'(e^{c}_{43})$. Here $w(p_4)\, d^{S}_{43} = 0.05$, and the pseudo-edge weight $w'(e^{c}_{43}) = w(p_4)\, d^{e_c}_{43}$ is obtained from $d^{e_c}_{43}$ with $c_{sum} - 1 = 3 - 1$ in the same way as above, so the reference edge weight $w'(e_{43})$ is the sum of 0.05 and this pseudo-edge weight.

Table 2. Calculated importance with pattern catalog distribution ratios (columns: Rank, Pattern, Catalog, Importance). Patterns in the order listed: $p_5$, $p_3$, $p_7$, $p_6$, $p_1$, $p_3$, $p_4$ (catalog and importance columns not recoverable).

By applying the distribution ratios that take same-catalog relationships into account to the pattern importance computation, the importance of other patterns in pattern catalogs containing many important patterns is raised overall. For example, if the importance computation is performed until convergence without using pattern catalog distribution ratios for Figure 5, the importance values for each pattern would be as shown in Table 1. Table 2 shows the corresponding importance values when using pattern catalog distribution ratios (with pattern catalog distribution ratio $c = 0.15$, correction distribution ratio $r = 0.15$, and threshold value $s = 10^{-8}$). Table 1 shows that pattern catalog $C_2$ has more patterns with high importance than does pattern catalog $C_1$. Also, comparing Tables 1 and 2, use of the pattern catalog distribution ratio raised the level of patterns $p_6$ and $p_7$.

Consideration of patterns from different sources that are the same

For more well-known patterns, there are cases where several different pattern documents written by different authors have been published for the same pattern. In the PatternRank method, the importance calculation is done after pattern documents written by different authors for the same pattern have been merged (Figure 6). The overall distribution ratio considering patterns that are the same is $d^{M}_{ij}$, the ratio when the referenced patterns are the same is $d^{mi}_{ij}$, and the ratio when the referencing patterns are the same is $d^{mo}_{ij}$. When neither the referencing nor the referenced patterns are the same, the distribution ratio is $d^{m}_{ij}$. Expressions defining these are given below.

Figure 6. Concatenation of pattern nodes

Overall distribution ratio considering patterns that are the same (matching), $d^{M}_{ij}$:

$$d^{M}_{ij} = \begin{cases} d^{mo}_{ij}, & \text{where the referencing pattern } p_i \text{ matches other patterns} \\ d^{mi}_{ij}, & \text{where the referenced pattern } p_j \text{ matches other patterns} \\ d^{m}_{ij}, & \text{otherwise} \end{cases}$$

Reference edge distribution ratios, $d^{mo}_{ij}$, when multiple referencing patterns, $p_i$, match:

$$d^{mo}_{ij} = \sum_{k \in \{m\}} d^{C}_{kj} \,/\, m_{sum}$$

Reference edge distribution ratios, $d^{mi}_{ij}$, when multiple referenced patterns, $p_j$, match:

$$d^{mi}_{ij} = \sum_{k \in \{m\}} d^{C}_{ik}$$

Reference edge distribution ratio, $d^{m}_{ij}$, when neither referencing nor referenced patterns match:

$$d^{m}_{ij} = d^{C}_{ij}$$

Here $\sum_{k \in \{m\}} d^{C}_{kj}$ is the total of all reference-edge distribution ratios when multiple referencing patterns, $p_i$, match, $\sum_{k \in \{m\}} d^{C}_{ik}$ is the total of all reference-edge distribution ratios when multiple referenced patterns, $p_j$, match, and $m_{sum}$ is the number of patterns that match.

Consider pattern $p_3$ in Figure 6 as an example of computing the distribution ratios taking matching patterns into account. Pattern $p_1$ references the pattern document containing pattern $p_3$, and pattern $p_2$ references a different document containing $p_3$. In the PatternRank method, first the patterns that are the same and their associated reference edges are merged. The new $p_3$ has two incoming edges with weights of 0.5 and 0.1, and the new weighting of $p_3$ is the total of these two, or 0.6. The method for determining whether two patterns are the same is described in detail in Section 4.
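The merging of matching patterns can be sketched on a small distribution matrix as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper `merge_matching`, the list-of-lists matrix layout, and the outgoing references given to the two $p_3$ documents are all hypothetical. Incoming ratios of the matching documents are summed (as in $d^{mi}$), and their outgoing ratios are averaged over the number of matching documents (as in $d^{mo}$).

```python
def merge_matching(D, names, group):
    """Merge the matrix rows/columns listed in `group` into one pattern node.

    D:     square list-of-lists, D[i][j] = distribution ratio from i to j
    names: pattern names, one per row
    group: indices of documents judged to describe the same pattern
    """
    group = sorted(group)
    keep = [i for i in range(len(D)) if i not in group]
    m_sum = len(group)
    # Referenced side: sum the columns of the matching documents.
    summed = [[row[j] for j in keep] + [sum(row[k] for k in group)] for row in D]
    # Referencing side: average the rows of the matching documents.
    merged_row = [sum(summed[k][j] for k in group) / m_sum
                  for j in range(len(keep) + 1)]
    new_D = [summed[i] for i in keep] + [merged_row]
    new_names = [names[i] for i in keep] + [names[group[0]]]
    return new_D, new_names


# Figure 6 situation: p1 references one document for p3, p2 references another.
# That both p3 documents reference p1 is an assumption made only so that
# every row has an outgoing edge.
names = ["p1", "p2", "p3", "p3 (other source)"]
D = [
    [0.0, 0.0, 1.0, 0.0],   # p1 -> first p3 document
    [0.0, 0.0, 0.0, 1.0],   # p2 -> second p3 document
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
merged_D, merged_names = merge_matching(D, names, [2, 3])

# With importance 0.5 for p1 and 0.1 for p2, the merged p3 receives both edges.
w = [0.5, 0.1]
incoming_p3 = sum(w[i] * merged_D[i][2] for i in range(2))
```

Because columns are summed and rows are averaged, every row of the merged matrix still sums to 1, which is the property the iteration relies on.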
Patterns that are not referenced

For patterns that are not referenced by other patterns, the total weight of incoming reference edges is zero, so their importance value goes to zero. For example, since the weighting for a pattern like pattern $p_1$ is computed from the total weighting of all incoming reference edges, its importance value will become zero. If there are patterns with an importance value of zero, the weighting for any reference edges originating from those patterns will also be zero. For the pattern reference relationships shown in the example in Figure 7, the totals of reference-edge weights for patterns $p_1$ and $p_2$ are both 0.1, but since $p_2$ has two references from patterns with importance zero, this contradicts the first rule of the PatternRank method, that the more references a pattern has, the higher its importance.

Figure 7. Importance propagation including zero importance

To account for this problem, the PatternRank method adds pseudo-edges, e, between all patterns, and gives these pseudo-edges a small weighting. The weights for these pseudo-edges are obtained by deducting a fixed proportion from the total weight of all of the pattern documents. Figure 8 shows how, by adding pseudo-edges to the whole pattern set, references are added to pattern $p_2$, which did not have any earlier. In this way, there are no longer any unreferenced patterns, and thus no patterns with an importance of zero. The proportion deducted from the total weighting is called the correction distribution ratio.

Figure 8. Fixing patterns without incoming reference

We define the final overall distribution ratios, taking into consideration patterns that are not referenced by other patterns and patterns that do not reference other patterns, as $d$. Expressions for distribution ratios taking unreferenced patterns into consideration are given below.

Distribution ratio, $d$, taking into consideration patterns that are not referenced by other patterns, and patterns that do not reference other patterns:

$$ d = \begin{cases} d_r, & \text{where } p_i \text{ has any reference to another pattern} \\ d'_r, & \text{otherwise} \end{cases} $$

The distribution ratio, $d_r$, when $p_i$ references $p_j$ is:

$$ d_r = (1 - r)\, d_M + d' $$

The distribution ratio, $d'$, for pseudo-edge e is:

$$ d' = r / n $$

Here $n$ is the total number of patterns in the set of pattern documents used for the computation, $d'$ is the distribution ratio for pseudo-edge e, and $r$ is the correction distribution ratio, satisfying $0 \le r \le 1$. In this way, patterns that would otherwise have an importance value of zero are given a non-zero importance value, but if the correction distribution ratio, $r$, is made small enough, these weights converge to values that are small enough not to affect the relative rank of the patterns.

Patterns with no references

If there are patterns that do not reference other patterns, the distribution ratios on reference edges from other patterns to these patterns go to zero, so that the constraint $\sum_i d = 1$ no longer holds, and the overall weighting computation described in Section 3.2 will no longer converge. For example, if a pattern, $p_1$, does not reference other patterns, it gets the total weighting of all edges referencing $p_1$, but the distribution ratios for any edges originating at $p_1$ are zero, so with each iteration the total weighting decreases by the amount of all edges ending at $p_1$. The approach taken by the PatternRank method for this problem is to add pseudo-edges, e, from patterns that reference no other patterns to all other patterns. In Figure 9, pattern $p_2$ has no references to other patterns, so pseudo-edges, e, are added from pattern $p_2$ to all other patterns in the pattern set.

Figure 9. Fixing patterns without outgoing reference
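The two pseudo-edge corrections can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `corrected_matrix` and `importance` are hypothetical, and so are the matrix layout (one row per referencing pattern, with each non-empty row summing to 1) and the relative-change stopping rule with threshold `s`. The sketch also assumes that a pattern with no references spreads its weight uniformly as 1/n, and that pseudo-edges include a pattern's edge to itself.

```python
def corrected_matrix(d_m, r):
    """Apply the pseudo-edge corrections to a merged distribution ratio matrix.

    d_m[i][j] is the ratio of pattern i's weight passed to pattern j.
    r is the correction distribution ratio, 0 <= r <= 1.
    """
    n = len(d_m)
    pseudo = r / n  # ratio carried by each pseudo-edge between all patterns
    corrected = []
    for row in d_m:
        if any(v > 0 for v in row):
            # Pattern references others: scale real edges, add pseudo-edges.
            corrected.append([(1 - r) * v + pseudo for v in row])
        else:
            # Pattern references nothing. Assumption: uniform pseudo-edges (1/n).
            corrected.append([1.0 / n] * n)
    return corrected


def importance(d, s=1e-8, max_iter=10000):
    """Propagate weights until the relative change of every weight is at most s."""
    n = len(d)
    w = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(w[i] * d[i][j] for i in range(n)) for j in range(n)]
        if all(abs(a - b) <= s * b for a, b in zip(new, w)):
            return new
        w = new
    return w


# Three hypothetical patterns: p0 -> p1 -> p2, and p2 references nothing.
d_m = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
d = corrected_matrix(d_m, r=0.15)
w = importance(d)
```

In this sketch every row of the corrected matrix sums to 1, so the total weight is preserved from one iteration to the next, and no pattern ends with an importance of zero even though p0 has no incoming references.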
By adding these pseudo-edges, there are no longer any patterns that do not reference other patterns, making it possible to avoid the problem of a decrease in the overall weight. The distribution ratio for reference edges of patterns that reference other patterns is defined as $d_r$, and for patterns that do not reference other patterns, as $d'_r$. Expressions for distribution ratios taking patterns with no references into consideration are given below.

Distribution ratio for pattern, $p_i$, which references other patterns:

$$ d_r = (1 - r)\, d_M + d' $$

Distribution ratio for pattern, $p_i$, which does not reference other patterns:

$$ d'_r = (1 - r)\, d_e + d' $$

Here $d_e$ is the distribution ratio of e. These expressions impose a relatively strong relationship between patterns that would otherwise not be related, but we avoid affecting any particular pattern by giving all pseudo-edges equivalent weighting while representing the state where there are no references to other patterns.

These computations, taking into consideration patterns that are not referenced or that do not reference others, are applied to the distribution matrix discussed in Section 3.1. We call the matrix with these corrected distribution ratios applied the corrected distribution ratio matrix.

Software pattern importance computation process

The distribution ratios, $d$, in the distribution ratio matrix are obtained by first computing the distribution ratios for pattern reference relationships and then applying pattern catalog distribution ratios, same-pattern processing, and corrected distribution ratio processing.

Search system using the PatternRank method

The process for using the proposed system, which applies the PatternRank method, is shown below (Figure 10).

Figure 10. System Overview

1. The user selects multiple pattern catalogs or all pattern documents in the repository as the search space.
2. The system analyzes all pattern documents in the repository (the set of all pattern documents gathered) and extracts the pattern names, the names of the pattern catalogs they belong to, and the names of related patterns, and obtains the pattern relationships from the extracted data. If the same pattern appears with different names in the pattern documents, the system treats it as a pattern that can be referenced by any of the names that appear.
3. The user enters an arbitrary search query string into the system, appropriate to the pattern required.
4. The system performs the importance calculation on the collection of pattern documents, calculating the distribution ratio matrix, the pattern catalog distribution ratio matrix, and the corrected distribution ratio matrix from the extracted pattern relationships. The result is an importance value for each pattern.
5. The system finds the pattern documents that contain the keyword string entered by the user. This is a simple full-text search.
6. The system sorts the search results according to the importance values and presents them to the user. Patterns that have the same importance value are displayed in alphabetical order of the pattern name.

Using the system, users obtain the set of pattern documents that contain the search query, sorted in order of importance, allowing them to efficiently find patterns that meet their objectives or provide useful references.

For example, in Figure 10, a user needs to solve a problem related to generating an object. The user enters the search query "factory" to find a pattern that will provide a reference for solving this problem, and obtains a result like that shown in Figure 11. The proposed system presents 29 pattern documents that contain the query string, like the Abstract Factory, in order of importance as the search result. The first result, Abstract Factory, has several sources listed after "from", indicating that the pattern is published in several different pattern documents by several different authors. Also, the pattern name is a link to the Web site with the pattern, and the alternate sources listed under "from" are also links to the Web sites providing them.

Figure 11. System Overview

The proposed system can also present all of the patterns in the repository in order of the computed importance if no search query is entered. This provides an effective way for a user to examine and study all patterns in the repository.

4. Preliminary Experiment and Demo

The proposed system was implemented, and experiments were done to verify parameter values and evaluate the usefulness of the system.

Preliminary Experiments for Parameter Verification

The parameter verification consisted of preliminary experiments to determine appropriate values for the various parameters used in the importance calculation of the PatternRank method. For experiments 1 to 3, we prepared answer set 1 with four patterns and answer set 2 with 12 patterns.

Experiment 1: Determining an appropriate value for s in the importance-calculation convergence condition, $|w_t - w_{t-1}| / w_{t-1} \le s$. Importance values for all patterns in each of the answer sets, 1 and 2, as well as the entire set of 131 pattern documents in the repository, gathered from the pattern catalogs from four Web sites [2, 5, 6, 18], were

computed using values of s starting from 0.1 and progressing through 0.01, 0.001, and so on. The value of the correction distribution ratio, r, was set to 0.15, the pattern catalog distribution ratio, c, to 0.15, and the calculation was done with no search query. In the experiment, the importance values were observed to stop changing when the value of s reached $10^{-5}$ for answer set 1, and $10^{-7}$ for answer set 2 and the set of 131 pattern documents. From the experiment, we conclude that the s value should be set to $10^{-8}$ or less for the importance value calculation to converge.

Experiment 2: Determining an appropriate value for the correction distribution ratio, r. Importance values were calculated for the patterns in answer sets 1 and 2, setting r to values from 0.0 to 1.0. The stopping threshold, s, was set to $10^{-8}$, the pattern catalog distribution ratio, c, was set to 0.15, and the calculation was done with no search query. Large changes were observed in the resulting pattern importance values when the value of r was set to greater than 0.2. It is desirable that the effect of r on importance values be as small as possible, so r must not be greater than 0.2. From the experiment, we conclude that the value of r must be in the range $0.0 < r \le 0.2$.

Experiment 3: Determining an appropriate value for the pattern catalog distribution ratio, c. Importance values were calculated for the patterns in answer sets 1 and 2, setting the pattern catalog distribution ratio, c, to values from 0.0 to 1.0. The stopping threshold, s, was set to $10^{-8}$, the correction distribution ratio, r, was set to 0.15, and the calculation was done with no search query. The experiment results show that as the pattern catalog distribution ratio is set higher, the importance of the three patterns in Catalog 1 decreases, while that of the other patterns increases.
Also, from the results, it was clear that as the value of c changes, the pattern importance values change together according to which pattern catalog they belong to. The experiments confirmed that it is possible to change the importance of patterns according to catalog by changing the pattern catalog distribution ratio value. However, they did not indicate a clear appropriate value for c. The value of c can be used to specify the importance of same-catalog pattern relationships.

Demonstration

In demonstrations 1 and 2, an existing Web site with a set of patterns was modified to perform searches using the proposed system. Pattern documents conforming to the search query were studied manually beforehand, and we used them as an answer set. We also measured the time from when the search query is entered to when the result is displayed for each demonstration (the program was run ten times and the average time in milliseconds was taken). The demonstration environment was a Pentium 4, 3.2 GHz, 1.0 GB RAM PC with Microsoft Windows Home Edition.

For the proposed method, we treat patterns as the same if the content of the pattern documents is the same. However, because it is difficult to determine automatically whether the content of two documents is the same, for the purposes of these experiments we assumed that if the pattern names were the same, the patterns were also the same. Adopting this approach has two problems: (1) if there are patterns that are the same but have different names, they will be treated as different patterns, and (2) if there are patterns that are different but have the same name, they will be treated as the same pattern. In fact, however, most patterns are given a specific name that is particular to the problem area to which the pattern applies, so there are very few cases where patterns have the same name but different content.
Among the 131 pattern documents used in demonstrations 1 and 2, there were 23 groups of identical patterns provided on multiple Web sites, and in all cases where the name was the same, the content was also the same. In light of this, treating patterns as the same when their names are the same gives an easy and practical implementation, and it can be expected to be reasonably correct due to the characteristics of pattern names. This approach was used in demonstrations 1 and 2.

Demo 1: Assuming a scenario where a system developer requires patterns that can be used in designing GUI windows, we entered the search query "window" into the proposed system. The query was run against the set of 131 pattern documents in the repository, gathered from pattern catalogs on four Web sites [2, 5, 6, 18]. The pattern document set contains four pattern documents with the search query ("window"), and also two documents that do not contain the search query but can be considered useful for GUI window design (Command pattern [2, 5], and Extensibility pattern [18]). The importance calculation was done using a stopping threshold, s, of $10^{-8}$, a correction distribution ratio, r, of 0.15, and a pattern catalog distribution ratio, c,

of [value missing]. The result is shown in Table 3. The execution time was 156 ms. Table 3 shows that patterns typically used in GUI window design, such as the Decorator pattern, are ranked highly by the method. For example, the Decorator pattern can be used when developing visual GUI components like scroll bars and frames. A few patterns related to network transport windows, such as the Receive Protocol Handler pattern, also appeared in the results, but these received a lower rank because they had few references. We think the Abstract Factory pattern ranks lower because most of the patterns referencing it do not match the "window" search query.

From the demonstration results, we can say that the proposed system can present the multiple-pattern result of a search query in a somewhat meaningful order without using any semantic information. Without the ordering of results provided by the PatternRank method, users could end up starting their investigations with patterns like Transmit Protocol Handler or Dll Hell, which have low importance. The proposed system is promising as a way to improve software development efficiency by supporting the use of design patterns.

Demo 2: In this demonstration, we assumed a scenario where techniques for using memory efficiently in embedded software are being studied, and a search is done on the proposed system using "memory" as the query. The query was run against the set of 131 pattern documents in the repository, gathered from pattern catalogs on four Web sites [2, 5, 6, 18]. The pattern document set contains nine pattern documents containing the search query ("memory"), and also one document that does not contain the search query but can be considered useful for improving the efficiency of memory use in embedded software (Resource Manager pattern [6]).
The importance calculation was done using a stopping threshold, s, of $10^{-8}$, a correction distribution ratio, r, of 0.15, and a pattern catalog distribution ratio, c, of [value missing]. The result is shown in Table 4. The execution time was 141 ms. In Table 4, several design patterns related to using memory efficiently, such as the Lazy Evaluation, Proxy, and Flyweight patterns, appear with high rank. For example, Virtual Proxy within the Proxy pattern is able to reduce memory use by not creating objects until they are needed. The Flyweight pattern also saves memory by sharing it between objects of the same type that are used often. The Proxy pattern is referenced by many other patterns, such as the Adapter and Decorator patterns, so it appears with high rank. We think the Proxy pattern ranks lower than it otherwise might because many of the patterns that reference it are not in the set of patterns matching the "memory" search query.

From the experiment results, the proposed system can present multiple patterns resulting from a user's search query in a somewhat meaningful order. Without the ordering of results provided by the PatternRank method, users could end up learning about less important patterns first. The proposed system is promising as a way to improve the efficiency of learning development methods by supporting the study of design patterns.

5. Related Work

Markus et al. have proposed a pattern search system that is limited to the field of security design [15]. By limiting the domain of the system, additional search conditions specialized to the domain could be added.
However, our system is not limited to any particular domain and can handle any pattern.

Kinashi et al. have proposed a tool that accepts a requirements analysis description and presents applicable design patterns to the user based on it [8]. With this tool, it is necessary to manually prepare requirements analysis description fragments and matching rules ahead of time for the design patterns being handled. Our proposed system, by contrast, can handle any pattern described in a pattern document, and beyond gathering the pattern documents, no additional manual work is required.

The archetypal pattern repository on the Web is the Portland Pattern Repository [1]. It has a simple keyword-search function allowing entry of search queries, but the results are listed in order of pattern name, so the order is essentially random from the perspective of the usefulness of the patterns.

The PatternRank method makes use of relationships between patterns to calculate an importance value for each element, drawing upon the concepts used in the Page Rank method [11], which calculates the importance of Web pages, and the Component Rank method [7], which does so for software components. The Page Rank method uses the principle that a Web page referenced by an important Web page is also important. The Component Rank method uses the principle that components that are used often, or are used by important components, are also important [7]. The Page
