Discovery of High Confidence Association Rules in Remotely Sensed Images Without Regard for Support


Stephen Krebsbach, Qiang Ding, William Jockheck, William Perrizo
Stephen.Krebsbach@dsu.edu, {Qiang_Ding, William_Jockheck, William_Perrizo}@ndsu.nodak.edu
Computer Science Department, North Dakota State University, Fargo, ND 8, USA

Abstract. Spatial data mining of Remotely Sensed Images (RSI) has become an important field of research as extremely large amounts of data are being collected from remote sources such as the Landsat satellite Thematic Mapper (TM) and other remote imaging systems. Association Rule Mining (ARM) has become an important method for mining large amounts of data in many areas beyond its originally proposed market-basket domain. The popularity of ARM comes from the well-known a-priori algorithm, which exploits a user-specified minimum support (called minsup). Rules of interest are defined as only those lying within the set of rules that exceed this support level. For the algorithm to work efficiently, rules of interest must be restricted to those that occur frequently. While this restriction enables a-priori based data mining to perform efficiently, it rules out the discovery of an entire class of rules of interest that are pruned for lack of support. Such a class of rules is of interest in applications such as those found in the agricultural domain, where a rule of interest might address early insect infestation: a rule with extremely low support but of extremely high interest to a producer. In this paper, we develop a conceptual decision cube called a P-cube that is derived from a P-tree storage of remotely sensed images. This conceptual P-cube is then used to help develop an efficient algorithm for discovering high confidence rules using a precision-hierarchy approach. This approach discovers high confidence rules without concern for support. The algorithm does not suffer from the computational explosion inherent in the a-priori approach when a low support threshold is considered. This makes it feasible to discover a wide class of rules normally lost to pruning techniques in the name of efficiency.

Keywords: Data Mining, Association Rule Mining, Remotely Sensed Imagery (RSI), P-cube.

1 Patents are pending on the bsq and P-tree technology from which the P-cube is derived.
2 This work is partially supported by NSF Grant OSR-93368, DARPA Grant DAAH and GSA Grant ACT#: K96338.

1. Introduction

Association Rule Mining (ARM) has been one of the more successful models for knowledge discovery in databases (KDD). In association rule mining, the goal is to find all rules that fulfill two constraints, minimum support (minsup) and minimum confidence (minconf). In the paper by Hipp, Ulrich and Nakhaeizadeh [3], many of the classical ARM algorithms are surveyed and compared. Cohen et al [4] point out that much of the success of ARM can be credited to the now well-known a-priori approach developed by Agrawal et al [1,2], which allows for a straightforward pruning technique based on minsup. A-priori includes an efficient way to handle the computational explosion that develops as the number of items increases. This approach has been shown to work very well when rules of interest are those that occur frequently; however, this successful pruning technique becomes less useful when the rules of interest occur very infrequently. Infrequent rules are pruned for lack of support if they fall below minsup. Lowering minsup to capture these infrequent but interesting rules may reintroduce the combinatorial explosion caused by the generation of too many rules. This problem has been referred to as the rare item problem [5].

Different techniques have been proposed to address the deficiencies inherent in the a-priori approach when rules of interest with high confidence are sought regardless of support.
Liu, Hsu, and Ma [6] propose a novel technique that allows the user to assign multiple minimum item supports to reflect the nature of the items and their varied frequencies in the database.

Different rules can then have different minsup constraints. This approach requires the user to understand the nature of an item and how it interacts in the given domain. The usefulness of discovering rules with very high confidence and very little support is addressed in the paper by Cohen et al [4]. They develop a symmetric similarity measure to replace the traditional asymmetric confidence measure and remove the minimum support requirement. A combination of random sampling and hashing algorithms is applied to the problem of finding pairs of words that occur together infrequently, but with high confidence, in news articles. Their results show the usefulness of high confidence, very low support rules. Wang, Zhou, and He present a pruning technique based on minimum confidence as part of their work on turning association rules into classifiers [11]. They develop the Existential Upward Closure property to help prune high-confidence rules.

In this paper, we address the discovery of interesting high-confidence rules in a particular Remotely Sensed Image environment. We develop a precision-hierarchy approach for the discovery of interesting rules using a structure called the P-cube. The P-cube is derived from the basic P-tree data structure [7]. An efficient algorithm is presented that finds all high confidence rules at a given level of value precision and prunes uninteresting rules. In Section 2 we discuss the storage of remotely sensed data and describe the basic Peano Count Tree (P-tree) data structure, which provides a lossless compression of this data in a data-mining-ready format. We develop the concept of the P-cube and show how it can be efficiently derived. The reasoning behind our approach and the algorithm itself are presented in Section 3. A review of our contribution and thoughts on future work are given in Section 4.

2. Remotely Sensed Images and P-trees

Remotely Sensed Images (RSI) are found in several types such as SPOT, TM, AVHRR, TIFF, and others. For our example, we will work with the TM (Thematic Mapper) format. TM scenes use seven reflectance bands: Blue, Green, Red, Reflective-Infrared, Mid-Infrared, Thermal-Infrared, and Mid-Infrared2. Each band holds reflectance values from 0 to 255. A typical TM scene contains 4M pixels, where each pixel has seven values assigned to it, one from each band. Often other ground data is integrated, such as geophysical, radiometric, magnetic, geochemical, mineral occurrence, and lithological data. In precision agriculture, we commonly have access to a map of yield levels for previous harvests. These types of data are commonly displayed using a color legend or gray-scale levels in an image.

2.1 Mining RSI

Several different formats are used for RSI data. A TM scene, for example, uses the Band Sequential (BSQ) format, where each band is stored as a separate file and raster order is used within each band. Pixels of each band are linked by their physical location in the file. This position is related to an actual latitude-longitude that is not stored in the bands but can be derived from the original file header. We can think of a location as being represented by the tuple (lat, long, B1, B2, B3, B4, B5, B6, B7, B8), where lat and long are derived attributes and B1 through B8 hold band reflectance values in the range 0..255. Other ground data can be integrated as new attributes in the tuple. For simplicity in this section, we will include only one new attribute, Y (yield). Assume we are looking for sets of reflectance values in different bands that will imply a particular yield. Desirable rules might be of the form $B_i[v_i] \Rightarrow Y[y]$ or $B_i[v_i] \wedge B_j[v_j] \Rightarrow Y[y]$, where each bracket holds a reflectance (or yield) value.
The antecedent may include several bands, but the consequence will be limited to only one band, the attribute of interest. In many cases, the value for a single band will actually be an interval of contiguous values rather than a single reflectance and will be denoted as B[i..j]. Conceptually, we can envision a rule as a map through band intervals that, if taken, will lead to a particular consequence interval.

Figure 1. Concept of a rule

It is important to note here that a rule antecedent can change not only in the number of bands included but also in the interval size of those bands.

2.2 P-trees and bsq Format

The P-tree storage structure is based on a format called bit-sequential (bsq). Briefly, the bsq format breaks each of the seven TM bands into eight separate files by vertically partitioning the eight bits of each byte used to store the reflectance values. There are several reasons to use the bsq format. First, different bits contribute to the value to different degrees. In some applications, we do not need all the bits because the high order bits give us enough information. Second, the bsq format facilitates the representation of a precision hierarchy. Third, and most importantly, the bsq format facilitates the creation of an efficient, rich data structure, the P-tree, and accommodates algorithm pruning based on a one-bit-at-a-time approach.

2.3 Basic P-trees

Each bit file in the bsq format is stored in a tree structure called a Peano Count Tree (P-tree). A P-tree is a quadrant-based tree. The idea is to recursively divide the entire image into quadrants and record the count of 1-bits for each quadrant, thus forming a quadrant count tree. P-trees are somewhat similar in construction to other data structures in the literature [8,9].
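The bit-sequential split described in Section 2.2 can be illustrated with a minimal sketch. The function names `to_bsq` and `from_bsq` and the tiny 2×2 band are hypothetical, introduced only for illustration; the sketch assumes 8-bit reflectance values stored as nested Python lists and is not the authors' implementation.

```python
def to_bsq(band):
    """Split a band of 8-bit values into eight bit planes (hypothetical helper).

    band: list of rows, each a list of ints in 0..255.
    Returns a list of 8 planes; planes[0] holds the most significant bit.
    """
    planes = []
    for bit in range(7, -1, -1):  # high-order bit first
        planes.append([[(v >> bit) & 1 for v in row] for row in band])
    return planes


def from_bsq(planes, precision=8):
    """Rebuild values from the first `precision` planes (lossless when precision is 8)."""
    rows, cols = len(planes[0]), len(planes[0][0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(precision):
        for r in range(rows):
            for c in range(cols):
                out[r][c] = (out[r][c] << 1) | planes[i][r][c]
    return out


band = [[211, 0], [128, 255]]          # made-up reflectance values
planes = to_bsq(band)
full = from_bsq(planes)                # all eight bits: the original band
coarse = from_bsq(planes, precision=2) # two high-order bits: interval indices 0..3
```

Reading only the first few planes gives a coarser view of the same band, which is the precision hierarchy that the bsq format is meant to expose.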

For example, given an 8-row-8-column image, the P-tree is as shown below.

Figure 2. 8×8 image and its P-tree (P-tree and PM-tree)

In this example, the root node holds the number of 1s in the entire image. This root level is labeled level 0. The numbers at the next level (level 1) are the 1-bit counts for the four major quadrants. Since the first and last quadrants are composed entirely of 1-bits (called pure-1 quadrants), we do not need sub-trees for these two quadrants, so these branches terminate. Similarly, quadrants composed entirely of 0-bits are called pure-0 quadrants, and they also terminate their tree branches. This pattern is continued recursively using the Peano or Z-ordering of the four subquadrants at each new level. Every branch terminates eventually (at the leaf level, each quadrant is a pure quadrant). If we were to expand all subtrees, including those for pure quadrants, then the leaf sequence would be just the Peano-ordering (or Z-ordering) of the original raster image. Thus, we use the name Peano Count Tree (P-tree). This structure provides compression and the embedded information that is needed to do data mining. The performance of this structure is discussed in [].

This mechanism creates eight basic P-trees per band, which can be combined using simple logical operations (AND, NOT, OR, COMPLEMENT) to recover the original data or to produce P-trees at any level of precision for any value or combination of values. For example, we can construct a P-tree (called a Value P-tree) for all occurrences of a given 8-bit value $v = v_1 v_2 \dots v_8$ in band $b$ by ANDing the basic P-trees (for each 1-bit of the value) and their complements (for each 0-bit):

$PC_{b,v} = Q_1\ \text{AND}\ Q_2\ \text{AND}\ \dots\ \text{AND}\ Q_8$, with $Q_i = PC_{b,i}$ if $v_i = 1$ and $Q_i = PC'_{b,i}$ if $v_i = 0$,

where $'$ indicates the bit-complement. The power of this representation is that by simple AND operations we can construct all combinations and permutations of the data, and that the resulting representation has the hierarchical count information embedded to facilitate data mining.

Figure 3. Basic P-trees, Value P-trees (for 3-bit values) and other examples produced using the AND operation: basic (bit) P-trees are ANDed to give value P-trees, which in turn are ANDed to give tuple, range and other P-trees.

The actual implementation of the P-trees has been modified in order to optimize the AND operation. References to the PM-tree (Pure Mask tree) refer to this variation. The PM-tree uses a 3-value logic to represent pure-1, pure-0 and mixed quadrants. Details are available in [7]. Figure 4 shows the average time to perform AND operations [12].

Figure 4: Average time of AND operations
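To make the quadrant-count construction and the value P-tree concrete, here is a minimal sketch. It assumes a square bit image whose side is a power of two and a child order of upper-left, upper-right, lower-left, lower-right for the Z-ordering. The names `build_ptree` and `value_mask` and the 4×4 image of 2-bit values are hypothetical. The sketch ANDs raw bit planes pixel by pixel and then builds the count tree; it does not reproduce the optimized tree-level AND of the PM-tree.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Return (count, children) for a square region; children is None for pure quadrants."""
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count == 0 or count == size * size:   # pure-0 or pure-1: branch terminates
        return (count, None)
    half = size // 2
    offsets = ((0, 0), (0, half), (half, 0), (half, half))  # assumed Z-order
    return (count, [build_ptree(bits, r0 + dr, c0 + dc, half) for dr, dc in offsets])


def value_mask(planes, value_bits):
    """AND each bit plane (for a '1') or its complement (for a '0') into one mask."""
    size = len(planes[0])
    mask = [[1] * size for _ in range(size)]
    for plane, b in zip(planes, value_bits):
        for r in range(size):
            for c in range(size):
                bit = plane[r][c] if b == "1" else 1 - plane[r][c]
                mask[r][c] &= bit
    return mask


# Made-up 4x4 image of 2-bit values and its two bit planes (high bit first).
img = [[3, 3, 0, 1],
       [3, 3, 2, 1],
       [1, 1, 3, 3],
       [1, 0, 3, 3]]
planes = [[[(v >> 1) & 1 for v in row] for row in img],
          [[v & 1 for v in row] for row in img]]

tree_11 = build_ptree(value_mask(planes, "11"))  # value P-tree for binary 11
tree_01 = build_ptree(value_mask(planes, "01"))  # value P-tree for binary 01
```

The first element of each tree is its root count, that is, the number of pixels holding that value; pure quadrants appear as leaves with no children.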

2.4 Extension to the Peano Count Cube (P-cube)

For most spatial data mining, the root counts of the tuple P-trees (e.g., $P_{(v_1,v_2,\dots,v_n)} = P_{1,v_1}\ \text{AND}\ P_{2,v_2}\ \text{AND}\ \dots\ \text{AND}\ P_{n,v_n}$) are the numbers required, since root counts tell us exactly the number of occurrences of that particular pattern over the space in question. These root counts can be inserted into a data cube, which we call the P-tree Count cube (P-cube) of the spatial dataset. Each band corresponds to a dimension of the cube, with the band values labeling that dimension. The P-cube cell at location $(v_1,v_2,\dots,v_n)$ contains the root count of $P_{(v_1,v_2,\dots,v_n)}$. For example, assuming just 3 bands, the $(v_1,v_2,v_3)$th cell of the P-cube contains the root count of $P_{(v_1,v_2,v_3)} = P_{1,v_1}\ \text{AND}\ P_{2,v_2}\ \text{AND}\ P_{3,v_3}$. The cube can be contracted or expanded by going up or down in the value concept hierarchy, or projected (rolled up) onto any smaller dimensionality.

While this may appear to be a major computational operation, it is simply a proposed data warehousing structure to facilitate data mining. There are two possibilities for construction. The first is to construct the cube during warehousing and actually store it. However, since earlier work [] indicates that the AND operation is fast when done in parallel using an array of processors, it may suffice to construct the cube on the fly from the original P-trees at the time the data mining is done. The choice is simply the end user's classic trade-off between speed and data storage. We can envision the P-cube as an n-dimensional data cube (Figure 5). For clarity of the notation to follow, we work in two or three dimensions, but there is no limit to the dimensionality.
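As a rough sketch of the P-cube, the cell counts and a roll-up can be computed as below. This is an illustration under stated assumptions, not the authors' construction: it counts pixel tuples directly rather than ANDing P-trees (the resulting counts are the same root counts), it assumes 8-bit band values, and the names `build_pcube`, `roll_up` and the sample bands are hypothetical.

```python
from collections import Counter


def build_pcube(bands, p):
    """Map each cell (v1, ..., vn) to its pixel count at p-bit precision.

    bands: list of equally long lists of 8-bit values, one list per band.
    """
    shift = 8 - p  # keep only the p high-order bits of every value
    return Counter(tuple(v >> shift for v in pixel) for pixel in zip(*bands))


def roll_up(cube, keep):
    """Project the cube onto the dimensions listed in `keep` by summing counts."""
    out = Counter()
    for cell, n in cube.items():
        out[tuple(cell[d] for d in keep)] += n
    return out


# Made-up data: two bands over six pixels, mined at 2-bit precision.
b1 = [0, 10, 200, 210, 255, 64]
b2 = [5, 130, 130, 140, 250, 70]
cube = build_pcube([b1, b2], p=2)
face = roll_up(cube, keep=(0,))   # counts per interval of the first band only
```

Storing only non-zero cells in a `Counter` is a design choice of this sketch; a dense array would serve equally well at low precision.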

Figure 5: Representation of the P-cube in three dimensions.

Here, the location (0,0,0) has value 0. This means (looking at the lower left corner of the cube) that there are no entries in the data where the values of bands 1, 2 and 3 are all zero. (Here we have used the term bands since the data of interest are the image bands from satellite or similar imagery.) The summation on each face (the number in the lower right) indicates the projection onto that face (roll up). Hence, looking at the upper left of the front face, a cell can show no pixels for a given combination of B1, B2 and B3 values while the face total shows twenty pixels for the corresponding B1 and B2 values. Having constructed this cube (or built it on the fly), we have all the information we need to mine rules from the data. Since all of the data is present, we need not limit ourselves to high support rules. While this structure can be used to locate high support rules, there are a myriad of algorithms for extracting high support rules. Instead, we are intent on locating the high confidence rules without regard to support.

3. Mining approach and algorithm

Mining of high-confidence, low-support association rules has been shown to be challenging but useful in several domains [4, ]. In this section, we present a pruning algorithm (Figure 6) for finding all high-confidence rules in an RSI environment regardless of support. It exploits the precision hierarchy of the band values captured in the P-tree structure. A P-cube of attribute bands can then be built and efficiently mined. The algorithm allows the user to decide on the precision (the size of the band intervals) from one to eight bits. First, the P-cube is generated for all bands of the antecedent at the requested precision; next, all high confidence rules are generated; finally, the rules are pruned to eliminate those of low interest.

3.1 Building the P-cube

The basics of the P-cube have been discussed in Section 2. A complete presentation of P-trees and the building of P-cubes can be found in []. For our example, we assume the P-cube will be built on the fly. In many cases, a complete TM scene will not be of interest to the user. In that case, the P-tree data structure's unique spatial characteristics allow us to quickly segregate the physical region without rescanning the data.

3.2 Generating all high confidence rules

The algorithm finds all high confidence rules at the given precision and confidence level. While generating these rules, the question of band interval sizes of both the antecedent and consequence must be addressed.
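As a hedged illustration of why support plays no role in this step, confidence and support can both be written in terms of P-cube counts. Here $|A[i] \cap C[k]|$ is a P-cube cell count and $|A[i]|$ is the roll-up count over the consequence dimension, following the notation of the paper's own proof. The total pixel count $N$ and the numbers in the worked line are hypothetical, chosen only for illustration.

```latex
\[
\mathrm{Conf}(A[i] \Rightarrow C[k]) = \frac{|A[i] \cap C[k]|}{|A[i]|},
\qquad
\mathrm{Supp}(A[i] \Rightarrow C[k]) = \frac{|A[i] \cap C[k]|}{N}
\]
% Hypothetical counts: |A[i]| = 20, |A[i] \cap C[k]| = 19, N = 1{,}000{,}000
\[
\mathrm{Conf} = \frac{19}{20} = 0.95,
\qquad
\mathrm{Supp} = \frac{19}{1{,}000{,}000} = 1.9 \times 10^{-5}
\]
```

A rule like this would be pruned by any practical minsup, yet its confidence is read directly from two counts in the cube.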

Antecedent size: Once a precision is selected, only antecedent band intervals of that precision need to be considered. If the rule meets or exceeds the confidence threshold, it is accepted. If two high confidence rules have contiguous band intervals, then together they generate a high confidence rule. They can be generalized later. If neither of two rules with contiguous band intervals meets the high confidence level, then combining their intervals cannot lead to a rule of high confidence. For example, assume the low confidence rule $B_1[i] \Rightarrow B_2[k]$ is generated. Theorem 1 proves that no expansion of the antecedent interval (such as $B_1[i..j] \Rightarrow B_2[k]$) will lead to a rule that meets the threshold if the rule $B_1[j] \Rightarrow B_2[k]$ is also of low confidence.

Theorem 1: If the confidences of the rules $A[i] \Rightarrow C[k]$ and $A[j] \Rightarrow C[k]$ (with $A[i] \cap A[j] = \emptyset$) are below the threshold of confidence, then the confidence of the rule $A[i] \cup A[j] \Rightarrow C[k]$ is also below the threshold.

Proof: Let $\theta$ be the threshold of confidence. Let $i$, $j$, $k$ be intervals. $|A[i]|$ is read as the total number of pixels where the value of $A$ is within the interval $i$. We have the preconditions

$\mathrm{Conf}(A[i] \Rightarrow C[k]) = |A[i] \cap C[k]| \,/\, |A[i]| < \theta$ and $\mathrm{Conf}(A[j] \Rightarrow C[k]) = |A[j] \cap C[k]| \,/\, |A[j]| < \theta$.

So, since $A[i] \cap A[j] = \emptyset$,

$\mathrm{Conf}(A[i] \cup A[j] \Rightarrow C[k]) = |(A[i] \cup A[j]) \cap C[k]| \,/\, |A[i] \cup A[j]| = (|A[i] \cap C[k]| + |A[j] \cap C[k]|) \,/\, (|A[i]| + |A[j]|) < (\theta |A[i]| + \theta |A[j]|) \,/\, (|A[i]| + |A[j]|) = \theta$.

Consequence size: Dealing with the consequence interval size is not as straightforward. The rule $B_1[i] \Rightarrow B_2[k]$ may not meet the confidence threshold while the rule $B_1[i] \Rightarrow B_2[k..l]$ could. How do we decide how far to expand the consequence band when looking for high confidence rules? The interval is bounded by the precision on the low end and the complete band on the upper end. To find all high confidence rules, we would allow the interval to grow to its upper bound if necessary. We might, however, want to allow the user to decide how interesting rules with large consequence band intervals are. If the user decides that interesting rules would only have intervals at the size of the precision, the algorithm can reflect that in the rules generated, with resulting efficiencies.

3.3 Algorithm to find all high confidence rules

There are two ways to build the P-cube: one is to build it and then roll up; the other is to build the P-cube on the fly. Our algorithm is based on the second method.

Inputs: ABset = set of all antecedent bands.
        p = the number of bits in the precision hierarchy.
(1) For each band in ABset
      construct value P-trees from basic P-trees using precision p.
    End for
(2) Build the value P-trees for the consequence band using precision p.
(3) For each combination of bands in ABset
    (3a) Build the antecedent + consequence P-cube.
    (3b) Roll up on the antecedent set intervals.
    (3c) For each non-zero antecedent rollup-count interval
           use consequence size combining to generate high-confidence rules (see 3.2).
         End for
    End for

Figure 6: Algorithm for finding high confidence rules

In step 1 we choose only the P-trees of precision p of the bands in ABset to generate the corresponding value P-trees. In the same way, we then generate the value P-trees for the single consequence band (step 2). Step 3 then generates all high confidence rules. To do this we must look at each combination of bands in ABset. For each combination we do the following: build the P-cube whose dimensions are the bands in that combination and the consequence (3a); roll up the antecedent dimensions for each interval to get the rollup counts (3b); and, selecting the non-zero rollup counts, calculate the confidence, identify the high confidence rules, and combine them based on the appropriate consequence size (3c).

3.4 Pruning for Interesting Rules and Reasonable Rule Sets

When discovering rules with high confidence, it is likely that the user will be presented with many uninteresting rules. First, it is important to try to present to the user only those high-confidence rules that are also of interest. To that end, we must develop a definition of interest for bit-precision-based association rules. The bit precision determines the size of the intervals used to generate the P-cube. One bit will divide the reflectance band into 2 intervals, 0-127 and 128-255; two bits into 4 intervals; three bits into 8 intervals; and so on. In the same way, it determines the size of the consequence intervals. Once the bit precision is set, smaller intervals are excluded; however, larger intervals may give more general and possibly more interesting rules to the user.

Precision Based Misleading Rules: Assume that when using 2-bit precision, the following 2 high confidence rules are generated: $B_1[i] \Rightarrow B_2[k]$ and $B_1[j] \Rightarrow B_2[k]$. In this case we would want to generate a more general rule: $B_1[i..j] \Rightarrow B_2[k]$. The question becomes how general; that is, how large can the antecedent band grow before it becomes non-interesting or, more importantly, misleading? Figure 7 shows a 2-bit, 2-band example where three intervals are combined to create a more general, high confidence rule. This is a misleading rule because it has lost precision.
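The steps of Figure 6 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it counts pixel tuples directly instead of ANDing P-trees, it fixes the consequence interval at the size of the precision (the restricted option mentioned in Section 3.2), and the function name `high_confidence_rules`, the band names and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def high_confidence_rules(antecedent_bands, consequence, p, theta):
    """Return (bands, antecedent_intervals, consequence_interval, confidence, count).

    antecedent_bands: dict mapping band name -> list of 8-bit values.
    consequence: list of 8-bit values for the single consequence band.
    """
    shift = 8 - p
    # Steps (1) and (2): reduce every band to p-bit interval indices.
    ab = {name: [v >> shift for v in vals] for name, vals in antecedent_bands.items()}
    cons = [v >> shift for v in consequence]
    names = sorted(ab)
    rules = []
    # Step (3): every non-empty combination of antecedent bands.
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            # (3a) antecedent + consequence cube, keyed by (antecedent tuple, y).
            cube = Counter(zip(zip(*(ab[n] for n in combo)), cons))
            # (3b) roll up over the consequence dimension.
            rollup = Counter()
            for (ante, _), n in cube.items():
                rollup[ante] += n
            # (3c) only non-zero cells exist in the Counter; no support test is applied.
            for (ante, y), n in sorted(cube.items()):
                conf = n / rollup[ante]
                if conf >= theta:
                    rules.append((combo, ante, y, conf, n))
    return rules


# Made-up data: two antecedent bands and a yield band over six pixels.
bands = {"B1": [0, 10, 200, 210, 255, 64],
         "B2": [5, 130, 130, 140, 250, 70]}
yield_band = [0, 0, 200, 210, 100, 64]
rules = high_confidence_rules(bands, yield_band, p=2, theta=0.9)
```

Each returned rule carries its raw count, so rules resting on a single pixel can be marked or set aside later rather than silently dropped.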
If the more general rule is broken back down to the requested precision, a non-confident rule is produced. This leads to a rule for growing antecedent intervals: only contiguous intervals of high confidence within a band should be combined to create a more general rule.

Figure 7: 2-band, 2-bit misleading interval combination.

In addition, it is possible with this approach to produce redundant rules. In any generation of rules, it is desirable to produce a comprehensible number of rules. If too many rules are produced for the user to handle, any useful information may be indistinguishable in the volume. As a result, it may be necessary to identify and eliminate (or suppress) redundant rules. Our approach does not suggest any new methods for handling redundant rules. Instead, we simply refer to []. Using these techniques, rules that are subsets of another rule are suppressed as redundant.

In addition, our method allows for the identification of all rules regardless of support (number of occurrences). Rules based on small occurrence sets may represent noise in the data. It is possible to create a high confidence rule based on a single sample in the data cube. These rules should be marked or segregated so that they are distinguishable from other rules. When the number of rules created is large, it may be convenient to suppress these insignificant rules unless the objective is to locate statistical outliers.

4. Conclusion

Using the concept of the compact, data-mining-ready P-tree construct and its extension into a P-cube, we have presented an approach for locating the elusive high confidence, low support rules. The specifics have focused on the use of remotely sensed data, specifically satellite data, but should be extendable to other formats. By relying on the theorem proved in the paper, we can avoid checking unnecessarily for confidence. The theorem is the confidence equivalent of the support theorem on which a-priori is based, that is, that all subsets of a frequent item set must be frequent. By using this theorem, we have been able to create a reasonable data warehousing approach for spatial data that will allow the location of all rules regardless of their support. Future work will focus on honing the pruning sequence to minimize rule volume. In addition, we hope to construct specific data warehousing software that will employ this data structure and theorem.

References

[1] R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Database, Proceedings of the ACM SIGMOD Conference, 1993.
[2] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, Proceedings of the 20th International Conference on Very Large Databases, 1994.
[3] J. Hipp, G. Ulrich, and G. Nakhaeizadeh, Algorithms for Association Rule Mining: A General Survey and Comparison, ACM SIGKDD, July 2000, Vol. 2, Issue 1.
[4] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang, Finding Interesting Associations without Support Pruning, Proceedings of the 16th Annual IEEE Conference on Data Engineering (ICDE 2000), Feb. 2000.
[5] H. Mannila, Database Methods for Data Mining, KDD-98 tutorial, 1998.
[6] B. Liu, W. Hsu, and Y. Ma, Mining Association Rules with Multiple Minimum Supports, ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99).
[7] William Perrizo, Qin Ding, Qiang Ding and Amalendu Roy, On Mining Satellite and Other Remotely Sensed Images, Proceedings of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, May 2.
[8] H. Samet, The quadtree and related hierarchical data structures, ACM Computing Surveys, 16, 2, 1984.
[9] HH-code. Available at
[10] William Perrizo, Peano Count Tree Technology, Technical Report NDSU-CSOR-TR- -, 2.
[11] Ke Wang, Senqiang Zhou and Yu He, Growing Decision Trees on Support-less Association Rules, KDD 2000, Boston, MA.
[12] Amalendu Roy, Thesis on P-trees and the AND operation, North Dakota State University, 2,
