Enabling Far-Edge Analytics: Performance Profiling of Frequent Pattern Mining Algorithms


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI.9/ACCESS.27.


IEEE Access, JOURNAL OF LATEX CLASS FILES, VOL. 6, NO., JANUARY 27

Enabling Far-Edge Analytics: Performance Profiling of Frequent Pattern Mining Algorithms

Khubaib Amjad Alam, Rodina Ahmad, and Kwangman Ko

Abstract: Far-edge analytics refers to the enablement of data mining algorithms in far-edge mobile devices that are part of Mobile Edge Cloud Computing (MECC) systems. Far-edge analytics enables data reduction in mobile environments, hence reducing the data transfer rate and bandwidth utilization cost for mobile-edge communication. In addition, far-edge analytics facilitates local knowledge availability to enable personalized mobile data stream mining applications. Existing literature mainly addresses classification and clustering problems in far-edge mobile devices, but the problem of Frequent Pattern Mining (FPM) remains unexplored. This article presents the results of an experimental study on the performance profiling of frequent pattern mining algorithms. We developed a real mobile application for performance analysis and profiling of 21 FPM algorithms with various real datasets in terms of execution time, storage complexity, sparsity, density, and dataset size. According to the experimental results, large datasets with high sparsity increase the computational and storage cost in far-edge mobile devices. To address these issues, we propose a framework and discuss the relevant research challenges for seamless execution of FPM algorithms in MECC systems.

Keywords: association rules; data mining; far-edge analytics; frequent itemsets; mobile cloud computing

I. INTRODUCTION

Mobile edge cloud computing is an emerging research area in mobile cloud computing [].
MECC systems facilitate the provision of conventional cloud computing services through edge servers that include cloudlets, micro data centers, and smart routers, amongst many others [2], [3], [4], [5], [6], [7]. Edge servers provide networking, computing, and storage services at one-hop wireless distances from mobile devices, such as smartphones, wearable devices, mobile Internet of Things (IoT) devices, and body sensor networks [8], [9]. In the present study, these mobile devices are referred to as far-edge mobile devices. Edge servers help minimize latency and prolong the battery lifetime of far-edge mobile devices. However, continuous data transfers to edge servers increase bandwidth utilization cost and dependency on Internet connections. Alternatively, the increasing computational power of far-edge mobile devices and the close proximity of on-board data sources to computational resources such as memory and CPU reduce latency. In addition, utilizing on-board computational resources in mobile devices reduces raw data transmission, hence minimizing bandwidth utilization costs.

Khubaib Amjad Alam and Rodina Ahmad are associated with the Department of Software Engineering, University of Malaya, Kuala Lumpur, 563, Malaysia (e-mails: khubaibalam@siswa.um.edu.my, rodina@um.edu.my). Kwangman Ko is associated with the Department of Computer Engineering, School of Computer and Information Engineering, Sangji University, Republic of South Korea (e-mail: kkman@sangji.ac.kr).

Far-edge analytics refers to the provision of knowledge discovery and data mining functionalities in mobile applications through far-edge mobile devices []. A far-edge analytics system, as depicted in Figure 1, is based on a mobile device such as a smartphone that collects and aggregates heterogeneous data from multiple on-board and off-board data sources [], [2]. On-board data sources include sensing elements, e.g., cameras, accelerometers, compasses, microphones, and touch screens.
Similarly, edge analytics systems gather data from device-resident files maintained for resource status information, application logs, and Wi-Fi and Bluetooth scans, to name a few. Conversely, off-board sensing elements include a variety of wearable sensors that gather data to recognize ambulatory activities, monitor physiological bio-markers, and sense external environments. In this study, a smartphone-based far-edge analytics system is envisioned that provides maximum execution support for knowledge discovery and data mining operations using on-board computational resources.

Fig. 1. [Figure not reproduced. Labels: Local Storage, Data Collection, Local DB, Far-Edge Mobile Device, Device Logs, FPM in Far-edge Device, Data Sources, Sensor data stream, User-interactions, Pre-processing, Frequent Pattern Mining, Association Rule Establishment, Application Logs]

This article contributes in multiple ways. A detailed literature review is presented of selected FPM algorithms proposed for mining different types of frequent itemsets, including association rules, rare itemsets, closed itemsets, high utility itemsets, erasable itemsets, relevant and useful itemsets, minimal non-redundant association rules, top-k association rules, and top-k non-redundant association rules. Subsequently, an Android mobile application is developed for performance profiling of the selected algorithms. Experiments were conducted with real datasets from the UCI repository and three mobile devices, including two smartphones and one tablet PC. In addition, the experimental results are analyzed in terms of dataset memory space consumption, execution time, sparsity, density, and size. We selected highly compute-intensive frequent pattern mining algorithms in order to put the maximum computational load on the mobile devices. Considering the results of the batch data mining algorithms, we propose a theoretical framework in Section 5 that not only supports the mining of data streams but also ensures context-aware and adaptive execution of frequent pattern mining algorithms in mobile environments. The key concerns relevant to FPM in far-edge mobile devices are articulated, and a generic framework is proposed for future study in this important research area.

A few highly related works are presented in Section 2. Section 3 comprises a detailed discussion of the selected algorithms along with the data collection and experimental setup details. Section 4 presents the results and discussion, while Section 5 presents a generic framework for handling the revealed issues. The article concludes with Section 6.

(c) 26 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See

II. RELATED WORK

Far-edge analytics systems are required to consider various resource constraints, such as battery power, on-board storage, memory, and the CPU, for successful execution of knowledge discovery and data mining algorithms [3], [2]. These constraints hindered the exploitation of conventional data mining algorithms in earlier systems. Numerous studies focusing on far-edge analytics systems are presented in recent literature.
For example, researchers have proposed StreamAR, which mines sensor data streams using a cluster-based classification approach [4]. In addition, Mobile WEKA, an Android-based general data mining platform, provides numerous clustering, classification, and association-rule mining algorithms [5]. Similarly, Open Mobile Miner (OMM) was developed as an adaptive data mining system that provides a library of lightweight data mining algorithms [6]. Moreover, CARDAP extends OMM to deploy a context-aware real-time data analytics platform by integrating far-edge analytics with fog cloud computing services [7]. To cope with the problem of limited resources, the Pocket Data Mining (PDM) framework was developed as an agent-based data mining framework [8]. PDM utilizes local devices to form a peer-to-peer network and execute data mining tasks collaboratively. The successful development of these far-edge analytics systems is evidence of the adoption of far-edge mobile devices as data mining platforms. Current far-edge analytics systems function either in adaptive mode, which forces a compromise on knowledge pattern quality, or by enabling lightweight and domain-specific algorithms. Therefore, a need has emerged for generic far-edge analytics systems that handle conventional heavy-weight data mining algorithms.

Frequent pattern mining is a key research area in knowledge discovery and data mining. Formally, FPM is applied over a set of items $I = \{i_1, \ldots, i_n\}$ and a set of transactions $T = \{t_1, \ldots, t_n\}$, where each transaction $t \subseteq I$ [9], [2]. A transaction ID (TID) uniquely identifies a transaction in the database. A transaction $t$ contains an itemset $A$ iff $A \subseteq t$. The association rule (AR) $A \Rightarrow B$ over two itemsets $A$ and $B$ exists iff $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ holds over the transactions when the support of $A \cup B$ is at least minsup and the confidence of $A \Rightarrow B$ is at least minconf.
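The support and confidence measures defined above can be illustrated with a minimal sketch. The five-transaction database `TOY_DB` and the function names are hypothetical and chosen only for illustration; they are not taken from the authors' application.

```python
# Hypothetical toy database: five transactions over the items a-e.
TOY_DB = [
    frozenset("acd"),
    frozenset("bce"),
    frozenset("abce"),
    frozenset("be"),
    frozenset("abce"),
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Relative support: fraction of transactions containing `itemset`."""
    if not transactions:
        return 0.0
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    a, b = frozenset(antecedent), frozenset(consequent)
    if a & b:
        raise ValueError("antecedent and consequent must be disjoint")
    count_a = support_count(a, transactions)
    if count_a == 0:
        return 0.0
    # confidence(A => B) = support(A u B) / support(A)
    return support_count(a | b, transactions) / count_a
```

A rule is reported only when both thresholds are met, for example `support("ac", TOY_DB) >= minsup` and `confidence("a", "c", TOY_DB) >= minconf` for user-chosen values of `minsup` and `minconf`.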
Moreover, for a given set of transactions $D$, the minimum confidence (minconf) and minimum support (minsup) thresholds are specified by users, and all rules that satisfy minconf and minsup are generated for $D$. Itemsets and their ARs come in several variants: simple, closed, maximal, rare, sporadic, and utility-based. With $L$ denoting the set of frequent (large) itemsets, the maximal frequent itemsets are $M = \{X \in L \mid \nexists X' \in L,\ X \subset X'\}$, the frequent closed itemsets are $FC = \{C \subseteq I \mid C = h(C) \wedge support(C) \geq minsup\}$, and the maximal closed frequent itemsets are defined as $MC = \{C \in FC \mid \nexists C' \in FC,\ C \subset C'\}$. Here, $h(C)$ represents the closure of $C$ based on the Galois (concept) connection [2]. Most FPM algorithms generate quality knowledge patterns only when the maximum amount of data is kept in memory. With this constraint of FPM algorithms in view, we performed the performance profiling presented in this article.

III. METHODS

A. Algorithms

A variety of FPM algorithms have been proposed in the literature for closed, maximal, rare, sporadic, and utility-based itemsets. Moreover, AR mining algorithms are available to find the direct and indirect associations between the underlying itemsets. Given the large variety of FPM algorithms, we carefully selected 21 classical algorithms, listed in Table I, for performance profiling in strictly mobile environments. The basic reason for selecting the batch data mining approach is the lack of provision of dedicated memory in mobile devices, which is the major requirement of stream mining algorithms. This bottleneck hinders the applicability of mobile devices as data stream mining platforms. Another major reason for selecting batch data mining is to assess the worst-case performance of mobile devices under maximum computing load; data stream mining algorithms, even if deployed, do not impose such a load because they work with a forgetting mechanism and operate only on the data currently at hand. We selected the Apriori and AprioriTid algorithms, which mine itemsets at the basic level and establish ARs later [9].
The mining process is as follows: for every large itemset $A$, find all non-empty subsets of $A$; for every such non-empty subset $B$, output the rule $B \Rightarrow (A - B)$ if the ratio support($A$)/support($B$) $\geq$ minconf. Apriori and AprioriTid are multi-pass algorithms in which candidate items satisfying minsup are found and processed iteratively to generate candidate itemsets. This iterative procedure continues until no large itemset is found. Because any subset of a large itemset is also large, candidate itemsets with $k$ items can be formed by joining large itemsets with $k-1$ items and deleting those candidates that contain a subset that is not large. The Apriori algorithm first counts item occurrences to determine the large itemsets of size one. Each subsequent pass works in two phases. For example, in pass $k$, the apriori-gen function generates the candidate itemsets $C_k$ from the large itemsets $L_{k-1}$ found in the $(k-1)$th pass. Next,

the support of $C_k$ is counted during a database scan; here, the candidates in $C_k$ that are contained in a given transaction must be determined efficiently for fast counting. A subset function based on a hash tree is used to store $C_k$, where the leaf nodes of the tree contain lists of itemsets and the internal nodes store hash tables. The cost of keeping the whole tree in memory is reduced by counting $C_{k+1}$ in the $k$th pass. This strategy works when the cost of scanning the database exceeds the cost of keeping in memory and counting the additional candidates $C'_{k+1} - C_{k+1}$, where $C'_{k+1}$ denotes the candidates generated from $C_k$ rather than from $L_k$. Alternatively, AprioriTid differs from Apriori in that it does not access the database for support counting after the first pass. The candidate itemsets are encoded after the first pass to reduce the size of the data and the reading effort in later passes. AprioriTid generates $C_k$ using the apriori-gen function before the pass begins, but the support is counted without accessing the database. The encoded set contains entries of the form $\langle Tid, \{I_k\} \rangle$, where each $I_k$ represents a large $k$-itemset present in the transaction with identifier Tid. Another optimization of Apriori was proposed in Pascal, which is based on pattern counting inference [22]. The scheme counts the support of some frequent and infrequent itemsets and infers the support of other itemsets from the obtained frequencies. This strategy reduces repetitive database scans. The increasing computational cost of candidate generation for large itemsets, especially with long patterns, led to the development of the FP-growth algorithm [23]. The algorithm is based on a frequent-pattern (FP) tree that stores important information about frequent patterns in compressed form.
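Returning to the level-wise search used by Apriori, the join-and-prune candidate generation and the subsequent rule generation can be sketched as follows. This is a simplified illustration under stated assumptions: it counts support with a plain scan instead of the hash tree described above, and `TOY_DB`, `apriori`, `apriori_gen`, and `generate_rules` are hypothetical names, not the authors' implementation.

```python
from itertools import combinations

# Hypothetical toy database: five transactions over the items a-e.
TOY_DB = [frozenset(t) for t in ("acd", "bce", "abce", "be", "abce")]


def apriori_gen(prev_large, k):
    """Join large (k-1)-itemsets, then prune candidates with a non-large subset."""
    candidates = set()
    for a in prev_large:
        for b in prev_large:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev_large for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all large itemsets."""
    counts = {}
    for t in transactions:  # first pass: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= minsup_count}
    all_large = dict(large)
    k = 2
    while large:
        candidates = apriori_gen(set(large), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one database scan per pass
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c >= minsup_count}
        all_large.update(large)
        k += 1
    return all_large


def generate_rules(all_large, minconf):
    """Output B => (A - B) when support(A) / support(B) >= minconf."""
    rules = []
    for itemset, count in all_large.items():
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                b = frozenset(subset)
                conf = count / all_large[b]
                if conf >= minconf:
                    rules.append((b, itemset - b, conf))
    return rules
```

With `minsup_count = 2`, the toy database yields 15 large itemsets, the largest being `{a, b, c, e}`. The nested loop over candidates and transactions is the part that a hash tree would speed up.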
The FP-Growth algorithm performs efficiently due to a three-step strategy: a) the database is compressed and stored in an FP-tree to avoid multiple database scans in later passes, b) a pattern-growth scheme is used to reduce the cost of candidate generation for long patterns, and c) a divide-and-conquer approach is used to confine the search space and reduce the cost of the level-wise search used in Apriori algorithms. The Recursive Elimination (Relim) algorithm was inspired by FP-growth but works without prefix trees [24]. Here, the items are first counted and compared with the minsup value given by the user and then arranged in ascending order for fast processing. The core of Relim is a recursive function that has a lower computational cost than the Apriori algorithms. The recursive function works by eliminating infrequent items from the transactions and then selecting the transactions that contain the least frequent item. Subsequently, the least frequent item is deleted from the selected transactions, and the procedure starts again until only frequent large itemsets remain. In earlier algorithms, multiple iterations, data skew, and complicated internal data structures led to poor data locality and additional computational and storage requirements. Eclat [25] resolved these issues using three strategies: a) it used a vertical tid-list database format for efficient enumeration, b) it used a lattice/sub-lattice approach for search space decomposition, and c) it decoupled the pattern search from the decomposition problem. Furthermore, each sub-lattice was enumerated using a bottom-up search strategy. Eclat was effective for long itemsets. However, the scalability of Eclat is affected by the high memory requirements of intermediate results. A variant of Eclat based on diffsets, called dEclat, was proposed to handle these issues [26]. The diffsets keep track of the differences between a class member's tidset and its prefix tidset and hence optimize the intersection operation.
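The vertical tid-list layout and the diffset idea can be sketched in a few lines. This is an illustrative depth-first enumeration over tidset intersections, assuming a hypothetical toy database; the names `vertical_layout`, `eclat`, and `diffset` are invented for this sketch and do not come from the cited implementations.

```python
# Hypothetical toy database: five transactions over the items a-e.
TOY_DB = [frozenset(t) for t in ("acd", "bce", "abce", "be", "abce")]


def vertical_layout(transactions):
    """Map each item to the set of transaction IDs (1-based) that contain it."""
    tidsets = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(transactions, minsup_count):
    """Enumerate frequent itemsets by intersecting tidsets depth-first."""
    tidsets = vertical_layout(transactions)
    start = sorted(
        ((i, t) for i, t in tidsets.items() if len(t) >= minsup_count),
        key=lambda pair: pair[0],
    )
    found = {}

    def extend(prefix, candidates):
        for idx, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            found[itemset] = len(tids)  # support = size of the tidset
            narrower = []
            for other, other_tids in candidates[idx + 1:]:
                common = tids & other_tids
                if len(common) >= minsup_count:
                    narrower.append((other, common))
            extend(itemset, narrower)

    extend(frozenset(), start)
    return found


def diffset(prefix_tids, member_tids):
    """Tids that contain the prefix but not the class member."""
    return prefix_tids - member_tids


tids = vertical_layout(TOY_DB)
d_ab = diffset(tids["a"], tids["b"])
# support(ab) = support(a) - |diffset(ab)|, without storing the full tidset of ab
support_ab = len(tids["a"]) - len(d_ab)
```

Storing `d_ab` instead of the tidset of `ab` is the memory saving that dEclat exploits when tidsets are long and differ only slightly from their prefix.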
However, dEclat was not suitable for very long pattern mining due to the intensive memory and computation required to keep the diffsets. We selected AClose for frequent closed itemset mining [2]. AClose limits the search space to the closed itemset lattice instead of the subset lattice. This approach helps limit the number of ARs without information loss and is very useful for densely correlated data. Although the FCs found this way determine the exact frequencies, the issue of traversing all itemsets efficiently was solved by the Charm algorithm [27]. Charm uses an IT-tree (Itemset-Tidset tree) to explore both the itemset and transaction spaces at the same time. The traversals are made using a hybrid search method that identifies FCs but skips many levels, in contrast to earlier search methods that traversed every level and all possible subsets. The algorithm also uses a hash-based strategy to remove non-closed itemsets. Although Charm used diffsets to compute intermediate results, linear scalability was still an issue. This challenge of optimizing intermediate results was addressed by dCharm, which eliminated branches and established relationships among several diffsets for itemset growth [26]. Furthermore, we selected DefMe to mine minimal patterns, which are another form of condensed patterns. DefMe uses critical objects for fast minimality checking in depth-first search, but redundancy is an issue that needs to be addressed. Minimal non-redundant rules are lossless and informative. These rules are the best choice among rules with the same information and the same support. Zart, based on Pascal, was introduced to mine non-redundant ARs [28]. The algorithm works in three steps: it a) finds frequent itemsets and marks frequent generators, b) filters FCs, and c) associates generators with their closures. The strength of Zart is its ability to track generators, i.e., itemsets that have the same support as their closure but whose proper subsets all have a different support.
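The notions of closure, closed itemset, and generator used by AClose, Charm, and Zart can be made concrete with a small sketch. It uses the standard definition of the closure $h(C)$ as the intersection of all transactions containing $C$; the toy database and the function names are hypothetical.

```python
# Hypothetical toy database: five transactions over the items a-e.
TOY_DB = [frozenset(t) for t in ("acd", "bce", "abce", "be", "abce")]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def closure(itemset, transactions):
    """h(C): the items shared by every transaction that contains C."""
    items = frozenset(itemset)
    covering = [t for t in transactions if items <= t]
    if not covering:
        # By convention, an unsupported itemset closes to the full item universe.
        return frozenset().union(*transactions)
    return frozenset.intersection(*covering)


def is_closed(itemset, transactions):
    """C is closed iff C = h(C)."""
    return closure(itemset, transactions) == frozenset(itemset)


def is_generator(itemset, transactions):
    """True iff no subset with one item removed has the same support."""
    items = frozenset(itemset)
    count = support_count(items, transactions)
    return all(
        support_count(items - {item}, transactions) != count for item in items
    )
```

In the toy database, `{a}` is a generator whose closure is `{a, c}`: both appear in exactly three transactions, so a rule base only needs to keep the pair (generator `{a}`, closed itemset `{a, c}`).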
Most of the literature covers frequent pattern mining techniques, which highlights the need to investigate rare pattern mining algorithms that are very useful in some specific application areas. One example is a health-related application that mines rare symptoms of a particular disease from Electronic Health Records (EHRs). The AprioriRare algorithm was selected to assess the performance of such algorithms in mobile environments [29]. The algorithm works in two steps: it a) traverses the frequent itemsets and b) lists the rare itemsets. The algorithm has the potential limitation of storing all rare itemsets, which could be more challenging during the establishment of ARs. Hence, the need for a compact storage scheme arises. Finding rare items and frequent patterns simultaneously causes the rare item problem, which occurs for two reasons: a) usually one minsup value is set for all itemsets on the assumption that all frequent and rare itemsets comply with the chosen threshold, which is rarely the case in real-world applications, and b) minsup has to be set very low, which causes large candidate sets and needs exceptionally large memory. Hence, we selected MISApriori [3], which handles the rare item problem by setting multiple minsup values to find frequent and rare itemsets at the same time. Moreover, the literature reports that

CFPGrowth is more efficient than MISApriori [3]. Although it could be deployed in mobile environments, its input requirements are an issue: all itemsets must be ordered, and the minimum support of each item must be explicitly defined in a separate file. These two bottlenecks prevented us from selecting this algorithm for automated analysis in our test application. Another exceptional case is sporadic rule mining, where some rules have very low support but high confidence. The minsup needs to be set very low in the Apriori algorithm to mine sporadic rules. We selected AprioriInverse to handle such situations [32].

TABLE I. SELECTED ALGORITHMS

No.  Algorithm        Objective                     Year  Authors
1    Apriori          ARs                           1994  R. Agrawal and R. Srikant
2    AprioriTid       ARs                           1994  R. Agrawal and R. Srikant
3    FP-Growth        ARs                           2000  J. Han et al.
4    Relim            Itemsets                      2005  C. Borgelt
5    Eclat            Itemsets                      2000  M. J. Zaki
6    dEclat           Itemsets                      2003  M. J. Zaki and K. Gouda
7    Charm            ARs                           2002  M. J. Zaki and C. J. Hsiao
8    DefMe            Itemsets                      2014  A. Soulet and F. Rioult
9    Pascal           Itemsets                      2000  Y. Bastide et al.
10   Zart             Minimal non-redundant ARs     2006  L. Szathmary
11   AprioriRare      Rare Itemsets                 2007  L. Szathmary
12   Apriori Inverse  Sporadic ARs                  2005  Y. S. Koh and N. Rountree
13   U-Apriori        Itemsets                      2007  C. K. Chui et al.
14   Two-Phase        High Utility Itemsets         2005  Y. Liu et al.
15   VME              Erasable Itemsets             2010  Z. Dong and X. Xu
16   MISApriori       ARs                           1999  B. Liu et al.
17   IGB              Relevant and Useful ARs       24    Gh. Gasmi et al.
18   MNR              Minimal Non-redundant ARs     1998  M. Kryszkiewicz
19   AClose           Closed Itemsets               1999  N. Pasquier et al.
20   TopKRules        Top-K ARs                     2012  P. Fournier-Viger et al.
21   TNR              Top-K Non-redundant ARs       2012  P. Fournier-Viger and V. S. Tseng
AprioriInverse works by setting a maximum support (maxsup) in place of minsup and ignores all itemsets above the maxsup threshold. The algorithm is flexible enough to mine two kinds of sporadic rules: a) perfectly sporadic rules, whose items have support strictly below maxsup, and b) imperfectly sporadic rules, whose items have a flexible support value that may be greater than maxsup. In some cases, AprioriInverse is not able to find imperfectly sporadic rules. The process of frequent itemset mining depends entirely on support values, which is not an issue in complete datasets with prior information. Defining support values is challenging for uncertain datasets without prior information. Hence, an expected support value is used to handle the probabilistic nature of the data. We selected U-Apriori, which was proposed to find frequent itemsets in uncertain data [33]. Unlike Apriori, which increments the support count by one, the algorithm increments the support count by the product of the existential probabilities of all items $a \in A$. Furthermore, the algorithm handles insignificant candidate support using a trimming technique that removes all items with low existential probabilities, which reduces the size of the dataset. High utility itemsets are another form of frequent patterns, discovered to find the associations between high-value (weighted) items. For example, in m-commerce settings, the items in a store that generate the most revenue are considered high utility items. The Two-Phase algorithm addresses such problems [34]. In the first phase, the algorithm defines the utilization weight of each transaction and maintains the downward closure property to find candidate sets that combine only items with a high utilization weight. The utility of some low utility itemsets is overestimated in this phase, but no itemset is underestimated. In the second phase, an extra database scan is performed and the overestimated itemsets are filtered out.
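The two quantities described above, expected support over uncertain data and the transaction-weighted overestimate used in the first phase of Two-Phase, can be sketched as follows. The data layouts (one dictionary per transaction, mapping items to existential probabilities or to utilities) and all names are assumptions made for this illustration.

```python
from math import prod

# Hypothetical uncertain database: item -> existential probability.
UNCERTAIN_DB = [
    {"a": 0.5, "b": 0.4},
    {"a": 1.0},
    {"a": 0.2, "b": 0.5},
]

# Hypothetical utility database: item -> utility of that item in the transaction.
UTILITY_DB = [
    {"a": 5, "b": 10},
    {"a": 2, "c": 6},
    {"b": 4, "c": 3},
]


def expected_support(itemset, uncertain_db):
    """Sum over transactions of the product of the items' probabilities."""
    total = 0.0
    for t in uncertain_db:
        if all(item in t for item in itemset):
            total += prod(t[item] for item in itemset)
    return total


def utility(itemset, utility_db):
    """Exact utility: utilities of the itemset's items in supporting transactions."""
    return sum(
        sum(t[item] for item in itemset)
        for t in utility_db
        if all(item in t for item in itemset)
    )


def transaction_weighted_utility(itemset, utility_db):
    """Overestimate: full utility of every transaction containing the itemset."""
    return sum(
        sum(t.values())
        for t in utility_db
        if all(item in t for item in itemset)
    )
```

Because the transaction-weighted value never falls below the exact utility, pruning on it cannot discard a high utility itemset; the second database scan then removes the overestimated candidates.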
The Two-Phase algorithm tends to be more costly than other frequent itemset mining algorithms because of the multiple database scans and the construction of the high-utilization weight model for each transaction. A more efficient algorithm, FHM, is reported in the literature, but it is outside the scope of this study [35]. Conversely, erasable itemsets contain frequent items with low utility in specific scenarios. Mining and managing erasable itemsets can increase the efficiency of the overall system. For example, mining infrequent components and maintaining their stock in the manufacturing industry helps in managing the overall budget. We selected VME as an alternative to META, which is less efficient and scans the database repeatedly [36], [37]. In addition, the inability to automatically prune irrelevant data is another weakness of META. The VME algorithm tracks the ids of the products in an itemset using the PDI list, which is a new data representation method. VME uses union operators on product ids and prunes irrelevant data automatically. In addition to the itemset mining algorithms, we selected six AR mining algorithms: FP-Growth, IGB, AprioriInverse, MNR, TopKRules, and TNR. FP-Growth, as discussed earlier, uses the minsup and minconf thresholds to mine frequent itemsets and generate ARs. The usefulness and relevance of ARs become key attributes when mining frequent patterns from large datasets. We selected the two-step IGB (Informative and Generic Basis of Association Rules) algorithm [38]. In the first step, IGB finds closed itemsets and their associated generators using the Zart algorithm, and in the second step it generates ARs. The IGB set of ARs is the set of associations of the form $A \Rightarrow (B - A)$, where $A$ is a minimal generator of $B$, and $B$ is a closed itemset having

a support greater than or equal to minsup, and the confidence of the rule is greater than or equal to minconf. Similarly, to find perfectly sporadic ARs, we selected AprioriInverse, discussed earlier [32]. AprioriInverse generates an AR $A \Rightarrow B$ if its confidence is at least minconf and the support of every non-empty subset of $A \cup B$ is lower than the maxsup value. AprioriInverse works in two steps: a) perfectly sporadic itemsets are mined using the minsup and maxsup values, and b) perfectly sporadic ARs are generated by applying minconf to the itemsets found in the first step. The size of the underlying dataset plays a critical role in AR discovery. As with large candidate itemsets and long frequent itemsets, the problem of a large number of ARs also arises in large datasets. Representative association rules (RR) were introduced to address this issue [39]. The RRs form the smallest set of rules that covers all ARs. Here, a cover operator was introduced to generate those ARs that are not RRs. The cover operator works without database scans. The selected algorithm, Minimal Non-redundant Rules (MNR), selects a compact and lossless set of minimal non-redundant ARs. It first finds closed itemsets and their associated generators using the Zart algorithm. Then, it generates the minimal non-redundant ARs. Finally, we selected two algorithms, TopKRules and TNR, to mine top-k ARs [4], [4]. The user sets a value of k instead of a minsup value, and the TopKRules algorithm generates the top-k ARs. The algorithm starts with minsup = 0 and the user-given minconf value. TopKRules then raises minsup to the support value of the lowest-ranked AR among the top-k rules found so far and prunes the search space. The algorithm traverses the new search space and repeats the process.
It sets the new minsup and prunes the search space. The process repeats until no candidate rule remains for the top-k ARs. Although the TopKRules algorithm is very effective, it may generate redundant ARs. Hence, we selected TNR, which removes redundancy and returns non-redundant top-k ARs. The reported performance analysis shows that TopKRules performs more efficiently than TNR. After carefully selecting the above algorithms, we selected relevant datasets. The details of these datasets are presented in the next subsection.

B. Data collection

The collection of relevant datasets for performance profiling of the studied algorithms is of prime importance. We selected nine datasets related to Market Basket Analysis. The details of the datasets and their attributes (size, number of transactions (Tx), sparsity, density, minimum acceptable minsup, candidate counts, and itemset counts at the minimum minsup threshold) are presented in Table II. The datasets were downloaded from the sequential pattern mining framework library [42] and the UCI machine learning repository [43]. The selected datasets are of two types: a) test datasets and b) real datasets. The test datasets are small datasets that are reduced variants of datasets used in the studied algorithms. In addition, two real datasets, Mushroom and Retail, were selected in this study. Mushroom is a real-world dataset with 8124 transactions, and Retail contains 8863 transactions. The purpose of selecting these datasets is to compare the performance of the studied algorithms on medium and large datasets. We used ContextPasquier99 to test Apriori, AprioriTid, FP-Growth, Relim, Eclat, dEclat, AClose, and Charm. Similarly, ContextZart was used to test DefMe, Pascal, Zart, AprioriRare, Closed ARs, and MNRs. U-Apriori was tested with ContextUncertain, Two-Phase with DB Utility, and VME with ContextVME. Moreover, ContextInverse was used to test AprioriInverse.
Further, ContextIGB was used to test MISApriori, FP-Growth ARs, IGB, TopKRules, and TNR. Finally, all algorithms were tested using the real datasets, Mushroom and Retail, as discussed earlier.

C. Aims and Objectives

To investigate how the size of datasets affects the performance of FPM algorithms in mobile devices.

To find out how the performance of FPM algorithms in mobile devices is affected by the density and sparsity of different datasets.

To find out how the space requirements of FPM algorithms deviate in mobile devices.

To determine how the execution time of FPM algorithms deviates when tested with different datasets in mobile devices.

D. Experimental setup

The availability of devices in different configurations with fine-grained computing power is the key requirement for testing the application in different environments. Hence, we selected three mobile devices (two smartphones and one tablet): Samsung Galaxy S4, Lenovo A396, and Asus Fonepad. The detailed specifications of these devices are presented in Table III. We developed a mobile application (see Figure 2) and tested it rigorously for the performance profiling of FPM algorithms in different mobile environments. We prepared 1323 test cases [21 (algorithms) x 21 (input minsup values, at an interval of 0.05 in the range 0.0 to 1.0) x 3 (devices)], and each test case was repeated five times to obtain the average value because the performance results of the algorithms varied in each iteration. In addition, the minconf for all relevant AR mining algorithms was set to 0.5.

IV. RESULTS AND DISCUSSION

In this section, a set of experiments is designed to evaluate the impact of dataset size, sparsity, and density on the performance of FPM algorithms. Similarly, a set of experiments examines the deviation in the space requirements and execution time of FPM algorithms in mobile devices. First, the impact of dataset size on the performance of FPM algorithms is evaluated.
Dataset size plays a critical role in far-edge analytics systems because of computational constraints. In this investigation, the FPM algorithms executed seamlessly and provided maximum results when tested with small datasets. The effect of dataset size is studied in terms of four attributes: a) the lowest minsup value supported by all FPM algorithms in the study, b) the size of the tree generated for traversals, c) the number of



candidate items counted during intermediate computations, and d) the maximum number of itemsets counted at the least minsup threshold value. To limit the experimentation, the results were obtained from the Apriori algorithm, which is reported to be the least efficient in terms of candidate generation, itemset counting, and tree size. The results of the experiment are depicted in Figure 3.

TABLE II. SELECTED DATASETS
Dataset      Size   Tx    Sparsity  Density  Minsup  Cand. Counts  Itemsets Counts
CPasquier99  small  5     Medium    Medium           3             3
CZart        small  5     Medium    Medium           3             3
CInverse     small  5     Medium    Medium           3             3
CUncertain   small  4     Medium    Medium           3             3
DB Utility   small  5     Medium    Medium
CVME         small  6     Medium    Medium
CIGB         small  6     Medium    Medium           3             3
Mushroom     small  824   high      low
Retail       small  8863  low       high

TABLE III. SELECTED DEVICES
Specification     Samsung       Lenovo         ASUS
Model             GT-8552       A39            Fonepad
Processor         ARM-v7a       Dual Core GHz  Intel Atom Z242
Storage           8GB           4GB            8GB
Available Memory  845MB         52MB           GB
Operating System  Android 4..2  Android 4.     Android 4.

Fig. 2. Screenshot of Performance Analyzer

[Figure: Effects of small scale datasets. Plotted series: Candidate Items, Itemsets, Tree Size.]

The experiments showed that all algorithms worked well with small datasets when least minsup =., which is a good sign for adopting mobile devices as data mining platforms for heavy-weight FPM algorithms. Another effect of the datasets is the size of the intermediate tree generated to search candidate items and itemsets. The tree size varied for different datasets. For example, the tree did not expand with DB Utility because the utility value of each item is already given within each transaction, which reduced the need for candidate items.
In contrast, it grew to 6 levels for ContextPasquier99, ContextZart, ContextUncertain, and ContextIGB, and to 2 levels when processing ContextVME. It can be concluded that the nature of the data is also important for the seamless execution of FPM algorithms. In addition, the maximum numbers of candidate items and itemsets depend on the minsup value. In this case, 3 candidate items were counted in ContextPasquier99, ContextZart, ContextUncertain, and ContextIGB. The results were different with DB Utility and ContextVME, with 27 and 247 items respectively. Finally, the itemset counts matched the candidate counts, i.e. 3 (ContextPasquier99, ContextZart, ContextUncertain, and ContextIGB), 27 (DB Utility) and 247 (ContextVME).

Similarly, the experimental results with a medium-sized dataset like Mushroom, as presented in Table IV, exhibited good performance. Although most of the data was processed seamlessly, compromises were needed on the mobile devices in some situations. For example, Samsung S4 supported seamless execution with least minsup =.35, Lenovo A396 worked best with least minsup =.3, and Asus Fonepad supported the FPM algorithms with least minsup =.25. Moreover, the number of candidate items and itemsets is huge owing to the underlying nature of the dataset. The numbers of candidate items and corresponding itemsets were 597 and 5393 for Samsung S4, 294 and 2587 for Lenovo, and 382 and 2 for Asus Fonepad.

Finally, for the Retail dataset, most of the algorithms underperformed because of the extensive memory and processor requirements of mining large datasets. In this case, eight algorithms managed to process the whole dataset successfully even with a lesser minsup

value (i.e. .5) compared with at least .25 for Mushroom. These algorithms are Apriori, FPGrowth, Relim, Zart, AprioriRare, MNR, MISApriori, and IGB. Similarly, the numbers of candidate items (24) and itemsets (6) were lower in this experiment. This raises a question about the nature of the dataset: even algorithms that completely process a dataset with an exceptionally large number of transactions produce only a small number of candidate items and itemsets. Hence, there may be other underlying characteristics of the dataset that directly affect the overall results.

Fig. 4. Effect of Density and Sparsity - CP99 (plotted series: Candidate Items, Itemsets, Tree Size)
Fig. 5. Effect of Density and Sparsity - Mushroom (plotted series: Candidate Items, Itemsets, Tree Size)

The second experiment analyzes the impact of density and sparsity on the performance of FPM algorithms in mobile devices. The investigation examined the effect of density and sparsity on the overall results for different datasets. Three datasets, ContextPasquier99, Mushroom, and Retail, were selected in this case. ContextPasquier99 has 5 attributes and 5 transactions, Mushroom has 22 attributes and 824 transactions, and Retail contains 8863 transactions with 6 attributes. Generally, density represents the percentage of populated cells and sparsity represents the percentage of non-populated cells in the dataset. We measured the density and sparsity using Equation 1 and Equation 2, where d and s represent density and sparsity respectively, N is the total number of possible items in the dataset, and n is the number of populated items. The results of these calculations are presented in Table V.
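The density and sparsity measures defined in this paragraph can be computed directly from a transaction dataset. The sketch below assumes the percentage form d = (n / N) x 100 and s = 100 - d, and it interprets N as the number of cells (transactions times attributes) and n as the number of populated cells. That interpretation, the function name `density_sparsity`, and the toy data are assumptions made for illustration.

```python
def density_sparsity(transactions, num_attributes):
    """Return (density %, sparsity %) of a transaction dataset.

    Assumed interpretation:
      N = total number of cells (transactions x attributes)
      n = number of populated cells (items actually present)
    """
    total_cells = len(transactions) * num_attributes       # N
    populated = sum(len(set(t)) for t in transactions)      # n
    d = populated / total_cells * 100
    s = 100 - d
    return d, s


# Toy dataset: 4 transactions over 4 possible attributes (a, b, c, d).
data = [{"a", "b", "c"}, {"a"}, {"b", "c"}, {"a", "b"}]
d, s = density_sparsity(data, num_attributes=4)
print(d, s)  # 50.0 50.0 (8 populated cells out of 16)
```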
d = (n / N) \times 100    (1)

s = 100 - d    (2)

The effects of density and sparsity were measured in terms of a) tree size, b) number of candidate items, and c) number of itemsets. First, the tree size in ContextPasquier99, as depicted in Figure 4, remained small: the tree grew to 6 levels with the least minsup (i.e..) and 3 levels with the most minsup (i.e..8). The growth trend reveals that the tree size was initially higher but started decreasing as the minsup values increased. Similarly, the numbers of candidate items and resultant itemsets showed the same trend. The maximum number of items and itemsets discovered using ContextPasquier99 remained 3 with the least minsup (.). In contrast, the minimum numbers of items and itemsets were 6 and 4 respectively with the most minsup (.8). The overall effect on tree size, items, and itemsets with different minsup values can be seen in Figure 4.

[Figure: Effect of Density and Sparsity - Retail. Plotted series: Candidate Items, Itemsets, Tree Size.]

Second, the experiment with Mushroom revealed that the algorithms started discovering items and itemsets when the least minsup was .25. Initially, the tree size, plotted on the secondary axis in Figure 5, grew to 2 levels because of the sparse nature of the dataset (i.e. 8 %) and its large range of attributes (i.e. 22). The tree size gradually decreased, with average growth up to 6 levels, as the minsup values increased, and it came down to 4 levels when the most minsup was .95. Similarly, 597 items and 5393 itemsets were discovered when the least minsup was set to .25, but on average 79 items and 689 itemsets were discovered using Mushroom. Moreover, the discovered items and itemsets with the most minsup (i.e. .95) were reported as 7. The growth trend of all three parameters can be seen in Figure 5.

Finally, the experiment with the Retail dataset revealed that the tree size grew to 4 levels with least minsup =.5 but decreased to 2 levels when the most minsup was set to .55.
The effect of high density (i.e %) and the minimum number of attributes (i.e. 6) is apparent in the average growth, which remained between 2 and 3 levels. Similarly, candidate items and itemsets averaged 6.8 and 4.36 respectively. The results reported with the most minsup (i.e..55) showed that the tree size grew to 2 levels

while only 2 candidate items and itemsets were discovered. The overall parametric trend of the experiment with the Retail dataset is shown in Figure 6. After carefully studying the outcomes of all three experiments, we can infer that the sparsity and density of a dataset significantly impact the performance of FPM algorithms when deployed in mobile environments.

TABLE IV. EFFECTS OF MEDIUM AND LARGE SIZE DATASETS
Columns: Parameter, Samsung S4, Lenovo A396, ASUS Fonepad
Mushroom Dataset: Least Minsup; Tree Size 8 2; Candidate Items; Itemsets
Retail Dataset: Least Minsup; Tree Size; Candidate Items; Itemsets

TABLE V. DENSITY AND SPARSITY OF DATASETS
Dataset          N  n  d  s
ContextPasquier        %  36%
Mushroom               %  8%
Retail                 %  2.73%

The third experiment examines the deviation in space requirements of FPM algorithms in mobile devices. Space complexity is especially important because of the limited storage and memory resources in mobile devices. The intermediate result generation and internal data structures of the algorithms may lead to erroneous execution of data mining algorithms. The experiments were performed to uncover the space consumption trends of FPM algorithms in mobile environments. Again, the tests were performed over small, medium, and large-sized datasets. The acquired space consumption trends are depicted in Figures 7, 8 and 9.
[Figure: Space complexity - small dataset. Axes: Minsup (%) vs. Space Complexity (MBs).]
[Figure: Space complexity - medium dataset. Axes: Minsup (%) vs. Space Complexity (MBs).]
[Figure: Space complexity - large dataset. Axes: Minsup (%) vs. Space Complexity (MBs).]

The results for small-sized datasets showed that most of the algorithms consumed 3±3 MB of average disk space with a maximum offset of 726KB during complete execution. Because the variation between iterations was this small, we considered only the average space consumption. However, seven multi-scan algorithms behaved slightly differently and stopped consuming space when there were no candidate items. These are Apriori, Pascal, Zart, AprioriRare, MNR, UApriori, and AprioriInverse. The rest of the algorithms consumed space for all minsup values because they generate candidates once to avoid repetitive database scans. The space complexity of the medium-sized dataset (i.e. Mushroom) differs from that of the small-sized datasets. The relatively larger

amount of data and the candidate generation requirements of the algorithms prevented most of the algorithms from finding candidate items and itemsets at small minsup values. The performance analysis reveals that all algorithms worked best with a least threshold of 4% minsup. The average space consumption of all algorithms in the minsup range of 4% and above remained between 5±3 MB of average disk space, but FPGrowth and Eclat varied in performance. Here, FPGrowth consumed less space, on average 2±2MB, and Eclat consumed 29±4 MB of average disk space. It should be noted that the overall offset in individual experiments remained less than MB; hence the average values were rounded off to the nearest integer for easy representation and interpretation. Because of the overall effect of sparsity, the results with minsup values below 4% varied. Here, Charm is the only algorithm that discovered itemsets even when the least minsup was %, but in this case it consumed a relatively large amount of disk space, i.e. 46.2MB. Similarly, DefMe consumed 4.5MB at 5%, Eclat consumed 5.5MB at 5%, ClosedAR consumed 5.27MB at 2%, and FPGrowthAR consumed 3.35MB of disk space with a 3% minsup threshold. In addition, five algorithms, Apriori, FPGrowth, Eclat, AprioriRare, and AprioriInverse, consumed 3.6MB, 3.8MB, 34.2MB, 6.7MB, and 3.5MB respectively with a 25% least minsup threshold. Four algorithms, AprioriTid, Relim, Zart, and MNR, did not run with a least minsup below the 4% threshold.

Further analysis of the large-sized dataset, as depicted in Figure 9, gives a different view. The dataset was denser; hence the algorithms consumed resources with a least minsup threshold of 5%.
In addition, the data distribution of the underlying dataset made it possible to discover itemsets within the range of 5% (least minsup) to 55% (most minsup). The average space consumption of all algorithms varied a little. The detailed analysis is summarized in Table VI, which gives the range of space consumption and the maximum offset during each experiment set. The working minsup range in which the algorithms consumed memory is also highlighted in Table VI. The overall analysis of the large-sized dataset showed that space consumption for five algorithms remained 24±3MB with an average maximum offset of 436KB. These five algorithms are Apriori, Relim, Zart, MNR, and IGB. The minimum average space was consumed by FPGrowth, at 2±2MB with an offset of 783KB. In contrast, two algorithms, MISApriori and AprioriRare, consumed 32±.5MB and 43±2MB respectively. In addition, it was noted that FPGrowth was still consuming resources even when there were no candidate itemsets. Hence, it appears that multiple database scans lead to memory efficiency, which is an essential requirement in mobile devices. Although multiple database scans could lead to optimal space performance, their effect on overall time complexity is a prime issue. Hence, further investigations were made to find this effect.

Finally, the last experiment examines the deviation in execution time of FPM algorithms in mobile devices. Limited memory and storage space are prime considerations for the deployment of data mining algorithms in mobile environments. Considering these issues, a formal analysis of the average execution time of all algorithms was made. Again, the results are reported for small, medium, and large-sized datasets. It should be noted that the results shown in Figures-, - and - 2 are presented with different time scales. Here the execution time for small datasets is presented in milliseconds (ms).
In contrast, the execution time for medium and large-sized datasets is measured in seconds (s).

[Figure: Execution Time - small dataset. Axes: Minsup (%) vs. Time Consumption (Milliseconds).]
[Figure: Execution Time - medium dataset. Axes: Minsup (%) vs. Time Consumption (Seconds).]
[Figure: Execution Time - large dataset. Axes: Minsup (%) vs. Time Consumption (Seconds).]

TABLE VI. SPACE CONSUMPTION OF FPM ALGORITHMS USING RETAIL DATASET
Algorithm    Average space consumed (MB)  Maximum offset  Minsup range (%)
Apriori      22±4                         44KB            5-55
FPGrowth     2±2                          783KB           -
Relim        25±4                         98KB            5-55
Zart         24±3                         472KB           5-55
AprioriRare  43±2                         3KB             5-55
MNR          23±4                         834KB           5-55
MISApriori   32±.5                        .2MB            5-55
IGB          24±                          335KB           5-55

The experimental results for small datasets show that all algorithms executed seamlessly without any extra time requirements. The trend shows that the maximum average execution time was 33ms, which is negligible. The overall average execution time of all algorithms remained 5.7ms. Most of the algorithms took 6±3ms on average, which does not affect the overall performance of the algorithms. The details of the average execution time of all algorithms are presented in Table VII. In addition, the maximum offset from the average execution time is presented to assess the time consumption trade-off.

In contrast, data size and extensive memory requirements for intermediate computations prolong the overall execution time for medium and large-sized datasets. In the case of the Mushroom dataset, all executed algorithms except Eclat, Charm, and Two-Phase took 5±3 seconds with an average maximum offset of 3 seconds. The average execution time of the Eclat, Charm, Two-Phase, and MISApriori algorithms remained.25, 6.9, 27.7, and seconds respectively. The results showed that Two-Phase performed worst in this case, with an average execution time of 27.7 and an offset of 6.32 seconds. It should be noted that VME and UApriori failed to execute with the Mushroom dataset because of the different nature of the dataset. In addition, most of the algorithms failed to execute with the large dataset. However, the average execution time of the algorithms that did run remained 5±3 seconds with an average maximum offset of 3 seconds.
The results shown in Table VII indicate that, although execution was seamless on average, the size of the dataset affected the overall performance of the mobile devices.

A. Recommendations

The significance of mobile devices as data mining platforms is attributed to intermediate data generation, input dataset size, and search strategies in FPM algorithms. These attributes affect the overall time and space complexity of FPM algorithms. In view of the performance profiling results, it can be concluded that far-edge mobile devices can act as data mining platforms for small and medium-sized datasets, but they require some basic parameter tuning for large-sized datasets. Hence, three parameter tuning strategies are proposed to meet the challenge of large-sized dataset mining: a) chunked data analysis, b) context-based knowledge discovery, and c) periodic data analysis. The optimal strategy must ensure seamless maximum data processing with minimal latency.

Dividing large datasets into manageable data chunks may be very useful in this regard. The data mining application needs to index all data chunks and integrate the resultant patterns intelligently to aggregate the final results. Formally, a large dataset D can be divided equally into N partitions P, as expressed in Equations 3 and 4:

D = \bigcup_{i=1}^{N} P_i    (3)

where each

P_i = D / N    (4)

In this case, the data management overhead may prolong the overall execution time, but very large datasets can be handled easily using this approach.

The on-board sensors in mobile devices allow gathering different contextual information relating to locations, activities, device usage, and device (dis)charging patterns. Scheduling data mining tasks based on contextual information may help achieve the seamless execution of FPM algorithms in mobile devices.
For example, location-related information can enable recognizing frequently visited places like the home and office. Such location information coupled with device-usage patterns may be convenient for scheduling data mining tasks when the device is not being utilized. Hence, constructing a model that can predict suitable times for scheduling data mining tasks can play a significant role. For example, the model can predict when a person is busy having a meal, sleeping, working in the office, driving on a highway, or exercising in a local park. In addition, information about when a phone is charging, locked, or not in use can also be very helpful for devising a data mining strategy for mobile devices. Finally, the data mining tasks can be scheduled periodically by taking dataset portions without considering contextual information.

V. A FRAMEWORK FOR FREQUENT PATTERN MINING IN MECC SYSTEMS

Considering the above recommendations, most FPM algorithm execution can be performed locally without the need for remote data processing systems like servers and cloud services. Nevertheless, we propose a framework for frequent pattern mining in MECC environments (Figure 3). This framework facilitates six types of operations, namely data stream acquisition, data preprocessing, data fusion, data management, frequent pattern mining and summarization, and context-aware computation offloading.

A. Data Stream Acquisition: The performance profiling results revealed that dataset size plays a vital role in the seamless execution of FPM algorithms.

AVERAGE EXECUTION TIME OF FPM ALGORITHMS Algorithm Small-size datasets Mushroom Retail Time (ms) Offset (ms) Time (s) Offset (s) Time (s) Offset (s) Apriori 5.22 3. 6.8.76 8. 3.39 AprioriTid 5.99 2.

38 failed Pascal 5.72 4.65 6.9 6.89 failed Zart 8. 2.45 5.99 3.52 3.63.54 AprioriRare 3.8 4.3 6.88 3.84 2.6 6.34 AClose.2 6.48 8.5 3.49 failed UApriori 6. 6.2 failed failed Two-Phase.4 2.2 27.7 6.


Therefore, the input dataset volume needs to be handled carefully. In the case of live data streams, the velocity of incoming data is another key consideration. To handle the volume and velocity aspects, the datasets and data streams are divided into equal-sized data chunks, and traditional sliding window-based methods are used to traverse the data files. The sizes of the sliding windows vary according to the FPM algorithms selected and the required minsup and minconf threshold values. However, sliding window expansion yields better results, as most algorithms prefer keeping maximum data in memory to avoid frequent database scans and input file reads.

TABLE VII. AVERAGE EXECUTION TIME OF FPM ALGORITHMS
Columns: Algorithm; Small-size datasets: Time (ms), Offset (ms); Mushroom: Time (s), Offset (s); Retail: Time (s), Offset (s)
Apriori
AprioriTid      Retail: failed
FPGrowth
Relim
Eclat           Retail: failed
declat          Retail: failed
Charm           Retail: failed
DefMe           Retail: failed
Pascal          Retail: failed
Zart
AprioriRare
AClose          Retail: failed
UApriori        Mushroom: failed; Retail: failed
Two-Phase       Retail: failed
VME             Mushroom: failed; Retail: failed
AprioriInverse  Retail: failed
MISApriori
IGB
MNR

Fig. 3. Proposed Framework for FPM in MECC System. (Labelled components: Data Streams, Data Preprocessing, Data Fusion, Data Management, Frequent Pattern Mining and Summarization, and Context Aware Computation Offloading, spanning the Far-edge Mobile Environment and the Edge-Cloud Environment with Edge Servers, Cloudlet, Micro Cloud, and Infrastructure Based Cloud, connected through a Wireless Hub, Smart Router, and the Internet.)

B. Data Preprocessing and data fusion: Dataset density and sparsity are also critical in the performance maximization of FPM algorithms, as discussed in Section IV. In addition, live sensor data collection may introduce noisy data streams due to false sensor calibrations (e.g.
radio signal interference from other devices or systems in the surroundings) and improper sensor placement (e.g. sensor position, number, and orientation). Therefore, each dataset and/or data stream is preprocessed separately to obtain useful, noise-free, and highly dense data chunks.
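The chunked analysis recommended in Section IV-A, and the equal-sized data chunks used during data stream acquisition, leave open how the patterns from individual chunks are merged. One well-known option, shown here only as a sketch, is the two-pass partition scheme (often called SON): mine each chunk at the same relative minsup, take the union of locally frequent itemsets as candidates, then recount the candidates over the full data. The paper does not specify this scheme; it is swapped in for illustration, and `frequent_itemsets`, `chunks`, and `mine_in_chunks` are hypothetical helper names.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Level-wise (Apriori-style) search; returns {itemset: support count}."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    result = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: k for c, k in counts.items() if k / n >= minsup}
        result.update(level)
        # Join step: merge frequent k-itemsets that differ by exactly one item.
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2)
                        if len(a | b) == len(a) + 1})
    return result


def chunks(transactions, size):
    """Yield consecutive chunks of at most `size` transactions."""
    for start in range(0, len(transactions), size):
        yield transactions[start:start + size]


def mine_in_chunks(transactions, minsup, chunk_size):
    # Pass 1: locally frequent itemsets of every chunk become global candidates.
    candidates = set()
    for chunk in chunks(transactions, chunk_size):
        candidates |= set(frequent_itemsets(chunk, minsup))
    # Pass 2: recount the candidates on the full data, keep the globally frequent.
    total = len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: k for c, k in counts.items() if k / total >= minsup}


data = [frozenset(t) for t in (
    {"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}, {"c"})]
chunked = mine_in_chunks(data, minsup=0.5, chunk_size=2)
whole = frequent_itemsets(data, minsup=0.5)
print(chunked == whole)
```

Only one chunk needs to be held in memory during the first pass, at the price of a second pass over the data, which mirrors the data management overhead mentioned in the recommendations.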

C. Data Fusion: Data fusion strategies enable integrating preprocessed data chunks from multiple data sources. However, the strategies vary according to the nature of the algorithms and the functionalities of the mobile application. A thorough analysis revealed that the data structures and data processing behavior of FPM algorithms significantly impact their performance. Multi-scan algorithms like Apriori are memory efficient but compute-intensive. Conversely, AprioriTid reads the database only once, which lowers its computational requirements but increases its memory requirements. Therefore, data fusion strategies are designed keeping in view the processing behavior and internal data structures of FPM algorithms. The second main consideration for data fusion is the core functionality provided by mobile applications, i.e. the type of data source (static datasets for batch processing or live data streams for stream processing) and the data they produce (structured, unstructured, or semi-structured). FPM algorithms may need to be re-configured to accommodate the required data types. The variability in live data streams is another factor that needs attention. This variability arises from the diverse sampling rates of different sensors in mobile applications. An example is a mobile application that finds the association between frequent physical activities and locations visited by a mobile user. The accelerometer used for activity recognition is usually sampled at over 25 readings per second. However, the user's location does not change so frequently, so data fusion from these two data sources is a challenging task. The complexity gradually increases as additional data sources are involved in mobile applications.

D.
Data Management: MECC systems involve massive heterogeneity in terms of mobile device processing power, bandwidth availability for different communication interfaces, and Internet connection availability for communications between far-edge mobile devices and edge servers. Therefore, the optimal execution of real-time mobile applications becomes an NP-hard problem. The data management of datasets and data streams in transient data stores helps cope with this problem. The data streams are stored in multiple data files and processed whenever adequate resources are available, either in far-edge mobile devices or in edge-cloud environments.

E. Frequent Pattern Mining and Summarization: The framework supports FPM algorithm execution in three modes. Far-edge mobile devices serve as the primary platform for data mining; however, because of the limited resource availability and high computational requirements of some FPM algorithms, the data files are otherwise processed in either edge servers or back-end cloud infrastructures. The multi-stage processing of FPM algorithms requires efficient summarization strategies for pattern aggregation and overall knowledge management. Therefore, the pattern summarization operations are performed in a back-end cloud environment, and knowledge synchronization is performed between far-edge mobile devices and edge-cloud infrastructures.

F. Context Aware Computation Offloading: The framework is designed to support context-aware computation offloading for the seamless execution of FPM algorithms in MECC systems. To achieve optimal load-balancing and dynamic offloading, deep-context models must identify fine-grained situations and determine the adequate execution mode for FPM algorithms.
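As a rough illustration of what an execution-mode decision could look like, the sketch below replaces the deep-context model with a few hand-written threshold rules over a small context record. It is not the framework's model: the `Context` fields, the `choose_mode` function, and the thresholds `mem_factor` and `min_battery` are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FAR_EDGE = "far-edge device"
    EDGE_SERVER = "edge server"
    CLOUD = "back-end cloud"


@dataclass
class Context:
    free_memory_mb: float
    battery_pct: float
    charging: bool
    edge_reachable: bool
    internet_available: bool


def choose_mode(ctx, chunk_mb, mem_factor=3.0, min_battery=30.0):
    """Pick an execution mode for one data chunk using simple threshold rules."""
    # Assumption: mining needs roughly `mem_factor` times the chunk size in memory.
    fits_locally = chunk_mb * mem_factor <= ctx.free_memory_mb
    has_power = ctx.charging or ctx.battery_pct >= min_battery
    if fits_locally and has_power:
        return Mode.FAR_EDGE
    if ctx.edge_reachable:
        return Mode.EDGE_SERVER
    if ctx.internet_available:
        return Mode.CLOUD
    # No remote option is available: fall back to local processing.
    return Mode.FAR_EDGE


idle_phone = Context(500, 80, False, True, True)
low_memory = Context(20, 80, False, True, True)
offline_edge = Context(20, 10, False, False, True)
print(choose_mode(idle_phone, 10).value)
print(choose_mode(low_memory, 10).value)
print(choose_mode(offline_edge, 10).value)
```

A learned context model would replace the fixed thresholds with predictions from usage, charging, and connectivity histories, but the output, one of the three execution modes, would stay the same.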
The context model assists with determining the correlation between far-edge mobile devices and their resource availability, usage behavior, (dis)charging patterns and movement patterns, the sequential patterns of Internet connection availability, the maximum size of data chunks, and the execution histories of FPM algorithms. Although context management in MECC systems helps in devising optimal execution strategies, the design of such a context model is challenging.

So far, we have carried out the performance profiling of FPM algorithms, articulated the relevant challenges, and proposed a framework to address the challenges discovered in this significant research area. However, the proposed framework still needs further evaluation.

VI. CONCLUSION

Research on far-edge analytics is still in its early stages. Deploying data mining algorithms in far-edge devices, such as smartphones, wearable devices, wireless body area networks, and mobile IoTs, can help reduce the massive data streams that are being transmitted to cloud data centers. This data reduction in far-edge mobile devices lowers the data transfer and bandwidth utilization costs in MECC systems. However, the performance profiling of FPM algorithms revealed that the size and sparsity of large datasets hamper the performance of far-edge mobile devices. To address these problems, we proposed three performance tuning strategies and a framework for the execution of FPM algorithms in MECC systems. In the future, we will continue this research by developing real-world case studies for the proposed framework. The main focus of our research will be on performing real-time analytics on mobile streaming data in MECC systems.

ACKNOWLEDGMENT

This research has been funded by the Bright Spark Unit, University of Malaya (BSP-632-3).

REFERENCES

[] A. V. Dastjerdi and R. Buyya, Fog computing: Helping the internet of things realize its potential, Computer, vol. 49, no. 8, pp. 2 6, 26.
[2] F. Bonomi, R. Milito, J. Zhu, and S.
Addepalli, Fog computing and its role in the internet of things, in Proceedings of the first edition of the MCC workshop on Mobile cloud computing. ACM, 22, pp
[3] M. H. Rehman, C. S. Liew, T. Y. Wah, and M. K. Khan, Towards next-generation heterogeneous mobile data stream mining applications: Opportunities, challenges, and future research directions, Journal of Network and Computer Applications, vol. 79, pp. 24, 27. [Online]. Available: //
[4] G. I. Klas, Fog computing and mobile edge cloud gain momentum open fog consortium, etsi mec and cloudlets, 25. [Online]. Available:

13 This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI.9/ACCESS , IEEE Access JOURNAL OF L A TEX CLASS FILES, VOL. 6, NO., JANUARY 27 3 [5] T. H. Luan, L. Gao, Z. Li, Y. Xiang, and L. Sun, Fog computing: Focusing on mobile users at the edge, arxiv preprint arxiv:52.85, 25. [6] K. Ha and M. Satyanarayanan, Openstack++ for cloudlet deployment, School of Computer Science Carnegie Mellon University Pittsburgh, 25. [7] K. A. ALAM and R. AHMAD, A hybrid fuzzy multi-criteria decision model for cloud service selection and importance degree of component services in service compositions, in Uncertainty Modelling in Knowledge Engineering and Decision Making: Proceedings of the 2th International FLINS Conference (FLINS 26), vol.. World Scientific, 26, p [8] H. Gupta, A. V. Dastjerdi, S. K. Ghosh, and R. Buyya, ifogsim: A toolkit for modeling and simulation of resource management techniques in internet of things, edge and fog computing environments, arxiv preprint arxiv:66.27, 26. [9] H. Dubey, J. Yang, N. Constant, A. M. Amiri, Q. Yang, and K. Makodiya, Fog data: Enhancing telehealth big data through fog computing, in Proceedings of the ASE BigData & SocialInformatics 25. ACM, 25, p. 4. [] M. H. Rehman, C. S. Liew, T. Y. Wah, J. Shuja, and B. Daghighi, Mining personal data using smartphones and wearable devices: A survey, Sensors, vol. 5, no. 2, pp , 25. [] O. D. Lara and M. A. Labrador, A survey on human activity recognition using wearable sensors, Communications Surveys & Tutorials, IEEE, vol. 5, no. 3, pp , 23. [2] M. H. Rehman, C. S. Liew, and T. Y. Wah, Uniminer: Towards a unified framework for data mining, in Information and Communication Technologies (WICT), 24 Fourth World Congress on. IEEE, 24, pp [3] M. M. Gaber, J. Gama, S. Krishnaswamy, J. B. Gomes, and F. 
Stahl, Data stream mining in ubiquitous environments: state-of-the-art and current directions, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 2, pp. 6 38, 24. [4] Z. S. Abdallah, M. M. Gaber, B. Srinivasan, and S. Krishnaswamy, Streamar: incremental and active learning with evolving sensory data for activity recognition, in Tools with Artificial Intelligence (ICTAI), 22 IEEE 24th International Conference on, vol.. IEEE, 22, pp [5] P. Liu, Y. Chen, W. Tang, and Q. Yue, Mobile weka as data mining tool on android, in Advances in Electrical Engineering and Automation. Springer, 22, pp [6] S. Krishnaswamy, M. Gaber, M. Harbach, C. Hugues, A. Sinha, B. Gillick, P. Haghighi, and A. Zaslavsky, Open mobile miner: a toolkit for mobile data stream mining, ACM KDD9, 29. [7] P. P. Jayaraman, J. B. Gomes, H. L. Nguyen, Z. S. Abdallah, S. Krishnaswamy, and A. Zaslavsky, Cardap: A scalable energy-efficient context aware distributed mobile data analytics platform for the fog, in Advances in Databases and Information Systems. Springer, 24, pp [8] F. Stahl, M. M. Gaber, M. Bramer, and P. S. Yu, Pocket data mining: Towards collaborative data mining in mobile computing environments, in Tools with Artificial Intelligence (ICTAI), 2 22nd IEEE International Conference on, vol. 2. IEEE, 2, pp [9] R. Agrawal, R. Srikant et al., Fast algorithms for mining association rules, in Proc. 2th int. conf. very large data bases, VLDB, vol. 25, 994, pp [2] M. H. Rehman, C. S. Liew, and T. Y. Wah, Frequent pattern mining in mobile devices: A feasibility study, in Information Technology and Multimedia (ICIMU), 24 International Conference on. IEEE, 24, pp [2] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, Discovering frequent closed itemsets for association rules, in Database Theory ICDT99. Springer, 999, pp [22] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal, Mining frequent patterns with counting inference, ACM SIGKDD Explorations Newsletter, vol. 
2, no. 2, pp , 2. [23] J. Han, J. Pei, Y. Yin, and R. Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data mining and knowledge discovery, vol. 8, no., pp , 24. [24] C. Borgelt, Keeping things simple: Finding frequent item sets by recursive elimination, in Proceedings of the st international workshop on open source data mining: frequent pattern mining implementations. ACM, 25, pp [25] M. J. Zaki, Scalable algorithms for association mining, Knowledge and Data Engineering, IEEE Transactions on, vol. 2, no. 3, pp , 2. [26] M. J. Zaki and K. Gouda, Fast vertical mining using diffsets, in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 23, pp [27] M. J. Zaki and C.-J. Hsiao, Charm: An efficient algorithm for closed itemset mining. in SDM, vol. 2. SIAM, 22, pp [28] L. Szathmary, A. Napoli, S. O. Kuznetsov et al., Zart: A multifunctional itemset mining algorithm, 26. [29] L. Szathmary, A. Napoli, and P. Valtchev, Towards rare itemset mining, in Tools with Artificial Intelligence, 27. ICTAI 27. 9th IEEE International Conference on, vol.. IEEE, 27, pp [3] B. Liu, W. Hsu, and Y. Ma, Mining association rules with multiple minimum supports, in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 999, pp [3] Y.-H. Hu and Y.-L. Chen, Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism, Decision Support Systems, vol. 42, no., pp. 24, 26. [32] Y. S. Koh and N. Rountree, Finding sporadic rules using aprioriinverse, in Advances in Knowledge Discovery and Data Mining. Springer, 25, pp [33] C.-K. Chui, B. Kao, and E. Hung, Mining frequent itemsets from uncertain data, in Advances in knowledge discovery and data mining. Springer, 27, pp [34] Y. Liu, W.-k. Liao, and A. 
Choudhary, A two-phase algorithm for fast discovery of high utility itemsets, in Advances in Knowledge Discovery and Data Mining. Springer, 25, pp [35] P. Fournier-Viger, C.-W. Wu, S. Zida, and V. S. Tseng, Fhm: Faster high-utility itemset mining using estimated utility co-occurrence pruning, in Foundations of Intelligent Systems. Springer, 24, pp [36] Z. Deng and X. A. I. A. E. I. Xu, An efficient algorithm for mining erasable itemsets, in Advanced Data Mining and Applications. Springer, 2, pp [37] Z.-H. Deng, G.-D. Fang, Z.-H. Wang, and X.-R. Xu, Mining erasable itemsets, in Machine Learning and Cybernetics, 29 International Conference on, vol.. IEEE, 29, pp [38] G. Gasmi, S. B. Yahia, E. M. Nguifo, and Y. Slimani, A new informative generic base of association rules, in Advances in Knowledge Discovery and Data Mining. Springer, 25, pp [39] M. Kryszkiewicz, Representative association rules and minimum condition maximum consequence association rules, in Principles of Data Mining and Knowledge Discovery. Springer, 998, pp [4] P. Fournier-Viger, C.-W. Wu, and V. S. Tseng, Mining top-k association rules, in Advances in Artificial Intelligence. Springer, 22, pp [4] P. Fournier-Viger and V. S. Tseng, Mining top-k non-redundant association rules, in Foundations of Intelligent Systems. Springer, 22, pp [42] P. Fournier-Viger, Spmf: A sequential pattern mining framework, URL philippe-fournier-viger. com/spmf, 2. [43] K. Bache and M. Lichman, Uci machine learning repository, URL ics. uci. edu/ml, vol. 9, (c) 26 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See


Khubaib Amjad Alam is a doctoral candidate in the Faculty of Computer Science and Information Technology, University of Malaya, Malaysia, under the prestigious Bright Spark scheme. He received his Master's degree in Computer Science in 23 from COMSATS Institute of Information Technology, Wah Cantt, Pakistan. His research interests include service and cloud computing, decision support systems, software evolution, knowledge engineering, and applications of soft computing methodologies. He has published several articles in high-impact journals and reputed conferences, including Information Systems, Quality and Quantity, IJIM, Energy, ECM, CloudCom, and FLINS. He has also served as an evaluator for numerous SCI-E indexed journals and international conferences, including KBS, IEEE Access, IJITDM, IJWSR, ICMNEE, and FIT. He is a member of ACM SIGSOFT, IEEE TCSE, IEEE TCBIS, and the International Society on MCDM.

Rodina Ahmad (Assoc. Prof. Dr.) is a university lecturer with over 2 years of experience revolving around teaching, researching, supervising, advising, and consulting in universities and governmental agencies. Programming language analysis and software development management are her passion. She has a wealth of academic and cross-industry experience in the areas of enterprise analysis and modelling, software requirements engineering, and computer-assisted education for computer programming. She holds a Doctor of Philosophy degree in Information System Management from the National University of Malaysia and a Master's degree in Computer Science from Rensselaer Polytechnic Institute, USA. She has received outstanding academic grants for her studies, including the National Petronas scholarship and the Sultan Iskandar scholarship. Her research interests are software process improvement, intelligent tutoring systems for programming, enterprise analysis, global software development management, computer-assisted learning, and software requirements engineering. She can be contacted at rodina@um.edu.my.

Kwangman Ko (kkman@sangji.ac.kr) is a professor at the Department of Computer Engineering, School of Computer and Information Engineering, Sang-Ji University, Korea. He received his B.Sc. from Won-Kwang University in 99, and his M.Sc. and Ph.D. in computer engineering from Dong-Guk University, Korea, in 993 and 998, respectively. His research interests include re-targetable tool suites for embedded systems, virtual machines, and architecture description languages.
