EFFICIENT MINING OF FAST FREQUENT ITEM SET DISCOVERY FROM STOCK MARKET DATA


Hitesh R Raval (1), Dr. Vikram Kaushik (2)
(1) Research Scholar, Mewar University, Chittorgarh, Rajasthan
(2) Director, Manish Institute of Computer Studies, Visnagar

International Journal of Computer Engineering & Technology (IJCET), IAEME (editor@iaeme.com), Volume 6, Issue 6, June 2015

ABSTRACT
The stock market is a volatile environment. Traditional data analysis techniques and tools help investors manage stocks and predict prices. However, these techniques cannot determine all the possible relations between stocks, so a different approach is needed that provides a deeper kind of analysis. Data mining can be used comprehensively in stock price prediction. In this paper the investigators propose a new approach that uses efficient preprocessing, pruning, and data structure techniques to discover inter-transaction association rules with business intelligence characteristics. The proposed work also offers the financial research community, money managers, fund managers, investors, and others a more in-depth study of the inter-transaction stock price movement of companies.

Keywords: Data Mining, Stock, Inter Transaction, Association Rule, Preprocessing, Pruning.

1. INTRODUCTION
Data mining has played a very important role in application development and research in the area of knowledge discovery over the last decade. Several methods are available in data mining, such as association rule mining, classification, and clustering. Association rule mining [1] is a data mining technique by which related items (patterns) are discovered from a large amount of data. Mining association rules is a two-step process. In the first step, sets of items (candidate item-sets) that occur frequently in the database are found using a known algorithm such as Apriori. The frequency (support) of an item-set is the number of transactions containing all the items of that item-set. A frequent item-set is one whose support is greater than or equal to a predefined parameter, the minimum support (min_sup). In the second step, once the frequent item-sets are found, association rules are derived from them using one more parameter, the minimum confidence (min_conf).

Temporal data mining provides additional capabilities required in cases where the evolution of the existing data and their interactions need to be observed through the time dimension [2]. Temporal data mining can be defined as the activity of looking for interesting correlations or patterns in large sets of temporal data accumulated for other purposes [3]. The main purpose of a stock market is the trading of stocks between investors. Stocks are grouped into industries according to their primary business focus [4]. Each stock is characterized not only by its price but also by many other variables. Dealing with all these variables and determining the outcome requires a deeper kind of study, one that can show the behavior of a stock over time. The main variables are shown in the table below [5][6].

STOCK VARIABLES
Variable        Description
Price           Current price of a stock
Opening Price   Opening price of a stock on a specific trading day
Closing Price   Closing price of a stock on a specific trading day
Volume          Transactions volume (buy / sell)
Change          Percentile opening and closing stock value difference
Change (U)      Percentile opening and closing stock value difference

1.1 Temporal Data Mining
Temporal data mining is concerned with the mining of large sequential data sets. Sequential data means data that is ordered with respect to some index. Temporal data mining provides additional capabilities required in cases where the evolution of the existing data and their interactions need to be observed through the time dimension [7].

1.2 Time Series
A time series is an ordered sequence of data points, measured at successive times spaced at uniform intervals. A large amount of data is collected every day in the form of event time sequences. These sequences are valuable sources of information, not only for searching for a particular value or event at a specific time, but also for analyzing the frequency of certain events, discovering their regularity, or discovering sets of events related by particular temporal relationships [9].

2. THE PRINCIPLE OF THE APRIORI ALGORITHM
The Apriori method is one of the common approaches to mining frequent patterns. In the Apriori algorithm, support is the fraction of entities that consumed the itemsets in any of their possible transactions. After the large itemsets (the itemsets with support greater than the minimum support allowed) are identified, each is translated to an integer, and each sequence is transformed into a new sequence whose elements are the large itemsets of the previous one. The next step is to find the large sequences.
To achieve this, the algorithm acts iteratively, as Apriori does: it first generates the candidate sequences and then selects the large sequences from the candidates, until no candidates remain. Apriori is an effective algorithm for discovering association rules, and adapting it to deal with temporal information leads to several different approaches. Common sub-sequences can be used to derive association rules with predictive value, as is done, for instance, in the analysis of multidimensional time series [9][14]. A possible approach consists of extending the notion of a typical rule X ⇒ Y (if X occurs then Y occurs) to a rule with a new meaning, X ⇒[T] Y (if X occurs then Y will occur within time T). To discover these rules, it is necessary to search for them in a restricted portion of time, since they may occur repeatedly at specific time instants but only in a small portion of the global time considered. One method for discovering such rules is to apply an algorithm similar to Apriori and, once the set of traditional rules is available, detect the cycles behind the rules [14].
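
As a rough illustration of a rule of the form "if X occurs then Y will occur within time T", the sketch below measures how often an occurrence of X is followed by Y within T time units in an event sequence. The function name `temporal_rule_confidence`, the `(time, label)` event encoding, and the sample events are all assumptions made for this example and do not come from the paper.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (time index, event label); hypothetical encoding


def temporal_rule_confidence(events: List[Event], x: str, y: str, t: int) -> float:
    """Fraction of occurrences of x that are followed by y within t time units."""
    x_times = [time for time, label in events if label == x]
    y_times = [time for time, label in events if label == y]
    if not x_times:
        return 0.0
    # An occurrence of x counts as a hit if some y happens strictly after it
    # and no later than t time units afterwards.
    hits = sum(
        1 for tx in x_times
        if any(0 < ty - tx <= t for ty in y_times)
    )
    return hits / len(x_times)


# Made-up event sequence: X at times 1, 5, 10 and Y at times 2, 9, 11.
events = [(1, "X"), (2, "Y"), (5, "X"), (9, "Y"), (10, "X"), (11, "Y")]
confidence = temporal_rule_confidence(events, "X", "Y", 2)
print(confidence)
```

With T = 2, the occurrences of X at times 1 and 10 are followed by Y in time, while the one at time 5 is not, so two of the three occurrences support the rule.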

Definition 1: The support of an item (or set of items) is the number of transactions in which that item (or set of items) occurs.

Consider a set of transactions in a database where each letter corresponds to a product, such as jeans or a T-shirt, and each transaction corresponds to a customer buying some of the products A, B, C, and D. The first step in the Apriori algorithm is to count the support (number of occurrences) of each item separately.

Table 1
Transaction   Items
T1            A, B, C, D
T2            B, C, D
T3            B, C
T4            A, B, D
T5            A, B, C, D
T6            B, D

Table 2
Item   Support
A      3
B      6
C      4
D      5

The items in the transactions of Table 1 have their support listed in Table 2.

Definition 2: The support threshold is a user-defined number; an item (or set of items) meets the threshold when its support is equal to or above it [13]. In this example we use a support threshold of 3. All items in Table 2 meet this condition, since none of them has a support below 3.

Definition 3: Given a set of items I = {I1, I2, ..., In}, an item set is a subset of I [13].

Definition 4: A large item set is an item set whose number of occurrences in the transactions is at or above the support threshold. The notation L indicates the complete set of large item sets [13].

In the example, the complete set of large item sets in this first iteration is L = {A, B, C, D}, since all of these items meet the support threshold. Had any of these items been below the support threshold, it would not have been included in the subsequent steps. The next steps form all pairs, triples, and so on of the items in Table 2. If A had a support below three, all pairs, triples, etc. containing A would also be below the support threshold. This is the fundamental basis of the Apriori algorithm: it allows us to prune every item set that contains an item set below the support threshold, hence reducing the amount of data in each step.
The next step is to form all 2-item sets. This is done by making all possible combinations of the large item sets without regard to order.

Table 3
Item   Support
A,B    3
A,C    2
A,D    3
B,C    4
B,D    5
C,D    3

Table 4
Large Itemsets
A,B
A,D
B,C
B,D
C,D

Table 3 lists the new item sets together with their respective support. The item set {A, C} has a support of only 2, and since the support threshold is 3 it is not a large item set. The next step generates the 3-item sets by joining the large item sets in Table 4 over a common item.

Table 5
Item    Support
A,B,C   2
A,C,D   2
A,B,D   3
B,C,D   3

Table 6
Large Itemsets
A,B,D
B,C,D

The only 3-item sets that fulfill the support threshold are {A, B, D} and {B, C, D}, as listed in Table 6. Continuing this process by joining the large item sets over a common pair gives the last possible combination.

Table 7
Item      Support
A,B,C,D   2
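
The support counts in the worked example can be checked with a short script. The sketch below is a brute-force count over every k-item combination of the transactions in Table 1, so it is not the Apriori candidate generation itself; the helper names `support` and `support_table` are chosen only for this illustration.

```python
from itertools import combinations

# Transactions from Table 1.
transactions = {
    "T1": {"A", "B", "C", "D"},
    "T2": {"B", "C", "D"},
    "T3": {"B", "C"},
    "T4": {"A", "B", "D"},
    "T5": {"A", "B", "C", "D"},
    "T6": {"B", "D"},
}

MIN_SUPPORT = 3  # support threshold used in the example


def support(itemset, transactions):
    """Number of transactions that contain every item of itemset (Definition 1)."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def support_table(k, transactions):
    """Support of every k-item combination of the items seen in the transactions."""
    all_items = sorted(set().union(*transactions.values()))
    return {combo: support(combo, transactions)
            for combo in combinations(all_items, k)}


for k in (1, 2, 3):
    table = support_table(k, transactions)
    large = [combo for combo, count in table.items() if count >= MIN_SUPPORT]
    print(k, table, large)
```

For k = 1, 2, and 3 the printed counts agree with Tables 2, 3, and 5, and the item sets that reach the threshold of 3 at k = 3 are {A, B, D} and {B, C, D}.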

Figure 1. Illustration of the possible combinations of A, B, C, and D, without regard to order, in the Apriori algorithm.

The process of joining items in the Apriori algorithm is illustrated in Figure 1. Note that the position of an item within an item set does not matter; that is, the item set {A, B, D} is regarded as the same as {D, A, B}. To avoid redundancies later in the implementation, the items in each item set are ordered by value. The Apriori algorithm cuts some of the branches of the tree in Figure 1. For example, the item set {A, C} occurred only 2 times, which is below the support threshold of 3. The algorithm makes use of this by not generating any branches from this node, and thus reduces the computational cost. This is, as stated, the foundation of the Apriori algorithm [9].

Algorithm 1. Apriori algorithm [14]
Input:
  I  // Item sets
  D  // Transactions
  s  // Support threshold
Output:
  L  // Large item sets
Apriori algorithm:
  k = 0        // k is used as the scan number
  L = Ø
  C1 = I       // Initial candidates are set to be the items
  repeat
    k = k + 1
    Lk = Ø
    for each Ii ∈ Ck do
      ci = 0   // Initial counts for each item set are 0
    for each tj ∈ D do
      for each Ii ∈ Ck do
        if Ii ⊆ tj then
          ci = ci + 1
    for each Ii ∈ Ck do
      if ci ≥ s then
        Lk = Lk ∪ Ii
    L = L ∪ Lk
    Ck+1 = apriori-gen(Lk)
  until Ck+1 = Ø
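
The level-wise loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `apriori` and `apriori_gen`, the use of `frozenset` for item sets, and the choice to stop when no candidates remain are assumptions of the sketch. The candidate step performs only the join over shared items, without an additional subset-pruning pass.

```python
def apriori_gen(large_prev):
    """Join step: combine two (k-1)-item sets that differ in exactly one item."""
    large_prev = set(large_prev)
    candidates = set()
    for a in large_prev:
        for b in large_prev:
            union = a | b
            # Two sets of size k-1 share k-2 items exactly when their union has size k.
            if len(union) == len(a) + 1:
                candidates.add(union)
    return candidates


def apriori(transactions, min_support):
    """Return {item set: support} for all large item sets (support >= min_support)."""
    transactions = [frozenset(t) for t in transactions]
    # C1: every single item is an initial candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    large = {}
    while candidates:
        # One scan of the transactions per level, counting each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large_k = {c: n for c, n in counts.items() if n >= min_support}
        large.update(large_k)
        candidates = apriori_gen(large_k.keys())
    return large


# Transactions from Table 1, support threshold 3.
table_1 = [
    {"A", "B", "C", "D"}, {"B", "C", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "B", "C", "D"}, {"B", "D"},
]
result = apriori(table_1, 3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the Table 1 data the sketch returns the four single items, the five pairs of Table 4, and the triples {A, B, D} and {B, C, D}; the pair {A, C} is dropped at the second level and never extended.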

Algorithm 2. Apriori-Gen algorithm [13]
Input:
  Li-1  // Large item sets of size i-1
Output:
  Ci    // Candidates of size i
Apriori-Gen algorithm:
  Ci = Ø
  for each I ∈ Li-1 do
    for each J ≠ I ∈ Li-1 do
      if i-2 of the elements in I and J are equal then
        Ci = Ci ∪ {I ∪ J}

2.1 Impact of the Algorithm
Over the past few years, a number of studies have published new algorithms, or improvements on existing algorithms, for frequent itemset mining. Some algorithms require a small amount of memory but heavy disk access (such as Apriori-like algorithms); others need little I/O activity but a large amount of memory (such as FP-growth). However, research papers on the inter-transaction mining problem are still few, since it is a more challenging problem than intra-transaction mining.

3. PROPOSED WORK
Luhr et al. [10] and Tung et al. [11] proposed frameworks that can only discover inter-transaction association rules, whereas others proposed an approach to mine quantitative intra-transaction association rules. To discover quantitative inter-transaction association rules, a new approach is developed that extracts rules from single-dimensional transaction datasets of stock closing price and traded volume. The investigators propose a framework called ITARM for mining inter-transaction association rules from datasets that contain a constant number of items in each transaction. The approach not only predicts the movement of a stock price in either direction, given user-defined minimum support and confidence values, but also predicts the probable variation in stock closing price and traded quantity based on historical data of these attributes. The proposed approach employs an effective preprocessing phase, shown in Figure 2, that reduces overall mining time, together with efficient pruning techniques that eliminate frequent itemsets whose items all occur in the same transaction and those that do not start with the first transaction of a sliding window.
ITARM uses an FP-tree based algorithm because it requires only two scans of the transaction database to construct the tree and uses a prefix-tree structure, which requires less memory.

Figure 2. Process of mining inter-transaction association rules (ITARM)
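
To make the sliding-window idea concrete, the sketch below builds one mega-transaction per window position and tests whether an itemset qualifies as inter-transaction under the two pruning conditions described above. It is only an illustration under stated assumptions: items are tagged with their sub-window number as `(sub_window, item)` tuples rather than with a leading digit, and the item labels, the window size, and the function names `build_mega_transactions` and `is_inter_transaction` are hypothetical.

```python
def build_mega_transactions(transactions, window):
    """Slide a window of the given size over the ordered transactions.

    Each mega-transaction holds (sub_window, item) pairs, where sub_window
    is the 1-based position of the source transaction inside the window.
    """
    megas = []
    for start in range(len(transactions) - window + 1):
        mega = set()
        for offset in range(window):
            for item in transactions[start + offset]:
                mega.add((offset + 1, item))
        megas.append(mega)
    return megas


def is_inter_transaction(itemset):
    """Keep an itemset only if it starts in sub-window 1 and spans more than one sub-window."""
    sub_windows = {sub_window for sub_window, _ in itemset}
    return 1 in sub_windows and len(sub_windows) > 1


# Hypothetical symbolic daily transactions (labels invented for the example).
days = [{"A_up"}, {"B_up"}, {"A_down"}]
megas = build_mega_transactions(days, window=2)
print(megas)

print(is_inter_transaction({(1, "A_up"), (2, "B_up")}))   # spans sub-windows 1 and 2
print(is_inter_transaction({(1, "A_up"), (1, "B_up")}))   # same sub-window only
print(is_inter_transaction({(2, "B_up"), (3, "A_down")})) # does not start in sub-window 1
```

Any frequent-itemset miner can then be run on the mega-transactions, with `is_inter_transaction` applied as a filter on the patterns it returns.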

4. THE PROPOSED ALGORITHM (FP-TREE CONSTRUCTION AND PRUNING)
Input: A transaction database D and a minimum support threshold minsup
Output: Its frequent pattern tree, FP-tree
Method: The FP-tree is constructed in the following steps.
1. Scan the transaction database D once. Collect the set of frequent items F and their supports. Sort F in descending order of support as L, the list of frequent items.
2. Create the root of an FP-tree, T, and label it null. For each transaction Trans in D, do the following. Select the frequent items in Trans and sort them according to the order of L. Let the sorted frequent item list in Trans be [p | P], where p is the first element and P is the remaining list. Call insert_tree([p | P], T). The function insert_tree([p | P], T) is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, let its count be 1, link its parent link to T, and link its node-link to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
3. Sort (P, size of p). If the first digit of the sorted frequent items is 1, then store the itemset in the hash table; otherwise skip the itemset, because it does not satisfy the inter-transaction criterion. After tree construction is over, sort the frequent mega-transaction itemsets by their sub-window number. If a sorted frequent itemset has sub-window number 1, then store it in the hash table; otherwise skip it, because it does not satisfy the inter-transaction criterion.
Step 3 is the modification to the standard FP-tree algorithm that incorporates frequent inter-transaction mining.

Algorithm (FP-growth: Mining frequent inter-transaction patterns with FP-tree by pattern fragment growth and pruning)
Input: Constructed FP-tree, using transaction database D and a minimum support threshold minsup
Output: The complete set of inter-transaction frequent patterns
Method: Call FP-growth (FP-tree, null)
Procedure FP-growth (Tree, α)
1. if Tree contains a single path P
2. then for each combination (denoted as β) of the nodes in the path P do
3.     generate pattern β ∪ α with support = minimum support of nodes in β;
4. else for each ai in the header of Tree do {
5.     generate pattern β = ai ∪ α with support = ai.support;
6.     construct β's conditional pattern base and then β's conditional FP-tree Tree_β;
7.     if Tree_β ≠ Ø
8.     then call FP-growth (Tree_β, β)
9.     if all items in Tree_β start with the same sub-window number, then the pattern is an intra-transaction rule; prune such instances, otherwise keep them }
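
The tree construction in steps 1 and 2 of the method above can be sketched as follows. This is a minimal illustration of a standard FP-tree build, not the authors' ITARM code: it omits the inter-transaction pruning of step 3 and the FP-growth procedure, and the names `FPNode` and `build_fp_tree`, the tie-breaking by item name, and the header-table layout are assumptions of the sketch.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its links."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}      # item -> child FPNode
        self.node_link = None   # next node carrying the same item


def build_fp_tree(transactions, min_support):
    # Scan 1: count item supports and keep the frequent items.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: n for item, n in counts.items() if n >= min_support}
    # Order by descending support; ties broken by item name (an assumption).
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: position for position, item in enumerate(order)}

    root = FPNode(None, None)
    header = {}  # item -> first node in that item's node-link chain
    tails = {}   # item -> last node in the chain, for O(1) appends

    # Scan 2: insert each transaction's frequent items in rank order.
    for t in transactions:
        sorted_items = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        node = root
        for item in sorted_items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].node_link = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header, frequent


# Transactions from Table 1 with support threshold 3.
table_1 = [
    {"A", "B", "C", "D"}, {"B", "C", "D"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "B", "C", "D"}, {"B", "D"},
]
root, header, frequent = build_fp_tree(table_1, 3)
b = root.children["B"]
print(b.count, b.children["D"].count, b.children["D"].children["C"].count)
```

On the Table 1 data the frequent items are ordered B, D, C, A, so every transaction shares the prefix B and the tree has a single branch out of the root. Following an item's node-link chain from the header table and summing the counts recovers that item's support.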

5. RELATED WORK
As the amount of data increases, the frequent itemsets of inter-transaction association rules become larger and larger and hard to handle [12]. As yet, no algorithm is available that deals with quantitative inter-transaction association rules. Very little work has been done to discover this kind of rule in the financial domain, and the area is still emerging.

6. CONCLUSION
Frequent itemset mining and association rule mining are two important tasks of data mining. Incorporating utility considerations in data mining tasks has gained popularity in recent years. In this paper we argue that the discovery of itemsets under all time windows of data streams can be achieved effectively with limited memory space, fewer candidate itemsets, and lower CPU and I/O cost. This meets the critical requirements of time and space efficiency for mining data streams. We have taken the two most influential factors, the closing price and traded volume of stocks on the National Stock Exchange (NSE) of India, because a company's stock price follows the movement of higher index-based companies with reasonably high support and confidence. Stock closing price, traded volume, business sector, earnings numbers, price-earnings ratio, rumors, and so on are some of the factors that influence the stock price. Not all of these factors can be easily modeled and embedded, since some of them are related to human psychology. An enhanced mechanism could provide a better trade-off between main memory use and interactive inter-transaction rule mining. We have used fixed set boundaries for the symbolic representation; applying fuzzy logic could provide more accuracy at the cost of higher computational complexity.

REFERENCES
1. Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
2. Agrawal, R., Imielinski, T., and Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Intl. Conf. on Management of Data, Washington, D.C., May 1993.
3. Bettini, C., Wang, X. S., and Jajodia, S. Testing complex temporal relationships involving multiple granularities and its application to data mining. In Proc. of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, 1996, Montreal, Canada. ACM Press.
4. Ayan, N.F., Tansel, A.U., and Arkun, E. An efficient algorithm to update large itemsets with early pruning. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, August.
5. Gerasimos Marketos, Konstantinos Pediaditakis, Yannis Theodoridis, and Babis Theodoulidis. Intelligent Stock Market Assistant Using Temporal Data Mining.
6. Keerti S. Mahajan and R. V. Kulkarni. Application of Data Mining Tools for Stock Market.
7. Cláudia M. Antunes and Arlindo L. Oliveira. Temporal Data Mining: An Overview. Instituto Superior Técnico, Dep. Engenharia Informática, Av. Rovisco Pais 1, Lisboa, Portugal.
8. Cheung, D., Han, J., Ng, V., and Wong, C.Y. Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. Proc. of the 1996 Int'l Conf. on Data Engineering, February 1996.
9. Swati Soni and Sini Shibu. Advance Mining of Temporal High Utility Itemset. I.J. Information Technology and Computer Science, 2012, 4, pp. 26-32. Published online April 2012 in MECS.

10. S. Luhr and S. Venkatesh. An extended frequent pattern tree for inter-transaction association rule mining. Technical Report.
11. A. K. H. Tung, H. Lu, J. Han, and L. Feng. Breaking the barrier of transactions: Mining inter-transaction association rules. Knowledge Discovery and Data Mining, August.
12. J. Dong and M. Han. IFCIA: An efficient algorithm for mining inter-transaction frequent closed itemsets. FSKD '07: Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery, August.
13. Dunham, Margaret H. Data Mining: Introductory and Advanced Topics, ch. 6. Prentice Hall, 1st edition (September 1, 2002).
14. Rajesh V. Argiddi and Sulabha S. Apte. An Evolutionary Fragment Mining Approach to Extract Stock Market Behavior for Investment Portfolio. International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 5, 2013.
15. R. Lakshman Naik, D. Ramesh, and B. Manjula. Instances Selection Using Advance Data Mining Techniques. International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012.
16. Charmy Patel and Ravi Gulati. Software Performance Analysis: A Data Mining Approach. International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 5, Issue 2, 2014.
