CHAPTER-5 A HASH BASED APPROACH FOR FREQUENT PATTERN MINING TO IDENTIFY SUSPICIOUS FINANCIAL PATTERNS


5.1 INTRODUCTION

Money laundering is a criminal activity used to disguise black money as white money. Technology is advancing rapidly, and this fast-changing technology brings demerits along with its merits. The advent of e-commerce has globalized the world, and with a single button click we can perform transactions of huge amounts. Detecting financial fraud is very important, as it poses a threat not only to financial institutions but also to the nation. Traditional investigative techniques aimed at uncovering such patterns consume numerous man-hours. Data mining techniques are well suited for identifying trends and patterns in large datasets, which often comprise hundreds or even thousands of complex hidden relationships. In spite of the guidelines issued by governing bodies such as the Reserve Bank of India and the Securities and Exchange Board of India, these guidelines are violated many a time. In India, all banks must submit the list of transactions that are not in line with the Reserve Bank of India guidelines to the Financial Intelligence Unit for further scrutiny. Generally, the transactions pertaining to a bank may be either intra-bank or inter-bank, and banks cannot request any sort of investigation unless there is a foolproof system for identifying money laundering activity. Considering the above facts, money laundering is a serious threat to financial institutions as well as to the nation, since it allows criminals to carry out illegal activities while hiding their personal identities. Although many anti-money-laundering techniques have been proposed, they have failed to act efficiently. In the current scenario, all the anti-money-laundering solutions adopted are rule based and consume numerous man-hours. In the Indian scenario, based on the guidelines given by the Reserve Bank of India, banks flag a few transactions that seem suspicious and send them to the Financial Intelligence Unit (FIU).

The FIU verifies whether the transaction is actually suspicious or not. This process is very time consuming and not suited to tackling dirty proceeds immediately. Hence it is very important to construct an efficient anti-money-laundering tool that helps banks report suspicious transactions. To improve the efficiency of the existing anti-money-laundering techniques, this module aims at identifying the suspicious accounts in the layering stage of the money laundering process by generating frequent transactional datasets using hash-based mining. The generated frequent datasets are then used in a graph-theoretic approach to identify the traversal path of the suspicious transactions. The major idea of the system is to generate frequent 2-itemsets over the transactional database using a hash-based technique. After applying the hash-based technique, a graph-theoretic approach identifies the sequential traversal path among the suspicious accounts found in the frequent transactional datasets. The graph-theoretic approach is applied to identify the agent and the integrator in the layering stage of money laundering.

The main purposes of this system are:

i) To prevent criminal elements from using the banking system for money laundering activities.
ii) To enable the bank to know/understand its customers and their financial dealings better, which in turn will help the bank manage risks prudently.
iii) To put in place appropriate controls for the detection and reporting of suspicious activities in accordance with applicable laws/laid-down procedures.

5.2 METHODOLOGY

The proposed system uses a hashing technique to generate frequent accounts. We work on transactional data from multiple banks. Hence each individual bank's data, stored in db1, db2, and so on, is taken together and combined to form a single large database.

The data of this large database has to be pre-processed in order to obtain data free from null, missing, or incorrect values. A hash-based technique is applied to the transactional dataset to obtain a candidate set of reduced size, and from this reduced candidate set we obtain the frequent 2-itemsets. These frequent 2-itemsets form the edges of the graph. Applying the longest-path algorithm for a directed acyclic graph yields the path along which the largest amount has been transferred. On the basis of the in-degree and out-degree of each node, we determine the agent and the integrator.

5.2.1 Applying the hash-based technique over the Apriori algorithm

A. Apriori algorithm

Association analysis is used to find relationships among data elements and to determine association rules. Two important association rule mining algorithms are Apriori and the hash-based approach; both find associations using minimum support and minimum confidence. Association analysis is divided into two sub-problems: the first is to find the account sets whose occurrence count exceeds the minimum support threshold, and the second is to generate association rules over large databases subject to the minimum confidence constraint. The Apriori algorithm works well only if the database is small and contains few transactions. Join indexing helps in identifying the links that exist among the suspicious transactions but is unable to establish the associations that exist among them. The Apriori algorithm relies on the Apriori property that every subset of a frequent account set must itself be frequent [108]. Using this principle, frequent itemsets are generated. The Apriori algorithm works as follows.

Apriori algorithm for discovering frequent accounts

1.  procedure apriori(T, minsupport)
2.  {
3.      // T is the database of suspicious transactions occurring between accounts (STOA),
        // and minsupport is s
4.      L1 = {frequent STOA};
5.      for (k = 2; Lk-1 != Ø; k++)
6.      {
7.          Ck = candidates generated from Lk-1
8.          // i.e. the Cartesian product Lk-1 x Lk-1, eliminating any (k-1)-STOA
            // that is not frequent
9.          minsupport = s;
10.         set_consists = 2;
11.         while (support value of all transactions > s)
12.         {
13.             generate frequent STOA of size (set_consists + 1);
14.             set_consists++;
15.             calculate support values;
16.         }
17.     end
18.     return Uk Lk    // Lk: frequent accounts of size k

Finally, all of the candidates satisfying the minimum support form the set of frequent accounts, L.

The Apriori algorithm is applied on the result set derived from join indexing. The procedure for generating the frequent transaction set is described below, considering a small financial dataset of transactions. The list of transactions found after applying join indexing is shown in Table 5.1.

Table-5.1: List of transactions from join indexing

UID  Account IDs    UID  Account IDs    UID  Account IDs    UID  Account IDs
T1   39, 12         T7   43, 16         T13  43, 16         T19  39, 12
T2   15, 12         T8   12, 16         T14  39, 12         T20  15, 12
T3   43, 16         T9   16, 19         T15  39, 12         T21  22, 39
T4   16, 22         T10  39, 12         T16  15, 12         T22  12, 16
T5   39, 12         T11  15, 12         T17  22, 39         T23  43, 16
T6   15, 12         T12  22, 39         T18  12, 16         T24  43, 16

Step 1: Scrutinize all of the transactions in order to count the number of occurrences of each pair of account IDs.

Table 5.2: Number of occurrences of each pair of account IDs

List of STOAs    Support count
12,15            5
12,16            3
12,39            6
12,43            0
15,16            0
15,39            0
15,43            0
16,39            0
16,43            5
39,43            0

Step 2: Considering the minimum support count = 3, the frequent STOAs are:

Table 5.3: List of frequent STOAs

List of account IDs    Support count
12,15                  5
12,16                  3
12,39                  6
16,43                  5

Step 3: From the derived 2-itemsets, a 3-itemset is derived using the modified Apriori algorithm.

Table 5.4: Generated 3-itemsets after applying the Apriori algorithm

List of account IDs
12,15,39
12,16,43

Further generation of association rules is not possible due to the non-availability of information: the financial database consists only of 2-itemset associations, and applying the Apriori algorithm we can generate at most 3-itemsets. The Apriori algorithm works well if there exists a chain of associations in the transactional account set, but the situation is different for financial transactions, where any transaction involves exactly two parties, not many. The Apriori algorithm also has drawbacks in reducing the number of candidate k-itemsets, in particular the candidate 2-itemsets; since these are the key to performance, the hash-based technique is used to improve it.
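The counting in Steps 1 and 2 can be illustrated with a short, runnable Java sketch (an illustration only; the class and method names are ours, not from the thesis implementation). It counts the support of every account pair occurring in Table 5.1 and prunes pairs below the minimum support count:

import java.util.HashMap;
import java.util.Map;

// Counts the support of candidate 2-itemsets (account pairs) and keeps those
// that meet the minimum support count, as in Steps 1 and 2 above.
public class FrequentPairMiner {

    public static Map<String, Integer> frequentPairs(int[][] transactions, int minSupport) {
        Map<String, Integer> counts = new HashMap<>();
        for (int[] t : transactions) {
            // Order each pair so that {39,12} and {12,39} count as the same itemset.
            int lo = Math.min(t[0], t[1]);
            int hi = Math.max(t[0], t[1]);
            counts.merge(lo + "," + hi, 1, Integer::sum);
        }
        // Apriori-style pruning: drop pairs below the minimum support count.
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        // The 24 transactions of Table 5.1, as (from, to) account-ID pairs.
        int[][] stoa = {
            {39,12},{15,12},{43,16},{16,22},{39,12},{15,12},   // T1-T6
            {43,16},{12,16},{16,19},{39,12},{15,12},{22,39},   // T7-T12
            {43,16},{39,12},{39,12},{15,12},{22,39},{12,16},   // T13-T18
            {39,12},{15,12},{22,39},{12,16},{43,16},{43,16}    // T19-T24
        };
        // With minimum support 3 this keeps 12,39=6; 12,15=5; 16,43=5; 12,16=3;
        // and every other pair reaching the threshold.
        System.out.println(frequentPairs(stoa, 3));
    }
}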

B. Hash-based technique

This technique is used to reduce the candidate k-itemsets, Ck, for k > 1. The hash function used here for creating the hash table is h(x, y) = ((order of x) * 10 + (order of y)) mod 7. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, map them into the various buckets of a hash table structure, and increase the corresponding bucket counts, and the process continues.

5.2.2 Identifying the suspicious transaction path using a graph-theoretic approach

To resolve the limitations of the hash-based approach and to further investigate the flow of money, a graph-theoretic approach is proposed. A graph is an ordered pair G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or lines [17]. There are different types of graphs: in a simple graph any two vertices are connected by at most one edge, a multigraph allows multiple edges between two vertices, and a pseudograph allows edges that connect a vertex to itself. From these we can further distinguish directed and undirected graphs: in a directed graph each edge linking two vertices has a direction, whereas in an undirected graph there is no direction between the vertices. In the proposed system, a directed graph G = (V, E) is used in which each node in V represents an account and E comprises the associations between two or more accounts.

5.2.3 Algorithm for the construction of the graph for identifying the path

1. Read the transaction details derived from the hash-based algorithm.
2. Add the account numbers as vertices in the graph.
3. Join two vertices if there is a transaction between the corresponding accounts.
4. Find the in-degree and out-degree of all vertices.
5. The vertex with in-degree zero is the source vertex and represents the agent in the placement phase of money laundering; the vertex with out-degree zero is the destination vertex and represents the integrator in the integration phase.
6. All possible paths between the agent and the integrator give us the layering information.

All the transactions are linked sequentially, and a graph is generated by considering each account in the frequent itemset as a node. Each link between transactions is assigned a weight to reflect the multiplicity of its occurrence and hence the strength of the path. The in-degree and out-degree of each node are then found to determine the agent and the integrator, as shown in the sketch below.
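A minimal Java sketch of this construction follows (illustrative only; the class name, edge list, and weights are assumptions, with the weights taken from the worked example of Section 5.3). It builds the directed graph, reports the agent and integrator from the vertex degrees, and finds the heaviest path by relaxing edges in topological order, which is valid because the transaction graph is assumed to be acyclic:

import java.util.*;

// Builds the account graph from frequent pairs, identifies agent/integrator by
// degree, and finds the maximum-weight path in the DAG via topological order.
public class SuspiciousPathFinder {

    public static void main(String[] args) {
        // Weighted edges (from, to, weight); weights reflect the multiplicity
        // of the transactions between the two accounts (illustrative values).
        int n = 7; // accounts A1..A6; index 0 unused
        int[][] edges = {{1,2,4},{2,3,2},{3,4,2},{3,5,4},{3,6,3},{4,5,3},{5,6,2}};

        List<int[]>[] adj = new List[n];
        int[] inDeg = new int[n], outDeg = new int[n];
        for (int i = 0; i < n; i++) adj[i] = new ArrayList<>();
        for (int[] e : edges) {
            adj[e[0]].add(new int[]{e[1], e[2]});
            outDeg[e[0]]++; inDeg[e[1]]++;
        }

        // Agent: in-degree 0 (placement); integrator: out-degree 0 (integration).
        for (int v = 1; v < n; v++) {
            if (inDeg[v] == 0 && outDeg[v] > 0) System.out.println("Agent: A" + v);
            if (outDeg[v] == 0 && inDeg[v] > 0) System.out.println("Integrator: A" + v);
        }

        // Longest (heaviest) path: relax edges in topological order (Kahn's algorithm).
        long[] best = new long[n];
        Arrays.fill(best, Long.MIN_VALUE);
        int[] parent = new int[n], deg = inDeg.clone();
        Deque<Integer> queue = new ArrayDeque<>();
        for (int v = 1; v < n; v++) if (deg[v] == 0) { queue.add(v); best[v] = 0; }
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int[] e : adj[u]) {
                if (best[u] + e[1] > best[e[0]]) { best[e[0]] = best[u] + e[1]; parent[e[0]] = u; }
                if (--deg[e[0]] == 0) queue.add(e[0]);
            }
        }

        // Report the heaviest path, tracing parents back from the best endpoint.
        int end = 1;
        for (int v = 1; v < n; v++) if (best[v] > best[end]) end = v;
        Deque<String> path = new ArrayDeque<>();
        for (int v = end; v != 0; v = parent[v]) { path.addFirst("A" + v); if (inDeg[v] == 0) break; }
        System.out.println("Most suspicious path: " + String.join(" -> ", path)
                + " (weight " + best[end] + ")");
    }
}

On the example edges this prints A1 as agent, A6 as integrator, and the path A1 -> A2 -> A3 -> A4 -> A5 -> A6 with total weight 13.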

5.3 IMPLEMENTATION

Hash-based technique over the Apriori algorithm: A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1, because a hash function is applied to each itemset of a transaction:

h(x, y) = ((order of x) * 10 + (order of y)) mod 7

Suppose we have the itemset {A1, A4}. Then x = 1 and y = 4, so h(1, 4) = ((1 * 10) + 4) mod 7 = 14 mod 7 = 0, and we place {A1, A4} in bucket address 0. Likewise we fill the hash table and record the bucket counts. If any bucket has a count less than the minimum support count, then that whole bucket (i.e., its entire contents) is discarded. All the undeleted bucket contents then form the elements of the candidate set. We thus obtain a candidate itemset that is smaller in size and need to scan the database fewer times to find the frequent itemsets, thereby improving the efficiency of the Apriori algorithm.

Candidate 2-itemset generation: All the contents of the undeleted hash-table buckets are copied, and the duplicate transactions are eliminated; we then obtain the candidate 2-itemsets.

Transitivity relation: Since only two accounts are involved in a transaction at a time, the chaining of accounts is found using the mathematical transitivity relation, i.e., if A->B and B->C, then A->B->C.

Frequent 3-itemsets:

From the transitivity relation we obtain the 3-itemsets; each itemset has an amount associated with it.

Generating a sequential traversal path: From the frequent accounts, we create the edges of the graph; the weight of each edge equals the amount transferred between the two accounts.

Longest path in a directed acyclic graph: There are many paths in the graph. To find the most suspicious path, we apply this algorithm and obtain the path with the largest total amount.

To understand the approach, let us consider a dataset of 22 transactions.

Generating frequent accounts using hashing: Consider a small transaction dataset of 22 transactions.

Table No-5.5: Dataset contents

Transaction_ID   From-to transaction   2-itemset
1                A1->A2                {1,2}
2                A2->A3                {2,3}
3                A3->A4                {3,4}
4                A1->A4                {1,4}
5                A4->A6                {4,6}
6                A5->A6                {5,6}
7                A3->A5                {3,5}
8                A3->A6                {3,6}
9                A4->A5                {4,5}
10               A1->A2                {1,2}
11               A5->A6                {5,6}
12               A3->A5                {3,5}
13               A3->A6                {3,6}
14               A1->A2                {1,2}
15               A3->A5                {3,5}
16               A3->A6                {3,6}
17               A4->A5                {4,5}
18               A1->A2                {1,2}
19               A3->A5                {3,5}
20               A4->A5                {4,5}
21               A3->A4                {3,4}
22               A2->A3                {2,3}

On this set of 22 transactions the hash formula h(x, y) = ((order of x) * 10 + (order of y)) mod 7 is applied, where x = from_acc_id and y = to_acc_id. All 22 transactions are grouped into different indexes of the hash table, and the bucket count is calculated for each bucket.

Table No-5.6: Bucket table with bucket counts

Bucket address   Bucket contents                              Bucket count
0                {1,4} {5,6} {5,6} {3,5} {3,5} {3,5} {3,5}    7
1                {3,6} {3,6} {3,6}                            3
2                {2,3} {2,3}                                  2
3                {4,5} {4,5} {4,5}                            3
4                {4,6}                                        1
5                {1,2} {1,2} {1,2} {1,2}                      4
6                {3,4} {3,4}                                  2
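The bucket table above can be reproduced with a short Java sketch of the hashing step (a minimal illustration; the class and variable names are ours, not from the thesis code):

import java.util.HashMap;
import java.util.Map;

// Maps each 2-itemset {x, y} to a bucket via h(x, y) = (x*10 + y) mod 7 and
// tallies the bucket counts, reproducing Table 5.6 for the 22-transaction set.
public class HashBuckets {

    static int h(int x, int y) {
        return (x * 10 + y) % 7; // the "order of" account Ai is its index i
    }

    public static void main(String[] args) {
        // The 2-itemsets of Table 5.5, in transaction order 1..22.
        int[][] pairs = {
            {1,2},{2,3},{3,4},{1,4},{4,6},{5,6},{3,5},{3,6},{4,5},{1,2},{5,6},
            {3,5},{3,6},{1,2},{3,5},{3,6},{4,5},{1,2},{3,5},{4,5},{3,4},{2,3}
        };
        int[] bucketCount = new int[7];
        Map<String, Integer> itemsetCount = new HashMap<>();
        for (int[] p : pairs) {
            bucketCount[h(p[0], p[1])]++;                      // Table 5.6 counts
            itemsetCount.merge(p[0] + "," + p[1], 1, Integer::sum);
        }
        for (int b = 0; b < 7; b++)
            System.out.println("bucket " + b + ": count " + bucketCount[b]);
        // A bucket below the minimum bucket count (2, in the next step) is
        // discarded whole; here bucket 4, holding only {4,6}, is pruned.
    }
}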

A minimum bucket count is entered; the buckets whose total count is less than the minimum bucket count are then deleted with all of their contents. Here bucket 4 is deleted.

Minimum bucket count = 2

Table No-5.7: Bucket counts for itemsets and the minimum bucket count

Itemset   Bucket count
1,4       7
5,6       7
3,5       7
3,6       3
2,3       2
4,5       3
4,6       1 (*discarded)
1,2       4
3,4       2

The left-over transactions in the buckets are then taken, and their actual counts in the database are recorded.

Table No-5.8: Bucket counts and actual counts

Itemset   Bucket count   Actual count
1,4       7              1 (*discarded)
5,6       7              2
3,5       7              4
3,6       3              3
2,3       2              2
4,5       3              3
1,2       4              4
3,4       2              2

A minimum support count for the number of occurrences of a transaction is entered (say 2).

Minimum support count = 2

All the transactions that have occurred 2 or more times are then taken into the frequent 2-itemsets.

Table No-5.9: Frequent 2-accounts with their support counts

Frequent 2-itemset   Support count
5,6                  2
3,5                  4
3,6                  3
2,3                  2
4,5                  3
1,2                  4
3,4                  2

These are the frequent-2 transactions.

Finding the traversal path: Various paths are identified by connecting all the frequent accounts as nodes.

[Fig No-5.1: The graph of suspicious accounts. A weighted directed graph over accounts A1-A6 with edge weights W12 = 4, W23 = 2, W34 = 2, W35 = 4, W36 = 3, W45 = 3, and W56 = 2; A1, with in-degree 0, is the agent, and A6, with out-degree 0, is the integrator.]

Some of the packages used are:

java.io.*: Java IO is an API that comes with Java, targeted at reading and writing data (input and output).

java.util.Iterator: to generate successive elements from a series, we can use a Java iterator.

java.util.Vector: the Vector class implements a growable array of objects. Like an array, it contains components that can be accessed using an integer index; however, the size of a Vector can grow or shrink as needed to accommodate adding and removing items after the Vector has been created.

java.sql.*: provides the API for accessing and processing data stored in a data source (usually a relational database) using the Java programming language. This API includes a framework whereby different drivers can be installed dynamically to access different data sources.

java.util.Scanner: the java.util.Scanner class is a simple text scanner which can parse primitive types and strings using regular expressions.

Database: The databases are maintained in SQL Server Management Studio; the tables were created using SQL queries.

Datasets: We have four datasets:
1) TwentyTwo - having twenty-two transactions.
2) FiveThousand - having five thousand transactions.
3) TenThousand - having ten thousand transactions.
4) SeventeenThousand - having seventeen thousand transactions.
These four datasets were created as four Transactions tables with the same attributes but different numbers of records.

Tables: 1) Bank 2) Customer 3) Accounts 4) Transactions. All the data inserted into these tables is synthetic and free from null and missing values. Four transaction tables are created to store the varied sizes of the dataset.

Each table created has a primary key associated with it: the Bank table has bank_id as its primary key, the Customer table has customer_id, the Account table has account_id, and the Transaction table has trans_id.

Table insertion (example queries):

insert into bank values('sbi','mvp','visakapatnam','a.p')
insert into customer values('bharath kumar chowhan','it.employee','6','aaxpd7874l','97788593','ranga reddy','1980-01-01','male')
insert into account values('56638790','6','1','1998-01-01','162557.00')
insert into transactions values(103,51,'12/1/2013 9:00:00 AM',20173,'initiated',null,null)

5.4 SUMMARY

By considering synthetic datasets of different sizes, up to 20,000 transactions, we could address the issue of detecting suspicious accounts faced by the existing anti-money-laundering techniques. We succeeded in identifying the suspicious accounts in the layering stage of the money laundering process by generating frequent transactional datasets using hash-based mining. Further, we were also able to identify the traversal path of the suspicious transactions using the longest path in a directed acyclic graph. The degree of each node, examined through graph theory, is then taken as the basis for identifying the agent and the integrator.