CHAPTER-5 A HASH BASED APPROACH FOR FREQUENT PATTERN MINING TO IDENTIFY SUSPICIOUS FINANCIAL PATTERNS
5.1 INTRODUCTION

Money laundering is a criminal activity used to disguise black money as white money. Technology is advancing rapidly, and this fast-changing technology brings demerits as well as merits. The advent of e-commerce has globalized the world, and with a single button click a huge volume of transactions can be performed. Detecting financial fraud is very important because it poses a threat not only to financial institutions but also to the nation. Traditional investigative techniques aimed at uncovering patterns consume numerous man-hours. Data mining techniques are well suited for identifying trends and patterns in large datasets, which often comprise hundreds or even thousands of complex hidden relationships.

In spite of the guidelines issued by governing bodies such as the Reserve Bank of India and the Securities and Exchange Board of India, these guidelines are often violated. In India, all banks must submit the list of transactions that are not in line with the Reserve Bank of India guidelines to the Financial Intelligence Unit for further scrutiny. Generally, the transactions pertaining to a bank may be either intrabank or interbank transactions, and the banks cannot request any sort of investigation unless there is a foolproof system for identifying the money laundering activity.

In view of the above facts, money laundering is considered a serious threat to financial institutions as well as to the nation, because it allows illegal activities to be carried out while the identities of those involved remain hidden. Although many anti-money laundering techniques have been proposed, they have failed to act efficiently. In the current scenario, all the anti-money laundering solutions adopted are rule based and consume numerous man-hours.
In the Indian scenario, when an individual is considered, banks determine, based on the guidelines given by the Reserve Bank of India, the few transactions which seem to be suspicious and send them to the Financial Intelligence Unit (FIU).
FIU verifies whether the transaction is actually suspicious or not. This process is very time consuming and not suitable for tackling dirty proceeds immediately. Hence it is very important to construct an efficient anti-money laundering tool that helps banks report suspicious transactions. This module therefore aims to improve the efficiency of the existing anti-money laundering techniques by identifying the suspicious accounts in the layering stage of the money laundering process, generating frequent transactional datasets using hash based mining. The generated frequent datasets are then used in the graph theoretic approach to identify the traversal path of the suspicious transactions.

The major idea of the system is to generate frequent 2-item sets on the transactional database using a hash based technique. After the hash based technique is applied, the sequential traversal path among the suspicious accounts found in the frequent transactional datasets is identified using a graph theoretic approach. The graph theoretic approach is applied to identify the agent and the integrator in the layering stage of money laundering.

The main purposes of this system are:
i) To prevent criminal elements from using the banking system for money laundering activities.
ii) To enable the bank to know/understand the customers and their financial dealings better, which in turn will help the bank to manage risks prudently.
iii) To put in place appropriate controls for detection and reporting of suspicious activities in accordance with applicable laws/laid down procedures.

5.2 METHODOLOGY

The proposed system uses a hashing technique to generate frequent accounts. We work on transactional data from multiple banks. Hence each individual bank's data, stored in db1, db2, and so on, is taken together and combined to form a single
large database. The data of this large database must be pre-processed to obtain data that is free from null, missing or incorrect values. A hash based technique is applied on the transactional dataset to obtain a candidate set of reduced size. From this reduced candidate set we obtain the frequent 2-item sets. These frequent 2-item sets form the edges of the graph. By applying the longest path algorithm for a directed acyclic graph we obtain the path along which the largest amount has been transferred. On the basis of the in-degree and out-degree of each node, we determine the agent and the integrator.

5.2.1: Applying hash-based technique over apriori algorithm

A. Apriori algorithm

Association analysis is used to find the relationships among data elements and to determine association rules. Two important association rule mining algorithms are apriori and the hash based approach. They find associations using the minimum support and the minimum confidence. Association analysis is divided into two sub-problems: the first is to find the accounts whose number of occurrences exceeds the threshold, and the second is to generate association rules over large databases subject to the minimum confidence constraint. The apriori algorithm works well only if the database is small and contains few transactions. Join indexing helps in identifying the links that exist among the suspicious transactions but is unable to establish the associations that exist among them. The apriori algorithm relies on the apriori property that every subset of a frequent account set must also be frequent [108]. Using this principle, the frequent item sets are generated. The apriori algorithm works as follows.
Apriori algorithm for discovering frequent accounts

procedure apriori(T, minsupport)
{
    // T is the database of suspicious transactions occurred between accounts (STOA)
    // and minsupport is s
    L1 = {frequent STs};
    for (k = 2; L(k-1) != Ø; k++)
    {
        Ck = candidates generated from L(k-1);
        // that is, the Cartesian product L(k-1) x L(k-1), eliminating any
        // candidate that contains a (k-1)-subset which is not frequent
        calculate the support value of every candidate in Ck;
        Lk = {candidates in Ck whose support value >= s};
    }
    return U_k Lk;    // Lk: frequent accounts of size k
}

Finally, all of those candidates satisfying the minimum support form the set of frequent accounts, L.
The apriori algorithm is applied on the result set derived from join indexing. The procedure for generating the frequent transaction set is described below by considering a small financial dataset of transactions. The list of transactions found after applying join indexing is shown in Table 5.1.

Table-5.1: List of transactions from join indexing

UID   List of Account IDs   UID   List of Account IDs   UID   List of Account IDs   UID   List of Account IDs
T1    39, 12                T7    43, 16                T13   43, 16                T19   39, 12
T2    15, 12                T8    12, 16                T14   39, 12                T20   15, 12
T3    43, 16                T9    16, 19                T15   39, 12                T21   22, 39
T4    16, 22                T10   39, 12                T16   15, 12                T22   12, 16
T5    39, 12                T11   15, 12                T17   22, 39                T23   43, 16
T6    15, 12                T12   22, 39                T18   12, 16                T24   43, 16

Step 1: In the first step, scan all of the transactions in order to count the number of occurrences of each pair of account IDs.

Table 5.2: No. of occurrences of each pair of account IDs

List of STOAs   Support count
12,15           5
12,16           3
12,39           6
12,43           0
15,16           0
15,39           0
15,43           0
16,39           0
16,43           5
39,43           0
Step 2: Considering the minimum support count = 3, the frequent STOAs are

Table 5.3: List of STOAs

List of account IDs   Support count
12,15                 5
12,16                 3
12,39                 6
16,43                 5

Step 3: From the derived 2-item sets, a 3-item set is derived using the modified apriori algorithm.

Table 5.4: Generated 3-item sets after applying apriori algorithm

List of account IDs
12,15,39
12,16,43

Further generation of association rules is not possible because no more information is available. The financial database consists of only 2-item set associations, and by applying the apriori algorithm we can generate only 3-item sets. The apriori algorithm works well if there exists a chain of associations in the transactional account set, but the situation is different in the case of financial transactions: any financial transaction is between two players, not between many. The apriori algorithm also has drawbacks in reducing the number of candidate k-item sets. Because the candidate 2-item sets in particular are the key to improving performance, we used the hash based technique to reduce them.
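The support counting of Steps 1 and 2 can be sketched in Java, the language of the implementation described later in this chapter. This is a minimal sketch under stated assumptions, not the thesis code: the class name PairSupport, its method names, and the choice to treat each transaction as an unordered pair of account IDs are all hypothetical. The input is Table 5.1 as printed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: class and method names are hypothetical.
final class PairSupport {

    // Table 5.1 as printed: one pair of account IDs per transaction T1..T24.
    static final int[][] TABLE_5_1 = {
        {39, 12}, {15, 12}, {43, 16}, {16, 22}, {39, 12}, {15, 12},
        {43, 16}, {12, 16}, {16, 19}, {39, 12}, {15, 12}, {22, 39},
        {43, 16}, {39, 12}, {39, 12}, {15, 12}, {22, 39}, {12, 16},
        {39, 12}, {15, 12}, {22, 39}, {12, 16}, {43, 16}, {43, 16}
    };

    // Key for an unordered pair: smaller ID first, e.g. "12,39".
    static String key(int a, int b) {
        return Math.min(a, b) + "," + Math.max(a, b);
    }

    // Step 1: count how many transactions contain each pair.
    static Map<String, Integer> supportCounts(int[][] transactions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int[] t : transactions) {
            counts.merge(key(t[0], t[1]), 1, Integer::sum);
        }
        return counts;
    }

    // Step 2: keep the pairs whose support count reaches the minimum.
    static List<String> frequentPairs(Map<String, Integer> counts, int minSupport) {
        List<String> frequent = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minSupport) {
                frequent.add(e.getKey());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = supportCounts(TABLE_5_1);
        System.out.println(counts);
        System.out.println(frequentPairs(counts, 3));
    }
}
```

With a minimum support count of 3 the sketch reports the four pairs of Table 5.3 with the same counts. Run on Table 5.1 as printed, it also counts the pair {22, 39} three times (T12, T17, T21), although Table 5.2 enumerates only pairs drawn from accounts 12, 15, 16, 39 and 43.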
B. Hash based technique:

This technique is used to reduce the candidate k-item sets, Ck, for k > 1. The hash function used here for creating the hash table is

    h(x, y) = ((order of x) * 10 + (order of y)) mod 7

For example, when scanning each transaction in the database to generate the frequent 1-item sets, L1, from the candidate 1-item sets in C1, we can generate all of the 2-item sets for each transaction, map them into the buckets of a hash table structure and increase the corresponding bucket counts, and the process continues.

5.2.2: Identifying suspicious transactions path using graph theoretic approach

To resolve this situation in the hash based approach and to further investigate the flow of money, a graph theoretic approach is proposed. A graph is an ordered pair G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or lines [17]. There are different types of graphs: in a simple graph each pair of distinct vertices is connected by at most one edge, a multigraph allows multiple edges between two vertices, and a pseudograph also allows edges that connect a vertex to itself. Graphs are further divided into directed and undirected graphs. A directed graph is a graph in which each edge has a direction linking its vertices, whereas in an undirected graph the edges have no direction. In the proposed system a directed graph G = (V, E) is used, in which each node in V is an account and E comprises the associations between two or more accounts.
5.2.3 Algorithm for the construction of graph for identifying the path

1. Read the transaction details derived from the hash based algorithm.
2. Add the account numbers as vertices in the graph.
3. Join two vertices if there is a transaction between the corresponding accounts.
4. Find the in-degree and out-degree of all vertices.
5. The vertex with in-degree zero is the source vertex and represents the agent in the placement phase of money laundering; the vertex with out-degree zero is the destination vertex and represents the integrator in the integration phase of money laundering.
6. All possible paths between the agent and the integrator give the layering information.

All the transactions are linked sequentially, and a graph is generated by considering each account in the frequent item sets as a node. For each link between transactions, a weight is assigned to reflect the multiplicity of its occurrence and hence the strength of the path. The in-degree and out-degree of each node are then found to determine the agent and the integrator.
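Steps 4 to 6 can be sketched in Java as follows. This is an illustrative sketch, not the thesis implementation: the class name LaunderingGraph and its methods are hypothetical, the edge list is borrowed from the frequent pairs of the worked example in Section 5.3 (Table 5.9), and each weight is taken to be the number of occurrences of the pair. The heaviest path is found with a topological order followed by dynamic programming, which is one standard way to compute a longest path in a directed acyclic graph.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: class and method names are hypothetical.
final class LaunderingGraph {

    // Directed edges {from, to, weight} over accounts A1..A6 (assumed input).
    static final int[][] EDGES = {
        {1, 2, 4}, {2, 3, 2}, {3, 4, 2}, {3, 5, 4}, {3, 6, 3}, {4, 5, 3}, {5, 6, 2}
    };

    // Number of incoming edges per vertex; index 0 is unused.
    static int[] inDegree(int[][] edges, int n) {
        int[] d = new int[n + 1];
        for (int[] e : edges) d[e[1]]++;
        return d;
    }

    // Number of outgoing edges per vertex; index 0 is unused.
    static int[] outDegree(int[][] edges, int n) {
        int[] d = new int[n + 1];
        for (int[] e : edges) d[e[0]]++;
        return d;
    }

    // Heaviest path from source to sink in a DAG; empty list if unreachable.
    static List<Integer> heaviestPath(int[][] edges, int n, int source, int sink) {
        int[] remaining = inDegree(edges, n);
        Deque<Integer> ready = new ArrayDeque<>();
        for (int v = 1; v <= n; v++) {
            if (remaining[v] == 0) ready.add(v);
        }
        long[] best = new long[n + 1];   // best[v] = heaviest weight from source to v
        int[] prev = new int[n + 1];     // predecessor of v on that path
        Arrays.fill(best, Long.MIN_VALUE);
        best[source] = 0;
        while (!ready.isEmpty()) {       // vertices leave the queue in topological order
            int u = ready.poll();
            for (int[] e : edges) {
                if (e[0] != u) continue;
                if (best[u] != Long.MIN_VALUE && best[u] + e[2] > best[e[1]]) {
                    best[e[1]] = best[u] + e[2];
                    prev[e[1]] = u;
                }
                if (--remaining[e[1]] == 0) ready.add(e[1]);
            }
        }
        List<Integer> path = new ArrayList<>();
        if (best[sink] == Long.MIN_VALUE) return path;
        for (int v = sink; v != source; v = prev[v]) path.add(0, v);
        path.add(0, source);
        return path;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(inDegree(EDGES, 6)));
        System.out.println(Arrays.toString(outDegree(EDGES, 6)));
        System.out.println(heaviestPath(EDGES, 6, 1, 6));
    }
}
```

On this edge list A1 is the only vertex with in-degree zero (the agent) and A6 the only vertex with out-degree zero (the integrator). The heaviest path between them is A1, A2, A3, A4, A5, A6 with total weight 4 + 2 + 2 + 3 + 2 = 13. Transferred amounts could be substituted for the occurrence counts without changing the method.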
5.3 IMPLEMENTATION

Hash based technique over apriori algorithm:
A hash based technique can be used to reduce the size of the candidate k-item sets, Ck, for k > 1. This is because in this technique we apply a hash function to each item set of the transaction:

    h(x, y) = ((order of x) * 10 + (order of y)) mod 7

Suppose we have an item set {A1, A4}. Then x = 1 and y = 4. Hence h(1, 4) = ((1 * 10) + 4) mod 7 = 14 mod 7 = 0, so we place {A1, A4} in bucket address 0. Likewise we fill the hash table and record the bucket counts. If any bucket has a count less than the minimum support count, then that whole bucket (i.e., its entire contents) is discarded. The contents of all the undeleted buckets now form the elements of the candidate set. Thus we have a candidate item set which is smaller in size, and hence we need to scan the database fewer times to find the frequent item sets, thereby improving the efficiency of the apriori algorithm.

Candidate 2-item set generation:
All the contents of the undeleted hash table buckets are copied and the duplicate transactions are eliminated. We then obtain the candidate 2-item set.

Transitivity relation
As only 2 accounts are involved in a transaction at a time, we have used the mathematical transitivity relation to find the chaining of accounts, i.e., if A->B and B->C, then A->B->C.

Frequent 3 Item sets
From the transitivity relation we obtain 3-item sets. These item sets have the amounts associated with them.

Generating a sequential traversal path:
From the frequent accounts we create the edges of the graph; the weight of each edge is equal to the amount transferred between those two accounts.

Longest path in a directed acyclic graph
There are many paths in the graph. To find the most suspicious path, we apply this algorithm and obtain the path together with its total amount.

Generating frequent accounts using hashing
To understand the approach, let us consider a small dataset of 22 transactions.

Table No-5.5: Dataset contents

Transaction_ID   From-to transaction   2-item set
1                A1->A2                {1,2}
2                A2->A3                {2,3}
3                A3->A4                {3,4}
4                A1->A4                {1,4}
5                A4->A6                {4,6}
6                A5->A6                {5,6}
7                A3->A5                {3,5}
8                A3->A6                {3,6}
9                A4->A5                {4,5}
10               A1->A2                {1,2}
11               A5->A6                {5,6}
12               A3->A5                {3,5}
13               A3->A6                {3,6}
14               A1->A2                {1,2}
15               A3->A5                {3,5}
16               A3->A6                {3,6}
17               A4->A5                {4,5}
18               A1->A2                {1,2}
19               A3->A5                {3,5}
20               A4->A5                {4,5}
21               A3->A4                {3,4}
22               A2->A3                {2,3}

The hash formula is applied on this set of 22 transactions:

    h(x, y) = ((order of x) * 10 + (order of y)) mod 7

Here x = from_acc_id and y = to_acc_id. All 22 transactions are grouped into the different indexes of the hash table, and the bucket count is then calculated for each bucket.

Table No-5.6: Bucket table with bucket counts

Bucket address    0     1     2     3     4     5     6
Bucket contents   1,4   3,6   2,3   4,5   4,6   1,2   3,4
                  5,6   3,6   2,3   4,5         1,2   3,4
                  5,6   3,6         4,5         1,2
                  3,5                           1,2
                  3,5
                  3,5
                  3,5
Bucket count      7     3     2     3     1     4     2
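The bucket assignment behind Table No-5.6 can be reproduced with a short Java sketch. The class name HashBuckets and its methods are hypothetical and serve only as an illustration; the hash function and the 22 transactions are those given in the text, with the account order (1 for A1, 2 for A2, and so on) used as x and y.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: class and method names are hypothetical.
final class HashBuckets {

    static final int BUCKETS = 7;

    // From/to account orders for transactions 1..22 of Table No-5.5.
    static final int[][] TABLE_5_5 = {
        {1, 2}, {2, 3}, {3, 4}, {1, 4}, {4, 6}, {5, 6}, {3, 5}, {3, 6},
        {4, 5}, {1, 2}, {5, 6}, {3, 5}, {3, 6}, {1, 2}, {3, 5}, {3, 6},
        {4, 5}, {1, 2}, {3, 5}, {4, 5}, {3, 4}, {2, 3}
    };

    // h(x, y) = ((order of x) * 10 + (order of y)) mod 7
    static int hash(int x, int y) {
        return (x * 10 + y) % BUCKETS;
    }

    // Place the 2-item set of every transaction into its bucket.
    static List<List<int[]>> fillBuckets(int[][] transactions) {
        List<List<int[]>> table = new ArrayList<>();
        for (int i = 0; i < BUCKETS; i++) {
            table.add(new ArrayList<>());
        }
        for (int[] t : transactions) {
            table.get(hash(t[0], t[1])).add(t);
        }
        return table;
    }

    // Bucket count = number of item sets that landed in each bucket.
    static int[] bucketCounts(List<List<int[]>> table) {
        int[] counts = new int[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            counts[i] = table.get(i).size();
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = bucketCounts(fillBuckets(TABLE_5_5));
        System.out.println(Arrays.toString(counts)); // prints [7, 3, 2, 3, 1, 4, 2]
    }
}
```

Note that three different item sets, {1,4}, {5,6} and {3,5}, collide in bucket 0, which is why the bucket count is only an upper bound on the actual count of each item set in that bucket.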
Enter the minimum bucket count.
Minimum bucket count = 2
The buckets whose total count is less than the minimum bucket count are deleted with all their contents. Here bucket 4 is deleted.

Table No-5.7: Bucket count for item sets and minimum support count

Item set   Bucket count
1,4        7
5,6        7
3,5        7
3,6        3
2,3        2
4,5        3
4,6        1 (*discarded)
1,2        4
3,4        2

Now the transactions left over in the buckets are taken and their actual count in the database is recorded.
Table No-5.8: The bucket count and actual count are recorded

Item set   Bucket count   Actual count
1,4        7              1 (*discarded)
5,6        7              2
3,5        7              4
3,6        3              3
2,3        2              2
4,5        3              3
1,2        4              4
3,4        2              2

Enter a support count, that is, the minimum number of times a transaction must occur (say 2).
Minimum support count = 2
All the transactions which have occurred 2 or more times are taken into the frequent 2-item sets.

Table No-5.9: Frequent 2 accounts with their support counts

Frequent-2 item set   Support count
5,6                   2
3,5                   4
3,6                   3
2,3                   2
4,5                   3
1,2                   4
3,4                   2
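The two pruning steps, by bucket count and then by actual count, can be sketched in Java as follows. This is a hedged illustration rather than the thesis code: the class name FrequentPairs, the method signature and the two-pass layout are assumptions, while the hash function, the 22 transactions and both thresholds (2 and 2) come from the worked example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical.
final class FrequentPairs {

    static final int BUCKETS = 7;

    // From/to account orders for transactions 1..22 of Table No-5.5.
    static final int[][] TABLE_5_5 = {
        {1, 2}, {2, 3}, {3, 4}, {1, 4}, {4, 6}, {5, 6}, {3, 5}, {3, 6},
        {4, 5}, {1, 2}, {5, 6}, {3, 5}, {3, 6}, {1, 2}, {3, 5}, {3, 6},
        {4, 5}, {1, 2}, {3, 5}, {4, 5}, {3, 4}, {2, 3}
    };

    static int hash(int x, int y) {
        return (x * 10 + y) % BUCKETS;
    }

    static Map<String, Integer> frequentPairs(int[][] transactions,
                                              int minBucketCount,
                                              int minSupport) {
        // Pass 1: bucket counts only (Table No-5.6).
        int[] bucketCount = new int[BUCKETS];
        for (int[] t : transactions) {
            bucketCount[hash(t[0], t[1])]++;
        }

        // Pass 2: actual counts, restricted to item sets whose bucket
        // survived the minimum bucket count (Tables No-5.7 and No-5.8).
        Map<String, Integer> actual = new LinkedHashMap<>();
        for (int[] t : transactions) {
            if (bucketCount[hash(t[0], t[1])] >= minBucketCount) {
                actual.merge(t[0] + "," + t[1], 1, Integer::sum);
            }
        }

        // Keep the item sets that reach the minimum support count (Table No-5.9).
        Map<String, Integer> frequent = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : actual.entrySet()) {
            if (e.getValue() >= minSupport) {
                frequent.put(e.getKey(), e.getValue());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        System.out.println(frequentPairs(TABLE_5_5, 2, 2));
    }
}
```

With both thresholds set to 2, the item set {4,6} is dropped with its bucket, {1,4} is dropped because its actual count is 1, and the seven remaining item sets and counts match Table No-5.9.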
The item sets in Table No-5.9 are the frequent 2-item sets.

Finding the traversal path:
Various paths are identified by connecting all the frequent accounts as nodes.

[Figure: directed graph over the accounts A1 to A6. Edge labels: W32 = 4 (on the edge between A1 and A2), W23 = 2, W34 = 2, W35 = 4, W36 = 3, W45 = 3, W56 = 2. A1 has in-degree 0 (agent); A6 has out-degree 0 (integrator).]
Fig No-5.1: The graph of suspicious accounts

Some of the packages used are:

java.io.*: Java IO is an API that comes with Java and is targeted at reading and writing data (input and output).

java.util.Iterator: A Java iterator is used to generate successive elements from a series.

java.util.Vector: The Vector class implements a growable array of objects. Like an array, it contains components that can be accessed using an integer index. However, the size of
a Vector can grow or shrink as needed to accommodate adding and removing items after the Vector has been created.

java.sql.*: Provides the API for accessing and processing data stored in a data source (usually a relational database) using the Java programming language. This API includes a framework whereby different drivers can be installed dynamically to access different data sources.

java.util.Scanner: The java.util.Scanner class is a simple text scanner which can parse primitive types and strings using regular expressions.

Database
We have maintained the databases in SQL Server Management Studio. For this we have created tables using SQL queries.

Dataset
We have 4 datasets:
1) TwentyTwo, having twenty-two transactions.
2) FiveThousand, having five thousand transactions.
3) TenThousand, having ten thousand transactions.
4) SeventeenThousand, having seventeen thousand transactions.
These four datasets are created by creating four tables for Transactions with the same attributes but with different numbers of records.

Tables:
1) Bank
2) Customer
3) Accounts
4) Transactions
All the data inserted into these tables is synthetic and is free from null values and missing values. Four transaction tables are created to store the datasets of varied size.
Each of the tables created has a primary key associated with it:
The Bank table has bank_id as primary key.
The Customer table has customer_id as primary key.
The Account table has account_id as primary key.
The Transaction table has trans_id as primary key.

Table insertion:
Example queries:

    insert into bank values('sbi','mvp','visakapatnam','a.p')
    insert into customer values('bharath kumar chowhan','it.employee','6','aaxpd7874l','97788593','ranga reddy','1980-01-01','male')
    insert into account values('56638790','6','1','1998-01-01','162557.00')
    insert into transactions values(103,51,'12/1/2013 9:00:00 AM',20173,'initiated',null,null)

5.4 SUMMARY

By considering synthetic datasets of different sizes, of 20000 transactions, we could address the issue of detecting suspicious accounts using the existing anti-money laundering techniques. We were successful in identifying the suspicious accounts in the layering stage of the money laundering process by generating frequent transactional datasets using hash based mining. Further, we were also able to identify the traversal path of the suspicious transactions using the longest path in a directed acyclic graph. The degree of each node, examined using graph theory, is then taken as the basis for identifying the agent and the integrator.