Mining Vehicles Frequently Appearing Together from Massive Passing Records

Appl. Math. Inf. Sci. 9, No. 3, 1427-1433 (2015)
Applied Mathematics & Information Sciences, An International Journal
http://dx.doi.org/10.12785/amis/090337

Dongjin Yu 1,*, Wensheng Dou 1, Wanqing Li 1, Suhang Zheng 1 and Jianhua Shao 2,3

1 Hangzhou Dianzi University, Hangzhou, China
2 Zhejiang Topcheer Information Technology Co., Ltd, Hangzhou, China
3 Zhejiang Provincial Key Laboratory of Network Technology and Information Security, China

Received: 7 Aug. 2014, Revised: 8 Nov. 2014, Accepted: 9 Nov. 2014
Published online: 1 May 2015

Abstract: Vehicles Frequently Appearing Together, or VFATs, can be clues in solving criminal cases. Traditional sequence mining approaches help identify VFATs from passing-through records collected at monitoring sites. However, huge traffic data streams hinder fast identification of VFATs. In this paper, we present a multi-threaded approach to fast identification of VFATs based on multi-core processors, called Frequent Sequential Mining based on Multi-Cores (FSMMC). It parallelizes the execution of tasks, partitions large volumes of data, and obtains VFATs by merging local candidates discovered in different threads running on different processor cores. Through local parallel reduction, FSMMC eliminates repetitive patterns and reduces computational effort. Moreover, it achieves workload balance by dynamically distributing tasks to a pool of threads, where the thread that finishes first joins another running thread. Both theoretical analysis and case studies show that FSMMC takes full advantage of multi-core computing platforms and achieves a higher speed-up when searching for VFATs among massive passing-through records, compared with other approaches without multithreading.

Keywords: massive data mining, parallel, sequential patterns, multi-core, Vehicles Frequently Appearing Together

1 Introduction

When solving criminal cases, Vehicles Frequently Appearing Together, or VFATs, can sometimes be valuable clues.
Collecting records on vehicles passing through different monitoring sites and then searching for vehicles frequently appearing together has proven to be an effective way to find VFATs. However, such investigation always involves large traffic streams and therefore takes a long time. Moreover, VFATs have high mobility and can usually escape notice. How to quickly identify VFATs from massive traffic data streams therefore becomes a key issue.

In recent years, various methods of data mining have matured and been applied widely in various fields, including the discovery of motifs in DNA sequences, the analysis of web logs and customer shopping sequences, and the study of XML query access patterns [1]. Frequent pattern discovery or sequential mining, which was pioneered by the works of Agrawal et al. in the Apriori algorithm [2], could be used to find VFATs. The frequent pattern problem is, given a minimum support threshold min_sup, to discover all the item sets that occur at least min_sup times in the database. Here, vehicles frequently appearing together can be regarded as frequent patterns, i.e., they often appear somewhere as a whole.

High-performance computation utilities, such as multi-core and many-core servers, offer ideal mining platforms. The problem in finding VFATs is therefore how to fully exploit the parallelism, or harness the power, of these multi-core processors. A number of works have focused on parallel formulations for finding frequent patterns on shared-memory computers and GPU nodes [3, 4]. Indeed, many parallel computing models already exist. For example, OpenMP is a well-known parallel framework supporting multi-platform shared-memory parallel programming in C/C++. Although OpenMP is simple to use because of its automatic data layout and decomposition by directives, it lacks reliable error handling and fine-grained mechanisms to control thread-processor mapping.

In this paper, we present a novel parallel frequent sequential mining approach employing multi-cores, called FSMMC (Frequent Sequential Mining based on Multi-Cores), to search for VFATs. FSMMC takes threads as the parallel unit and can minimize memory bandwidth use and maximize cache reuse. Both theoretical analysis and case studies indicate that it is an efficient multi-core implementation.

The structure of the paper is as follows. In Section 2, we briefly define the problem of extracting frequent patterns with multi-core processors, describe the approach in depth, and provide the necessary theoretical background. In Section 3, we theoretically evaluate the performance of the approach; we thus prove that the method is precise, with lower calculation complexity and more feasibility. Section 4 presents the detailed experimental results, comparing the approach with different numbers of running threads on a multi-core processor. In Section 5 we then review the current state of parallel pattern mining technology. Finally, Section 6 concludes the paper and gives directions for future work.

* Corresponding author e-mail: yudj@hdu.edu.cn
Natural Sciences Publishing Cor.

2 The FSMMC Approach

2.1 Definitions

Definition 1 The itemset of a vehicle passing record in a given database $D$ is denoted as a quadruple, i.e., $I_{p,t,l,d} = \langle p, t, l, d \rangle$, in which $p$ represents the vehicle plate number, $t$ represents the time of passing through monitoring sites, $l$ represents the location of monitoring sites, and $d$ represents the vehicle driving direction.

Definition 2 A sequence $S$ is an ordered list of itemsets in a period of time $D_t$, denoted as $S = \langle I_{p_1,t_1,l_1,d_1}, I_{p_2,t_2,l_2,d_2}, \ldots, I_{p_m,t_m,l_m,d_m} \rangle$.

Definition 3 The term motorcade sequence is used to represent the group of vehicles passing through the same monitoring sites in the same direction within the time interval $\delta t$, denoted as:

$$S_m = (p_1, \ldots, p_i, \ldots, p_j, \ldots, p_n) \tag{1}$$

where

$$\langle I_{p_1,t_1,l,d}, \ldots, I_{p_i,t_i,l,d}, \ldots, I_{p_j,t_j,l,d}, \ldots, I_{p_n,t_n,l,d} \rangle \subseteq S, \quad |t_i - t_j| \le \delta t, \quad i, j, n \in \{1, \ldots, m\}, \quad l \in \{l_1, \ldots, l_m\}, \quad d \in \{d_1, \ldots, d_m\}.$$

The length of $S_m$, denoted as $|S_m|$, is the number of the itemsets $S_m$ holds.
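Definitions 1 to 3 can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the names `PassingRecord` and `motorcade_pairs` are hypothetical, times are taken to be seconds, and only motorcades of length 2 are extracted.

```python
from collections import namedtuple

# Definition 1: one passing record <p, t, l, d> (field names are assumptions).
PassingRecord = namedtuple("PassingRecord", ["plate", "time", "site", "direction"])


def motorcade_pairs(records, delta_t):
    """Return the set of (plate_a, plate_b, site, direction) tuples for two
    vehicles that pass the same site in the same direction within delta_t."""
    # Group records by (site, direction), as required by Definition 3.
    by_lane = {}
    for rec in records:
        by_lane.setdefault((rec.site, rec.direction), []).append(rec)

    pairs = set()
    for (site, direction), lane in by_lane.items():
        lane.sort(key=lambda rec: rec.time)
        for i, first in enumerate(lane):
            for second in lane[i + 1:]:
                # Records are time-ordered, so later ones are even further away.
                if second.time - first.time > delta_t:
                    break
                if first.plate != second.plate:
                    a, b = sorted((first.plate, second.plate))
                    pairs.add((a, b, site, direction))
    return pairs
```

Sorting each (site, direction) group by time lets the inner loop stop at the first record outside the window, so the scan stays close to linear when traffic within any one window is sparse.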
Definition 4 Given a database $D$ storing the vehicle passing records, the support of the sequence $S_m$, or its reliability as a suspect VFAT, is denoted as $sup(S_m) = S_t \, S_l / D_t$, where $S_t$ denotes the number of occurrences of $S_m$, and $S_l$ denotes the number of monitoring sites that $S_m$ passes.

Definition 5 Given a minimum support threshold min_sup, if $sup(S_m) \ge$ min_sup, $S_m$ is then called a frequent sequence $S$, i.e., VFATs. The collection of $S$ is denoted as $L_N$ when $|S| = N$.

Definition 6 Given a database $D$ storing the vehicle passing records, all $S_m$ it holds are called candidate sequences, denoted as $C_N$ when $|S_m| = N$. In other words, $L_N = \{ C_N \mid sup(C_N) \ge \text{min\_sup} \}$.

Definition 7 In the multi-thread environment, the subset of $C_N$ acquired on thread $i$ is called $C_N^i$. It holds that $\bigcup_{i=1}^{n} C_N^i = C_N$, where $n$ is the total number of running threads. To distinguish different motorcade sequences in $C_N^i$, $V_{S_m}^i$ is used to represent the set of $C \in C_N^i$ whose sequential value is $S_m$.

Definition 8 Given a sequence database $D$ storing the vehicle passing records, let $T_s$ be the serial sequence mining time with a single-core processor, and let $T(q)$ be the parallel sequence mining time with $q$-core processors. The speed-up is then defined as $S(q) = T_s / T(q)$.

2.2 The FSMMC Approach

The FSMMC approach is designed to be executed on a shared memory system. It partitions the workload into independent tasks, but assumes that the whole dataset is accessible to all threads. In this way, each thread runs independently through lock-free programming without the need for inter-thread communication. To take advantage of the properties of multi-core processors, the FSMMC approach is divided into three phases: 1) the global database is divided into several local datasets, one for each thread, by means of the equidistant static projection method; 2) local motorcade sequential patterns are located in each thread by local parallel reduction; 3) local motorcade patterns are dynamically combined into the frequent sequential patterns. These phases are illustrated in Figure 1.
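The three phases can be sketched as a small thread-pool pipeline. This is a minimal illustration, not the authors' JDK implementation. It assumes that candidate occurrences have already been extracted as `(plates, site)` pairs, that Definition 4 reads as the product $S_t \cdot S_l$ divided by $D_t$, and that $D_t$ is a plain number in whatever unit the analyst chooses. All function names are hypothetical. In CPython, threads do not execute bytecode in parallel, so the sketch shows the task structure rather than the speed-up.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(items, n):
    """Phase 1: split items into n contiguous chunks of (almost) equal size."""
    size = -(-len(items) // n)  # ceiling division
    return [items[i * size:(i + 1) * size] for i in range(n)]


def reduce_local(chunk):
    """Phase 2: local reduction of one chunk into per-candidate counts and sites."""
    counts, sites = Counter(), defaultdict(set)
    for plates, site in chunk:
        counts[plates] += 1
        sites[plates].add(site)
    return counts, sites


def mine_vfats(occurrences, n_threads, period, min_sup):
    """Phase 3: merge local candidates and keep those with sup >= min_sup."""
    counts, sites = Counter(), defaultdict(set)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        chunks = partition(occurrences, n_threads)
        for local_counts, local_sites in pool.map(reduce_local, chunks):
            counts.update(local_counts)
            for plates, local in local_sites.items():
                sites[plates] |= local
    # Assumed reading of Definition 4: sup = S_t * S_l / D_t.
    support = {p: counts[p] * len(sites[p]) / period for p in counts}
    return {p: s for p, s in support.items() if s >= min_sup}
```

Splitting the input into more chunks than threads would let an idle worker pick up the next pending chunk, which is closer in spirit to the dynamic task distribution described in the text.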
1) The complete database $D$ is partitioned into local databases $D_i$, and each $D_i$ is assigned to thread $i$ ($i = 1, 2, \ldots, n$) for loading. The global database $D$ is divided into $D_1, D_2, \ldots, D_n$, with $D = \bigcup_{i=1}^{n} D_i$. If there are $R$ records in $D$, then the records $R_i$ for thread $i$ are given by (2). Here, $M_j$ represents record $j$ in the local database $D_i$ and $T_p$ represents record $p$ in the global database $D$.

$$R_i = \left\{ M_j \;\middle|\; M_j = T_p,\ p = \frac{R}{n}(i-1) + j,\ p \in \left[ \frac{R}{n}(i-1),\ \frac{R}{n}\, i \right] \right\} \tag{2}$$

2) For thread $i$ ($i = 1, 2, \ldots, n$), the local database is scanned once to find all motorcade sequential patterns; where necessary, they are then reduced and stored in files. Because $D_i$ is usually too large to fit entirely in memory, FSMMC divides it further into smaller datasets. It then spawns $n$ threads, each scanning $D_i$ to get candidate sequences. Since this step is one of the most costly, we use local parallel reduction to eliminate the repetitive patterns. It is very likely that one task has a lower computational cost than all the others. Therefore, FSMMC creates a thread pool within which each thread is assigned to one pattern-searching task. Threads that finish searching first join in with other threads. In other words, FSMMC allows each thread to proceed asynchronously, which helps to save space and reduce running time.

3) Local motorcade patterns are combined in each storing file and the final frequent sequences are derived. After being processed by each thread, the reduction objects need to be merged. First, FSMMC puts the file-combining tasks in a global task list after the files have been regularly marked, making sure each task has a number corresponding to its rank. Then, every thread selects a task from the list as its own assignment and independently eliminates infrequent motorcade items. Since all threads are independent of each other, only their calculation workloads need to be balanced in order to boost performance. FSMMC repeatedly checks whether there is an idle core. If one exists, it selects a new task from the global task list and runs it. All frequent motorcade sequences have been identified when the task list becomes empty.

Fig. 1: The process of FSMMC approach.

3 Performance Evaluation

In this section, we evaluate the performance of FSMMC by checking its running time. Suppose the time we spend on the first phase, i.e., the phase where the global database $D$ is divided into $D_1, D_2, \ldots, D_n$, is $T_1$.

The time that thread $i$ ($i = 1, 2, \ldots, n$) spends on database $D_i$ to find the motorcade sequential pattern $S_{m_j}$ is $Q_{ij}$ ($j = 1, 2, \ldots, m$; $i = 1, 2, \ldots, n$). We use $m$ to indicate the number of motorcade sequences on $D_i$ and $n$ as the total number of threads. The total time thread $i$ spends on $D_i$ to find all the local sequences can then be represented as:

$$T_i = \sum_{j=1}^{m} Q_{ij}, \quad i = 1, 2, \ldots, n \tag{3}$$

Through parallel computation on multiple cores, the time for generating all the motorcade sequential patterns is:

$$T_2 = \mathrm{Max}(T_i) = \mathrm{Max}\left( \sum_{j=1}^{m} Q_{ij} \right), \quad i = 1, 2, \ldots, n \tag{4}$$

By contrast, the time for the traditional serial computational method to find the sequential patterns is equivalent to the sum of the times of all threads treated separately. Writing it as $T'_2$ to distinguish it from the parallel time $T_2$:

$$T'_2 = \sum_{i=1}^{n} T_i = \sum_{i=1}^{n} \sum_{j=1}^{m} Q_{ij}, \quad i = 1, 2, \ldots, n \tag{5}$$

In the third phase (combining local patterns to obtain all the frequent motorcade sequences), the time that thread $i$ takes is:

$$T_i = \sum_{j=q}^{k} F_j + t_i, \quad q = 1, 2, \ldots, k;\ i = 1, 2, \ldots, n \tag{6}$$

in which $F_j$ represents the processing time for file $j$, $k$ is the total number of files to combine, and $t_i$ is the system overhead for threads accessing the global task list, fetching new assignments and other system operations. Notably, $t_i \ll \sum_{j=q}^{k} F_j$. So, the time FSMMC spends on this phase by parallel processing on multiple cores is:

$$T_3 \approx \mathrm{Max}(T_i) \approx \mathrm{Max}\left( \sum_{j=q}^{k} F_j + t_i \right), \quad q = 1, 2, \ldots, k;\ i = 1, 2, \ldots, n \tag{7}$$

Relative to parallel processing, the time for traditional serial processing, written $T'_3$, approximates to:

$$T'_3 \approx \sum_{i=1}^{n} T_i = \sum_{i=1}^{n} \sum_{j=q}^{k} (F_j + t_i), \quad q = 1, 2, \ldots, k;\ i = 1, 2, \ldots, n \tag{8}$$

In conclusion, the total time with FSMMC is:

$$T = T_1 + T_2 + T_3 = T_1 + \mathrm{Max}\left( \sum_{j=1}^{m} Q_{ij} \right) + \mathrm{Max}\left( \sum_{j=q}^{k} F_j + t_i \right), \quad q = 1, 2, \ldots, k;\ i = 1, 2, \ldots, n \tag{9}$$

The total running time with the traditional serial approach is:

$$T' = T_1 + T'_2 + T'_3 = T_1 + \sum_{i=1}^{n} \sum_{j=1}^{m} Q_{ij} + \sum_{i=1}^{n} \sum_{j=q}^{k} (F_j + t_i), \quad q = 1, 2, \ldots, k;\ i = 1, 2, \ldots, n \tag{10}$$

Therefore, because $\mathrm{Max}\left( \sum_{j=1}^{m} Q_{ij} \right) \le \sum_{i=1}^{n} \sum_{j=1}^{m} Q_{ij}$ and $\mathrm{Max}\left( \sum_{j=q}^{k} F_j + t_i \right) \le \sum_{i=1}^{n} \sum_{j=q}^{k} (F_j + t_i)$, the FSMMC approach can achieve higher performance on multi-core processors.
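The comparison between the parallel total of Eq. (9) and the serial total of Eq. (10) can be checked numerically. The sketch below is an illustration only. It assumes that `Q[i]` lists the pattern-search times of thread i, that `F[i]` lists the merge times of the files handled by thread i, and that `t[i]` is that thread's overhead. The function names and all numbers are made up for the example.

```python
def parallel_time(t1, Q, F, t):
    """Eq. (9): T1 plus the slowest thread in phase 2 and in phase 3."""
    phase2 = max(sum(q_i) for q_i in Q)
    phase3 = max(sum(f_i) + t_i for f_i, t_i in zip(F, t))
    return t1 + phase2 + phase3


def serial_time(t1, Q, F, t):
    """Eq. (10): T1 plus the sum over all threads in phase 2 and phase 3."""
    phase2 = sum(sum(q_i) for q_i in Q)
    # Eq. (10) adds the overhead t_i inside the sum over files.
    phase3 = sum(f_j + t_i for f_i, t_i in zip(F, t) for f_j in f_i)
    return t1 + phase2 + phase3


# Hypothetical timings for two threads (arbitrary units).
T1 = 1
Q = [[2, 3], [4, 2]]
F = [[1, 2], [3]]
t = [0.5, 0.5]

T_parallel = parallel_time(T1, Q, F, t)  # 1 + 6 + 3.5
T_serial = serial_time(T1, Q, F, t)      # 1 + 11 + 7.5
```

Because all timings are non-negative, the maximum over threads can never exceed the sum over threads, which is the inequality the section relies on.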

4 Case Studies

4.1 Case Environments

The FSMMC approach has been successfully used in the fast identification of VFATs from massive traffic data streams. In the experiment, VFATs are defined as N suspect motorcades which pass through the monitoring sites with support over min_sup. In the testing phase, the attributes of vehicle passing records include plate number, time of passing by monitoring sites, location of monitoring sites and vehicle driving direction. We ran the test program on an Intel Core 2 processor at 2.40 GHz with 2 GB RAM running Windows XP. The databases used contained about 3,000,000 records. The FSMMC approach was implemented with JDK 1.6.

Fig. 2: VFATs found in the case where N = 2 (vehicles), $\delta t$ = 60 (seconds) and min_sup = 2.5.

4.2 Case Results

We ran the FSMMC approach on different scales of traffic streams by spawning varying numbers of threads, where each thread executed the same code for frequent sequence mining. The approach provided good extensibility by allowing the number of threads to be changed. Input datasets of the same size were used and all the results were saved in a file on hard disk to be used later. The results generated are shown in Figure 2. When min_sup is set to 2.50, the VFATs are the top 15 records.

A more detailed analysis of the average running time used to search for VFATs is illustrated in Figure 3. As shown in Figure 3, the more sequences are generated, the more calculation time is required for file reduction. However, as the number of threads increases, the increase becomes smaller, especially when the dataset has more than 1,000,000 records. On a given multi-core system, the approach can employ the resources of existing multi-core processors through multithreaded programming, leading to better results on larger volumes of data.
In order to verify the effectiveness of the FSMMC approach more intuitively, we analysed the speed-up of different threads on a four-core and a two-core processor (using the same datasets with 2,700,000 records). Figure 4 shows that the average T(2) is about 897.4 seconds in a multithreading environment from one thread to five threads, whereas T(4) is about 261.2 seconds. Thus, the average S(4)/S(2) is approximately 3.44. Furthermore, as can be seen in Figure 4, a processor with more cores can obtain more stable results.

Due to the dynamic task distribution mechanisms and local parallel reduction, the FSMMC approach reduces idle core time and the time required to combine sequences. It incorporates runtime performance characteristics and succeeds in using multi-core processor collaboration to optimize the performance of the parallel approach. This approach could therefore achieve good performance in identifying VFATs from massive traffic data streams.

Fig. 3: Running time of searching for VFATs from different scales of passing vehicles by FSMMC on a four-core processor.

Fig. 4: Running time and speed-ups on multiple cores with different thread numbers.
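The reported ratio follows directly from Definition 8: since both speed-ups share the same serial time $T_s$, S(4)/S(2) = T(2)/T(4). The short check below uses only the two averages quoted in the text; the variable names are illustrative.

```python
# Average running times reported in the text, in seconds.
t_two_core = 897.4   # T(2)
t_four_core = 261.2  # T(4)

# S(4) / S(2) = (T_s / T(4)) / (T_s / T(2)) = T(2) / T(4); T_s cancels out.
speedup_ratio = t_two_core / t_four_core

print(round(speedup_ratio, 2))
```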

5 Related Works

The efficient analysis of spatio-temporal data generated by moving vehicles is an essential requirement for intelligent transportation services. To our knowledge, such research currently focuses mainly on methods for efficiently extracting long sharable frequent routes [5, 6], or Swarms [7], but not on deliberately trailing vehicles. In contrast to the ridesharing application, the identification of VFATs involves a huge amount of data and therefore demands more mining power.

Frequent pattern mining is a core field in data mining research. Since the first solution to the problem of frequent item-set mining was presented by Agrawal et al. [8], various specialized in-memory data structures have been proposed to improve mining efficiency [9]. It has been recognized that the set of all frequent item-sets is too large to be analysed and the information they contain is therefore redundant. To remedy this, numerous works have studied parallel frequent pattern mining on clusters to improve mining efficiency [10, 11]. These works explore a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information in parallel data mining. However, the experiments showed that synchronization costs became quite large if the data distributions were skewed or the nodes were not equally capable.

Considering multi-core systems with lower inter-processor communication costs and limited off-chip bandwidth, parallel frequent pattern mining on multi-core processors was pioneered by Buehrer et al. [12, 13]. Based on the serial algorithm gSpan [14] and the similar study by Wörlein et al. [15], Buehrer et al. proposed a parallel frequent graph mining algorithm with excellent scale-up properties. Their contribution comprises an efficient way to decompose work and to explore the search space in a depth-first way. They also proposed a way to exploit temporal locality of the cache.
However, this method consumes excessive memory because of its static embedding techniques. Lucchese et al. proposed similar strategies for mining closed frequent item-sets, which contain optimizations for improving cache usage when creating conditional databases (called projections in their paper) [16]. Tatikonda et al. studied approaches to parallel frequent tree mining [17]. Their algorithm could scale up very well with the number of cores, leading to a quasi-linear speed-up on many real-world databases. However, it spends too much time on memory accesses.

The past few years have also witnessed the emergence of several novel approaches, other than the multi-core ones, for the implementation and deployment of large-scale data mining. MapReduce, which has been popularized by Google, is a scalable and fault-tolerant data processing model that enables processing a massive volume of data in parallel with many low-end computing nodes. In [18], we introduced a parallel implementation of the BIDE algorithm on MapReduce, called BIDE-MR. The experiments on an Apache Hadoop cluster show that BIDE-MR attains good parallelization. However, the approach presented in this paper is easier to implement since it effectively utilizes the multi-core structure of a single node.

6 Conclusions

This paper presents a novel approach to the fast identification of VFATs from massive traffic data streams on multi-core processors. To harness the power of multi-core processors, we use a dynamic task distribution mechanism to balance the workloads of different threads. A thread-steal happens when a task is not comparable with the cumulative cost of the other tasks. Both theoretical analysis and case studies show that the approach takes good advantage of multi-core computing platforms and has higher performance and speed-up, compared with other approaches without multi-threading.

It is notable that sequential pattern mining requires iterative scans of the sequence dataset with numerous data comparisons and analyses. In other words, it is memory intensive.
Therefore, optimizations of massive storage access are always needed. Other problems, such as how to increase the certainty of thread scheduling and how to limit the search space to further improve accuracy, still need to be studied.

Acknowledgements

The work is supported by Natural Science Foundation (No. 61472112), Natural Science Foundation of Zhejiang (No. LY12F02003), the Key Science and Technology Project of Zhejiang (No. 2012C11026-3, No. 2008C11099-1) and the open project of Zhejiang Provincial Key Laboratory of Network Technology and Information Security. The authors would also like to thank the anonymous reviewers who made valuable suggestions to improve the quality of the paper.

References

[1] Agrawal, R., Srikant, R., Mining sequential patterns. In: Proc. of ICDE, Taipei, Taiwan, Mar., 1995.
[2] Agrawal, R., Srikant, R., Fast algorithms for mining association rules. In: Proc. of VLDB, 1994, pp. 487-499.
[3] Jin, R., Yang, G., Agrawal, G., Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. IEEE Trans. on Knowl. and Data Eng., vol. 17, no. 1, 2005, pp. 71-89.
[4] Fang, W., Lu, M., Xiao, X., He, B., Luo, Q., Frequent itemset mining on graphics processors. In: DaMoN '09: Proc. of the 5th International Workshop on Data Management on New Hardware, New York, NY, USA, ACM, 2009, pp. 34-42.

[5] Gidófalvi, G., Pedersen, T.B., Mining long, sharable patterns in trajectories of moving objects. GeoInformatica, vol. 13, no. 1, 2009, pp. 27-55.
[6] Xue, G., Li, Z., Zhu, H., Liu, Y., Traffic-known urban vehicular route prediction based on partial mobility patterns. In: Proc. of the International Conference on Parallel and Distributed Systems - ICPADS, 2009, pp. 369-375.
[7] Li, Z., Ding, B., Han, J., Kays, R., Swarm: Mining Relaxed Temporal Moving Object Clusters. In: Proc. of the VLDB Endowment, vol. 3, no. 1, 2010, pp. 723-734.
[8] Agrawal, R., Imielinski, T., Swami, A., Mining association rules between sets of items in large databases. In: Proc. of SIGMOD, 1993, pp. 207-216.
[9] Goethals, B., Survey on frequent pattern mining. In: http://citeseer.ist.psu.edu/goethals03survey.html, 2003.
[10] Agrawal, R., Shafer, J. C., Parallel mining of association rules. IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, 1996, pp. 962-969.
[11] Zaki, M. J., Parthasarathy, S., Ogihara, M., Li, W., Parallel algorithms for discovery of association rules. Data Min. Knowl. Discov., vol. 1, no. 4, 1997, pp. 343-373.
[12] Buehrer, G., Parthasarathy, S., Chen, Y. K., Adaptive parallel graph mining for CMP architectures. In: Proc. of ICDM, 2006, pp. 97-106.
[13] Buehrer, G., Parthasarathy, S., Kim, D., Towards data mining on emerging architectures. In: Proc. of 9th SIAM Workshop on High Performance and Distributed Mining. Bethesda, USA, 2006.
[14] Yan, X., Han, J., gSpan: Graph-based substructure pattern mining. In: ICDM, 2002, p. 721.
[15] Wörlein, M., Meinl, T., Fischer, I., Philippsen, M., A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In: Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, 2005, pp. 392-403.
[16] Lucchese, C., Orlando, S., Perego, R., Parallel mining of frequent closed patterns: Harnessing modern computer architectures. In: Proc. of ICDM, 2007, pp. 242-251.
[17] Tatikonda, S., Parthasarathy, S., Mining Tree-Structured Data on Multicore Systems. In: Proc. of VLDB, 2009, pp. 694-705.
[18] Yu, D., Wu, W., Zheng, S., Zhu, Z., BIDE-based parallel mining of frequent closed sequences with MapReduce, LNCS 7440, 2012, pp. 177-186.

Dongjin Yu is currently a professor at Hangzhou Dianzi University and a visiting scholar of University of California, Santa Barbara. He received his BS and MS in Computer Applications from Zhejiang University in China, and PhD in Management from Zhejiang Gongshang University in China. His current research efforts include intelligent information processing, program comprehension and service computing. He is especially interested in the novel approaches to constructing large enterprise information systems effectively and efficiently by emerging advanced information technologies. He is the director of Institute of Cloud and Big Data and vice director of Institute of Intelligent and Software Technology of Hangzhou Dianzi University. He is a member of ACM and IEEE, and a senior member of China Computer Federation (CCF). He is also a member of Technical Committee of Software Engineering CCF (TCSE CCF) and a member of Technical Committee of Service Computing CCF (TCSC CCF).

Wensheng Dou is currently a postgraduate at Hangzhou Dianzi University, China. He has participated in some government-funded projects related to data management. His current research interests mainly include data mining and big data processing.

Wanqing Li received his PhD degree in mechanics of solid from Lanzhou University in China in 2007 and works as an Associate Professor in Hangzhou Dianzi University. His present interests are numerical parallel computing and data mining.

Suhang Zheng received her master's degree in computer science from Hangzhou Dianzi University in China. She has published a number of high-quality papers related to data mining. She now works for Alibaba.com.

Jianhua Shao received his bachelor's degree in Mathematics from Fudan University in China. His primary research area includes networking computing and system integration.