Data & Knowledge Engineering

Size: px

Start display at page:

Download "Data & Knowledge Engineering"

Jeffery Harrell
5 years ago
Views:

Data & Knowledge Engineering 7 (211) 17 187 Contents lists available at ScienceDirect Data & Knowledge Engineering journal hoepage: www.elsevier.

Departent of Coputer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 35 71, Republic of Korea article info abstract Article history: Received 6 March 21 Received in

1 Data & Knowledge Engineering 7 (211) Contents lists available at ScienceDirect Data & Knowledge Engineering journal hoepage: An approxiate duplicate eliination in RFID data streas Chun-Hee Lee a, Chin-Wan Chung b, a Data Analytics Group, SAIT, Sasung Electronics, Yongin, , Republic of Korea b Departent of Coputer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 35 71, Republic of Korea article info abstract Article history: Received 6 March 21 Received in revised for 16 July 211 Accepted 18 July 211 Available online 31 July 211 Keywords: Duplicate eliination RFID Bloo filter Real-tie DBs Sart cards The RFID technology has been applied to a wide range of areas since it does not require contact in detecting RFID tags. However, due to the ultiple readings in any cases in detecting an RFID tag and the deployent of ultiple readers, RFID data contains any duplicates. Since RFID data is generated in a streaing fashion, it is difficult to reove duplicates in one pass with liited eory. We propose one pass approxiate ethods based on Bloo Filters using a sall aount of eory. We first devise Tie Bloo Filters as a siple extension to Bloo Filters. We then propose Tie Interval Bloo Filters to reduce errors. Tie Interval Bloo Filters need ore space than Tie Bloo Filters. We propose a ethod to reduce space for Tie Interval Bloo Filters. Since Tie Bloo Filters and Tie Interval Bloo Filters are based on Bloo Filters, they do not produce false negative errors. Experiental results show that our approaches can effectively reove duplicates in RFID data streas in one pass with a sall aount of eory. 211 Elsevier B.V. All rights reserved. 1. Introduction Recently, due to the advanceent of inforation technology, various kinds of data such as XML, RDF, and RFID data have been generated [18,9,5,8]. Especially, a large aount of RFID data has been generated in any environents since the RFID technology does not require contact in detecting RFID tags and therefore has been used in any areas such as business, ilitary, and edical applications. The RFID adoption in Walart is a typical RFID exaple in the business area. However, the advantage of the RFID technology causes a new proble. Since an RFID tag is detected without contact, if an RFID tag is within a proper range fro an RFID reader, the RFID tag will be detected whether we want to or not. Therefore, if RFID tags stay or ove slowly in the detection region, uch unnecessary data (i.e., duplicate RFID data) will be generated. On the other hand, if RFID tags ove fast or any RFID tags ove siultaneously in the detection region, one RFID reader ay not be able to detect all of the. To prevent issing readings of RFID tags in such cases, several RFID readers are generally deployed in order to onitor a single location [1,2]. When several RFID readers detect one RFID tag at the sae tie, duplicate data is generated. Therefore, we cannot avoid the generation of duplicate RFID data in RFID applications. An intelligent RFID reader with the processing capability can eliinate duplicate RFID data generated by the RFID reader. However, duplicate RFID data generated fro ultiple readers can not be reoved with only the self-contained processing capability of RFID readers. Therefore, we need a technique to eliinate duplicate RFID data in the server (RFID iddleware) that collects RFID data fro various RFID readers. Consider the exaple in Fig. 1. There are two RFID readers and two tags pass through the detection region. In this situation, Reader1 and Reader2 detect tags with the identifier ID1 and ID2. Each reader generates the detection inforation such as btag ID, location of the reader, tien. Reader1 detects the tag with ID1 and generates RFID data bid1, Loc1, 1N, bid1, Loc1, 2N, bid1, Loc1, 4N. However, the RFID data is duplicate data except bid1, Loc1, 1N. Also, Reader1 generates RFID data bid2, Loc1, 3NbID2, Loc1, 5N for the tag with ID2 and bid2, Loc1, 5N is duplicate data. In the sae way, Reader2 generates RFID data and has duplicates as shown Corresponding author. Tel.: ; fax: E-ail addresses: chunhee1.lee@sasung.co (C.-H. Lee), chungcw@kaist.edu (C.-W. Chung) X/$ see front atter 211 Elsevier B.V. All rights reserved. doi:1.116/j.datak

2 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Reader1(Loc1) Tag: ID2 Tag: ID1 Reader2(Loc2) <ID1, Loc1, 1> <ID1, Loc1, 2> <ID2, Loc1, 3> <ID1, Loc1, 4> <ID2, Loc1, 5> <ID1, Loc2, 1> <ID2, Loc2, 2> <ID1, Loc2, 3> <ID2, Loc2, 5> Server <ID1, Loc1, 1> <ID1, Loc2, 1> <ID1, Loc1, 2> <ID2, Loc2, 2> <ID2, Loc1, 3> <ID1, Loc2, 3> <ID1, Loc1, 4> <ID2, Loc2, 5> <ID2, Loc1, 5> <ID1, Loc1, 1> <ID2, Loc2, 2> Fig. 1. An exaple for duplicate RFID data eliination. in the lower part of Fig. 1. This is because a reader detects a tag continuously within the detection region. RFID data generated in each reader is sent to the server. Then, there are ore duplicates in the server since two readers ay detect the sae tag. For exaple, bid1, Loc1, 1N in Reader1 and bid1, Loc2, 1N in Reader2 are duplicate. Therefore, the server reoves duplicate RFID data as preprocessing and the result for preprocessing is bid1, Loc1, 1NbID2, Loc2, 2N. This proble sees siple for a sall aount of RFID data. However, the volue of RFID data is generally very big. Also, if a tag is attached to each ite, the aount of data generated in a large retailer will exceed terabytes in a day [13]. And, RFID data is produced in a strea. It eans that a duplicate eliination ethod should process data instantly with liited eory. It is difficult to design an exact duplicate eliination ethod. As an alternative, we can use approxiate duplicate eliination techniques for RFID applications that do not require exact answers. In such an application, we can consider a real-tie analysis application for the oveent of custoers in a large departent store. Each store in the departent store has RFID readers. Each custoer has a unique RFID tag. The anager wants the real-tie analysis of the oveent of custoers such as the nuber of custoers in each store and the store which has the axiu nuber of custoers. For such an analysis, the central server in the departent store should reove duplicates. In this environent, so uch RFID data coes into the server siultaneously and there are duplicates. Especially, when a custoer stays at the sae location for a long tie, it generates a large nuber of unnecessary duplicate data. In order to eliinate duplicates exactly, we need to keep all RFID data including duplicates in eory during a long period. Therefore, it is difficult to eliinate duplicates exactly in real-tie using a sall aount of eory relative to the aount of RFID data. In this application, the anager does not feel uncofortable even if statistics with allowable errors are provided. Therefore, we propose approxiate RFID duplicate eliination techniques in one pass with the liited eory. Bloo Filters have been widely used as a very copact data structure with an allowable error. To anage RFID data streas with a sall aount of eory, we devise Bloo Filter based approaches. However, since Bloo Filters are targeted for static data, we should adapt Bloo Filters to RFID data strea environents. We thus propose Tie Bloo Filters as a straightforward adaptation of Bloo Filters. Since Tie Bloo Filters are based on Bloo Filters, they do not generate false negatives which are duplicate data contained in the result after filtering. However, they ay generate false positives that are non-duplicate data which are not contained in the result after filtering. We provide the false positive probability for Tie Bloo Filters. Also, to reduce false positives for Tie Bloo Filters, we devise Tie Interval Bloo Filters using the concept of the interval. Our contributions are as follows: The Effective Duplicate Eliination Methods. In data strea environents, it is not easy to design one pass duplicate eliination algorith with liited eory. To design such an algorith, we adapt Bloo Filters to RFID data strea environents. We propose effective approxiate duplicate eliination ethods, Tie Bloo Filters and Tie Interval Bloo Filters. They can eliinate duplicates in one pass with a sall aount of eory. Space Optiization for Tie Interval Bloo Filters. While the Tie Bloo Filter stores one tie field, the Tie Interval Bloo Filter stores two tie fields as a tie interval. Therefore, the Tie Interval Bloo Filter needs ore space than the Tie Bloo Filter. We devise a ethod to reduce space for the Tie Interval Bloo Filter. The Forulation of a Duplicate Eliination Proble. Though any papers ention the duplicate eliination proble in RFID data, they do not forulate it rigorously. In this paper, we forulate the duplicate eliination proble in RFID data streas forally. Paraeter Setting. In Tie Bloo Filters and Tie Interval Bloo Filters, the nuber of hash functions k affects errors. We propose a forula to find k and validate it using an experiental evaluation Organization The rest of this paper is organized as follows. In Section 2, we present related work. In Section 3, we explain Bloo Filters as a preliinary, and in Section 4, we foralize the duplicate eliination proble in RFID data streas. We describe Tie Bloo Filters and Tie Interval Bloo Filters in Sections 5 and 6, respectively. In Section 7, we discuss paraeter setting in Tie Bloo Filters and Tie Interval Bloo Filters. We easure the effectiveness of Tie Bloo Filters and Tie Interval Bloo Filters experientally in Section 8. Finally, we conclude our work in Section 9.

3 172 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Related work As RFID tags have been fabricated at a very low cost, an enorous volue of RFID tags will be used in various places (e.g., retail stores, hospitals) in the near future. Therefore, a vast aount of RFID data will be generated. Chawathe et al.[8] discuss challenges for anaging such an enorous volue of RFID data. Also, Bornhovd et al.[5] give an overview of the Auto-ID infrastructure (Device Layer/Device Operation Layer/Business Process Bridging Layer/Enterprise Application Layer) which integrates data fro sart ites (e.g., RFID, sensor) with existing business processes. RFID-based ite-tracking applications are typical RFID applications. To support ite-tracking applications, soe work has been done [14,13,3]. In[14], the bitap datatype and its operations are introduced. By sharing the prefix of EPC, 1 a collection of EPCs is copactly represented in the bitap datatype. However, [14] does not focus on query processing in an RFID-based ite tracking. Gonzalez et al. [13] propose a new warehousing odel, RFID-Cuboid, for RFID data copression and path-dependent aggregates. Also, to trace RFID tags efficiently, Ban et al. [3] propose the Tie Paraeterized Interval R-Tree. Cao et al.[6] present a distributed RFID strea processing syste with the inference for tracking and onitoring. In addition, the works in [12,16,17] provide a tool to analyze a large aount of RFID data in the warehouse environent. Gonzalez et al. [12] propose FlowCube to analyze coodity flows at ultiple levels. Lee and Chung [16,17] devise an effective path encoding schee to answer path oriented queries. As raw RFID data obtained fro RFID readers isses readings and includes duplicate readings, there has been soe work on RFID data cleansing [1,15,21]. Since RFID data fro various sites is sent to the server, there exists heavy traffic in the network. In [1], to reduce network traffic, the edge processing is proposed. In the edge processing, data aggregation and soothing, filtering, and grouping are perfored as early as possible. The teporal soothing filter that infers issing readings within a fixed tie window is a general ethod to correct issing readings. However, it is difficult for users to decide the best tie window size and the soothing filter with a fixed tie window size does not perfor well due to the dynaics of RFID flows. Thus, to correct issing readings, Shawn et al.[15] use an adaptive window size according to the characteristics of RFID flows. In order to adapt the window size, they apply statistical sapling. Generally, data cleansing for RFID data streas is perfored in a preprocessing step. However, Rao et al.[21] propose a deferred data cleansing ethod which is perfored during query execution. One of the ost general and iportant cleaning of RFID data is to reove duplicate RFID data. There has been traditional research such as hash-based algoriths and sorting-based algoriths in duplicate eliination [11]. However, since those algoriths focus on duplicate eliination in stored data, they need to scan data any ties. Thus, it is difficult to apply traditional duplicate eliination techniques to RFID data streas. To eliinate duplicates for streaing data, approxiate algoriths are proposed [2,1]. In[2], Metwally et al. use Bloo Filters for detecting duplicates in click streas. However, if click streas are generated continuously, all entries of the Bloo Filter will becoe 1 in the long run and then the Bloo Filter will be useless. In [1], Deng et al. solve the above proble using Stable Bloo Filters. In order to reove old data in a streaing environent, the Stable Bloo Filter sets cells corresponding to an input data to the axiu value and decreases values of randoly selected cells whenever data arrives. However, Stable Bloo Filters have false positive errors as well as false negative errors. In addition, the paraeter setting in Stable Bloo Filters is coplex and its solution is not always optial. In [23], Wang et al. provide a ethod to reove duplicates over distributed data streas. The eager approach and the lazy approach are proposed to share Bloo Filters between the coordinator and the reote sites. To eliinate duplicate RFID data, HashMerge is proposed in [2]. In HashMerge, a hash structure is used for an efficient duplicate eliination. However, it does not deal with the deletion operation which reoves unnecessary data in the hash table. Also, the proble of RFID duplicate eliination is entioned in [1,21,22] but they do not propose algoriths to eliinate duplicates. The authors in [19] use count Bloo Filters to reove duplicate RFID data. However, they consider duplicate eliination only at the reader level. Therefore, RFID readers should preprocess RFID data and send the preprocessed data to the server. In the case of RFID readers without the self-contained coputing power, the approach cannot be applied. Also, the critical disadvantage of the approach is that it generates both false positives and false negatives. While the above approaches focus on eliinating duplicate RFID data, [7] focuses on preventing the generation of duplicate RFID data. Duplicate RFID data generated in the sae reader can be reoved easily with the self-contained processing capability. However, duplicate RFID data generated in different readers cannot be reoved with only the self-contained processing capability of the readers. To prevent the generation of duplicate RFID data in different readers, [7] deals with the redundant reader eliination. By reoving the redundant readers which read the sae tag siultaneously, [7] prevents the generation of duplicate RFID data. However, they assue writable tags on which soe inforation can be written. Therefore, the ethod in [7] cannot be applied to the general cases. 3. Preliinary Bloo Filters were originally proposed in [4]. The goal of Bloo Filters is to test whether any eleent is contained in a given set S using a sall aount of eory. Although it is easy to test the ebership of any eleent, if a set contains a large nuber of eleents, we have to keep all eleents of the set in eory and thus need a large aount of eory. Bloo Filters test the ebership with allowable errors using a sall aount of eory copared to the given set. 1 Electronic Product Code(EPC) is a coding schee of RFID tags to identify the uniquely.

4 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Field Raw RFID Data Strea Filtering Module Non-duplicate RFID Data Strea Applications Duplicate RFID Data Server Drop Duplicate RFID Data Fig. 2. The syste architecture. The Bloo Filter is based on a hash coding. It consists of an array of bits and k hash functions (h 1,h 2,,h k ). We assue that k hash functions address eleents to range {,, 1} with a unifor distribution and are independent fro each other. Initially, all cells of the Bloo Filter are set to. To represent a large nuber of eleents in a set S using the Bloo Filter, the Bloo Filter sets cells, which correspond to h 1 (a), h 2 (a),,h k (a), to 1 for each eleent a in S. Since one cell ay be set to 1 by different eleents, the Bloo Filter tests the ebership of any eleent with errors. To test whether an eleent a is contained in a set S using the Bloo Filter, we check k cells which correspond to h 1 (a),h 2 (a),, h k (a). If the value of at least one cell aong k cells is, it is obvious that a is not in S. In this case, the Bloo Filter reports that a is not contained in S. If all values of k cells are 1, a is probably in S. The Bloo Filter reports that a is contained in S. In this case, there ay exist false positive errors that report a is in S although it is not in S. However, in the Bloo Filter, there do not exist false negative errors that report that a is not in S although a is in S. When the nuber of hash functions is k, the nuber of cells is, and the nuber of eleents in a set is n, the false positive probability f is f = kn k [4].Ifis large enough, 1 1 kn e kn =. Thus, f (1 e kn/ ) k. When n and are given, Bloo Filters can evaluate k to iniize f using derivatives. When k = ðln2þ, f is iniized. Therefore, Bloo Filters set k n to ðln2þ in order to iniize false positive errors. n 4. Proble stateent The syste architecture that we consider is depicted in Fig. 2. The RFID data strea generated in the field coes into the server. There are the Filtering Module and Applications in the server. Before the raw RFID data strea is oved to applications, it passes through the filtering odule. With the filtering odule, duplicate RFID data is reoved fro the raw RFID data strea. Applications receive non-duplicate RFID data in a strea. That is, in the filtering odule, duplicate RFID data is iediately dropped fro the raw RFID data strea. RFID data is generated at any RFID readers continuously. RFID readers send RFID data to the server. Hence, the server receives a sequence of RFID data in a strea. RFID data consists of a tag identifier (i.e., EPC), the location of the RFID reader, and the detected tie of the tag. We define an RFID data strea forally as follows: Definition 1. An RFID data strea S is a set {s 1,,s n }, 2 where s i consists of a triple (TagID,Loc,Tie) such that "TagID" is the electronic product code (EPC) of the tag and used for identifying the tag uniquely. "Loc" is the location of the reader which detects the tag. "Tie" is the tie of detecting the tag. In an RFID data strea, RFID data x is considered as a duplicate if there exists y( x) in the RFID data strea such that x. TagID=y. TagID, and x. Tie y. Tie τ, where τ is an application-specific positive value [2,21,22]. Fro the definition of the duplicate, we can find a non-duplicate RFID data strea by eliinating duplicates. However, in a case that RFID data with the sae TagID is generated repeatedly at intervals that are less than or equal to τ, finding a non-duplicate RFID data strea is confusing. For exaple, consider an RFID data strea S={s 1,s 2,s 3 }, where s 1 =(tag1, loc1,5), s 2 =(tag1,loc1,1), and s 3 =(tag1,loc1,15) (τ=8). Fro S, we know that s 2 is a duplicate because of s 1, and s 3 is a duplicate because of s 2. However, S arrives at the server in sequence. If s 2 arrives at the server, we can first reove s 2 fro {s 1,s 2 }. In that case, if s 3 arrives at the server, s 3 is not a duplicate since there does not exist s 2. 2 Strictly speaking, an RFID data strea S is a sequence bs 1,,s n N but, in this section, we consider an RFID data strea as a set because eleents in an RFID data strea contain the tie inforation and a sequence is generated in tie order.

5 174 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) (a) Tie Bloo Filter (b) Tie Interval Bloo Filter Fig. 3. The structures of the Tie Bloo Filter and the Tie Interval Bloo Filter. Depending on the property of applications, a non-duplicate data strea for S can be defined differently. In this paper, a nonduplicate RFID data strea for S is considered as {s 1 } instead of {s 1,s 3 } since data with the sae TagID that is generated repeatedly at intervals that are less than or equal to τ is generally useless. We forulate the duplicate RFID data eliination proble forally with soe definitions. Definition 2. For RFID data x and y, x and y are connected in S if there exists z S such that x.tagid=y.tagid, z.tagid=x.tagid, x.tie z.tie τ and z.tie y.tie τ. x and y are connected in S also if x and z are connected in S and z and y are connected in S. Connected is defined in order to reove duplicate RFID data of the sae TagID that is generated repeatedly at intervals that are less than or equal to τ. If an RFID data strea contains two eleents which have the sae TagID and Tie, one is an obvious duplicate. In this case, we consider the latter eleent in a sequence order as a duplicate. We focus on the eleents with different Tie to identify a duplicate. Using Definition 2, we define the duplicate forally in Definition 3. Definition 3. RFID data x is a duplicate in S if there exists any y( x) S such that (x.tagid=y.tagid and x.tie y.tie τ) or (x and y are connected in S and y.tie x.tie). The non-duplicate RFID data strea for an RFID data strea S is an RFID data strea after reoving all duplicate RFID data in S. We can define the non-duplicate RFID data strea as the duplicate-free axial set. The duplicate-free set and the duplicate-free axial set are defined as follows: Definition 4. A set S ( S) is called a duplicate-free set in S if there does not exists x S such that x is a duplicate in S. Definition 5. The set S ( S) is a duplicate-free axial set in S if S is a duplicate-free set in S and for all x S S, x is a duplicate in S. We want to find the duplicate-free axial set S (i.e., a non-duplicate RFID data strea) for an RFID data strea S. However, it is difficult to find the duplicate-free axial set in data streas with a sall aount of space. Since soe applications allow errors, we consider a ethod to find an approxiation of the duplicate-free axial set. Especially, we liit allowable errors as false positive errors in this paper. Therefore, our proble is defined as follows: Given space and a assive RFID data strea S, find a duplicate-free set Ŝ using space such that j Ŝ S j=js j is iniized, where S is the duplicate-free axial set in S. 5. Tie Bloo Filters To eliinate duplicates in RFID data streas, we first devise a siple approach based on Bloo Filters, called the Tie Bloo Filter. Fro Definition 3, we know that RFID data x ay not be a duplicate according to x.tie even if there exists RFID data with the sae TagID in an RFID data strea. Therefore, we can use the tie inforation in detecting RFID duplicates. While each entry in the Bloo Filter is set to zero or one, each entry in the Tie Bloo Filter is set to the detected tie of an RFID tag. That is, the Tie Bloo Filter uses an integer array instead of a bit array. The Tie Bloo Filter is depicted in Fig. 3(a). The Tie Bloo Filter uses k independent hash functions (h 1,h 2,,h k ) with range {,, 1} like the Bloo Filter. The value of the i-th cell in the Tie Bloo Filter is denoted by M[i]. In order to store RFID data x in the Tie Bloo Filter, we find k cells that correspond to h 1 (x.tagid),,h k (x.tagid). The Tie Bloo Filter then sets k cells to the detected tie of RFID data x (i.e., x.tie). If the cells are already set to the detected tie of the previous RFID data, the Tie Bloo Filter overwrites it with the detected tie of the current RFID data. 3 To know whether soe RFID data x is a duplicate, we check k cells corresponding to h 1 (x.tagid),,h k (x.tagid). If there exists at least onecellsuchthatx.tie M[h i (x.tagid)]nτ, we are sure that RFID data x is not a duplicate since duplicate data exists within τ tie. In the 3 We assue RFID data streas arrive at the server in tie order for convenience.

6 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Fig. 4. Duplicate eliination algorith of the Tie Bloo Filter. Tie Bloo Filter, we can regard the condition x. Tie M[h i (x. TagID)] τ as 1 in the Bloo Filter and the condition x. Tie M[h i (x. TagID)]Nτ as in the Bloo Filter. Note that the evaluation of the condition of the Tie Bloo Filter changes as tie goes by. Fig. 4 shows the algorith to eliinate duplicates in RFID data streas using the Tie Bloo Filter. The algorith can find the duplicate-free set with a sall aount of eory. All cells in the Tie Bloo Filter are initially set to. When RFID data x arrives at the server, x passes through the Tie Bloo Filter. By the Tie Bloo Filter, x will be filtered. In other words, x will be dropped or sent to the application. First, when x passes through the Tie Bloo Filter, we check the condition x.tie M[h i (x.tagid)] τ for all i {1,2,,k} to know whether x is a duplicate or not (Line 1). Since all cells are initially set to, if the value of at least one cell is, x is not a duplicate regardless of the condition. If the condition x.tie M[h i (x.tagid)] τ is satisfied and the value of M[h i (x.tagid)] is not for all i {1,2,,k}, x will be probably a duplicate. Then, we drop it (Line 2). Otherwise, we send the non-duplicate data to the application (Lines 3 4). Finally, we update cells to x.tie whenever x is a duplicate or not (Line 5). If we update cells only when that data is not a duplicate, we cannot reove duplicate data of the sae TagID that is generated repeatedly at intervals that are less than or equal to τ. Exaple 1. Fig. 5 shows the state of the Tie Bloo Filter after an RFID data strea S={s 1,s 2,s 3 } passes through the Tie Bloo Filter, where s 1 =(ID1, Loc1, 1), s 2 =(ID2, Loc2, 12), and s 3 =(ID1, Loc1, 13). Suppose hash functions are as shown in Fig. 5, the size of an array in the Tie Bloo Filter is 8, k is 3, and τ is 1. To explain the Tie Bloo Filter, we assue h 2 (ID1)=h 1 (ID2). Consider how the Tie Bloo Filter operates for an RFID data strea S={s 1,s 2,s 3 }. When s 1 arrives at the Tie Bloo Filter, it checks whether s 1 is a duplicate or not. Since all values in M[],M[5],and M[2] are initially, s 1 is not a duplicate. Therefore, s 1 is sent to the application. The Tie Bloo Filter then sets M[],M[5],and M[2] to 1 in Fig. 5(a). When s 2 passes through the Tie Bloo Filter, it is sent to the application since it is not a duplicate. The Tie Bloo Filter sets M[5],M[7],and M[3] to 12 as shown in Fig. 5(b). In the case of M[5], the Tie Bloo Filter sets it to 12 although 1 was set previously. When s 3 passes through the Tie Bloo Filter, it is not a duplicate since 13 M[]Nτ and 13 M[2]Nτ (M[], and M[2] are values in Fig. 5(b)). s 3 is sent to the application and the Tie Bloo Filter sets M[], M[5], and M[2] to 13 in Fig. 5(c). In this exaple, since all eleents are not duplicates, the application receives all data. We can derive the false positive probability for the Tie Bloo Filter in a siilar way as the Bloo Filter [4]. We provide the false positive probability for the Tie Bloo Filter in Theore 1. (a) (b) (c) Fig. 5. The state of the Tie Bloo Filter in Exaple 1.

7 176 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Theore 1. The false positive probability for the Tie Bloo Filter is kn k, where k is the nuber of hash functions, is the nuber of cells, and n is the nuber of non-duplicate RFID data within τ tie. Proof. In order to derive the false positive probability, we assue that a non-duplicate RFID data strea coes to the Tie Bloo Filter. In real applications, RFID data has any duplicates. However, duplicate RFID data sets the sae cells in the Tie Bloo Filter since they have the sae TagID. Also, they set cells to a siilar tie since they are usually tie clustered. Therefore, the state of the Tie Bloo Filter after passing through the original RFID data is siilar to that of the Tie Bloo Filter after passing through non-duplicate RFID data reoving duplicates fro the original RFID data. Therefore, it is reasonable to assue that non-duplicate RFID data coes to the Tie Bloo Filter. Consider, in the past, soe RFID data passed through the Tie Bloo Filter. Now, for any eleent x, we want to know the false positive probability. Since duplicates are defined within τ tie, data that passed through the Tie Bloo Filter before τ tie is eaningless. Thus, we consider only data within τ tie with respect to x. Tie. Let p 1 be the probability that x.tie M[h i (x.tagid)] τ for 1 i k and p 2 be the probability that x is a duplicate. Then, The false positive probability for x = p 1 p 2 To derive p 1,wefirst derive the probability that x.tie M[u] τ for soe u. We assue that hash functions have unfor distributions. For RFID data y within τ tie, the probability that h i (y.tagid) u for soe i is 1 1. The probability that h i (y.tagid) u for all i {1,2,,k} is 1 1 k. Since n is the nuber of RFID data which coes to the Tie Bloo Filter within τ tie, the probability of x.tie M[u]Nτ is 1 1 k n. The probability of x.tie M[u] τ is kn. Therefore, the probability p1 that x.tie M[h i (x.tagid)] τ for all i {1,2,,k} is kn k. h 1 (x.tagid) h 2 (x.tagid) h 3 (x.tagid) h 4 (x.tagid) Start Tie End Tie M[] 1 25 M[1] 2 M[2] 3 15 M[3] 1 3 M[4] 5 3 M[5] 15 M[6] 1 1 M[7] intersection (a) The case where the intersection is not epty h 1 (x.tagid) h 2 (x.tagid) h 3 (x.tagid) h 4 (x.tagid) Start Tie End Tie M[] 7 M[1] 5 15 M[2] 1 3 M[3] 13 3 M[4] 5 25 M[5] 1 2 M[6] 1 1 M[7] intersection (b) The case where the intersection is epty Fig. 6. An exaple for the intersection of tie intervals in the Tie Interval Bloo Filter.

8 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Since we assue that a non-duplicate RFID data strea coes to the Tie Bloo Filter, p 2 is equals to. Therefore, the false positive probability for the Tie Bloo Filter is kn! k. 6. Tie interval bloo filters In the previous section, we introduced Tie Bloo Filters. However, since they are a siple extension of Bloo Filters, we devise Tie Interval Bloo Filters to iprove Tie Bloo Filters. Since detection regions are distributed with proper ranges and RFID tags are detected only within or around detection regions, duplicate RFID data is location-clustered, and therefore tieclustered. That is, if soe RFID tag is detected, RFID data with the sae TagID will be detected any ties during a period. Thus, we can represent the range of detected ties of RFID data with the sae TagID as a short tie interval. Fig. 3(b) shows the structure of the Tie Interval Bloo Filter. The start tie and the end tie of the i-th cell are denoted by M[i].StartTie and M[i].EndTie, respectively. StartTie and EndTie are initially set to and 1, respectively. Consider how the Tie Interval Bloo Filter keeps StartTie and EndTie in a cell. For an easy explanation, we assue that the nuber of hash functions is 1 and one cell in the Tie Interval Bloo Filter is set by only one TagID. When RFID data x with TagID=1 arrives at the Tie Interval Bloo Filter first, we set both StartTie and EndTie of the cell corresponding to TagID=1 to x. Tie. That is, the initial tie interval is a tie point. When another RFID data y with TagID=1 arrives at the Tie Interval Bloo Filter, we change only EndTie of the cell corresponding to TagID=1 to y. Tie. Thus, the Tie Interval Bloo Filter keeps the tie interval corresponding to TagID=1 in the cell. For RFID data x, if x. Tie EndTie is ore than τ, x. Tie StartTie is also ore than τ. Such a tie interval is useless. In this case, we set both StartTie and EndTie to x.tie again to initialize the tie interval. Since one cell ay be set by various RFID data with different TagID, the tie interval for each TagID ay be ixed. However, the interval [StartTie, EndTie] in the cell includes detected ties of all RFID data with TagID corresponding to the cell. To know whether soe RFID data x is a duplicate, we check whether the intersection of all tie intervals corresponding to h 1 (x),h 2 (x),,h k (x) is epty. If soe RFID data with TagID=x.TagID arrived at the Tie Interval Bloo Filter, all tie intervals corresponding to x.tagid should have included the detected tie of that data and the intersection would have not been epty. Therefore, when the intersection is epty, we are sure that any RFID data with TagID=x.TagID did not arrive at the Tie Interval Bloo Filter within τ tie. Fig. 6 illustrates how to deterine whether soe RFID data x is a duplicate. The coordinate on the right side of the Tie Interval Bloo Filter of Fig. 6 represents the tie intervals corresponding to RFID data x. In Fig. 6(a), the intersection for four tie intervals is [1,15]. Therefore, it is not epty and x is considered as a duplicate. In Fig. 6(b), the intersection for four tie intervals is epty. Therefore, the Tie Interval Bloo Filter reports that x is not a duplicate. Fig. 7 shows the algorith to eliinate duplicates using the Tie Interval Bloo Filter. If RFID data x arrives at the Tie Interval Bloo Filter, we first check whether x is a duplicate or not. To do this, we check whether the intersection of intervals Fig. 7. Duplicate eliination algorith of the Tie Interval Bloo Filter.

9 178 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Tie h 1 (ID1) M[] 11 h 2 (ID1) M[1] h 3 (ID1) M[2] 11 h 1 (ID2) M[3] M[4] 18 h 1 (ID4) h 2 (ID2) M[5] 11 h 2 (ID4) h 3 (ID2) M[6] M[7] 15 h 3 (ID4) h 1 (ID3) M[8] M[9] h 2 (ID3) M[1] h 3 (ID3) M[11] 18 M[12] 15 M[13] Fig. 8. The state of the Tie Bloo Filter in Exaple 2. corresponding to h 1 (x),h 2 (x),,h k (x) is not epty (Lines 1 7). If the intersection (intersectinterval in Line 6) is located in the tie before x. Tie τ (i.e., x. Tie intersectinterval. EndTieNτ), we regard it as epty since duplicates are defined within τ tie. Therefore, when intersectinterval is not epty and x. Tie intersectinterval. EndTie τ, x is a duplicate and is dropped (Lines 7 8). Otherwise, x is sent to the application (Line 9 1). We then update StartTie and EndTie (Lines 11 2). If StartTie= (a case where data corresponding to the cell is first inserted) or x. Tie EndTieNτ, we initialize both StartTie and EndTie to x. Tie (Lines 13 17). Otherwise, we update EndTie and extend tie intervals (Lines 18 19). Exaple 2. Consider an RFID data strea {(ID1, Loc1, 1), (ID1, Loc1, 11), (ID2, Loc2, 14), (ID3, Loc3, 15), (ID2, Loc4, 17), (ID2, Loc2, 18)}. Before explaining the operations of the Tie Interval Bloo Filter, we explain those of the Tie Bloo Filter. Fig. 8 shows the state of the Tie Bloo Filter after the RFID data strea passes through the Tie Bloo Filter. Suppose hash functions are as shown in Figs. 8 and 9, the size of an array in filters is 8, k is 3, and τ is 1. To explain the Tie Interval Bloo Filter, we assue h 1 (ID1)=h 2 (ID4), h 1 (ID2)=h 1 (ID4), and h 3 (ID3)=h 3 (ID4). If (ID4,Loc4,2) passes through the Tie Bloo Filter in Tie h 1 (ID1) M[] 1 11 h 2 (ID1) M[1] h 3 (ID1) M[2] 1 11 h 1 (ID2) M[3] M[4] h 1 (ID4) h 2 (ID2) M[5] 1 11 h 2 (ID4) h 3 (ID2) M[6] M[7] h 3 (ID4) h 1 (ID3) M[8] M[9] h 2 (ID3) M[1] h 3 (ID3) M[11] M[12] M[13] Fig. 9. The state of the Tie Interval Bloo Filter in Exaple 2.

10 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Fig. 8, the Tie Bloo Filter reports that it is a duplicate since 2 M[h 1 (ID4)] τ, 2 M[h 2 (ID4)] τ and 2 M[h 3 (ID4)] τ. However, it is not a duplicate. That is, it is a false positive. Meanwhile, the Tie Interval Bloo Filter keeps the start tie and the end tie. Fig. 9 shows the state of the Tie Interval Bloo Filter after the RFID data strea above passes through the Tie Interval Bloo Filter. Consider the case where (ID4, Loc4, 2) passes through the Tie Interval Bloo Filter in Fig. 9. The intersection of three intervals corresponding to ID4 (i.e., [1,11], [14,18], [15]) is epty. Therefore, the Tie Interval Bloo Filter reports that it is not a duplicate and it is really not a duplicate. For the Tie Interval Bloo Filter, it is difficult to derive the false positive probability since the false positive probability ust be coputed fro the intersection of the tie intervals. Instead of deriving the false positive probability for the Tie Interval Bloo Filter, we provide Theore 2. Fro Theore 2, we know that the Tie Interval Bloo Filter is better than the Tie Bloo Filter in ters of the false positive probability. Theore 2. Consider the Tie Bloo Filter and the Tie Interval Bloo Filter which have the sae nuber of cells and the sae configuration for hash functions. Then, for any RFID data streas, the nuber of false positives of the Tie Interval Bloo Filter is less than or equal to that of the Tie Bloo Filter. Proof. Assue that the sae RFID data strea coes to the Tie Bloo Filter and the Tie Interval Bloo Filter. Since false positives are generated only when the Tie Bloo Filter or the Tie Interval Bloo Filter reports that a given data is a duplicate, we prove that for any eleent x, if the Tie Interval Bloo Filter reports that x is a duplicate, then the Tie Bloo Filter also reports that x is a duplicate. Suppose that the Tie Interval Bloo Filter reports that x is a duplicate. Then, intersectinterval ϕ and x:tie intersectinterval:endtie τ: ð1þ Since intersectinterval is not epty in the Tie Interval Bloo Filter, all lines corresponding to x in the Tie Interval Bloo Filter (i.e., [s i,e i ] for 1 i k) include intersectinterval. Therefore, e i intersectinterval:endtie for 1 i k ðalso; s i intersectinterval:starttieþ ð2þ See how to update cells in the Tie Bloo Filter and the Tie Interval Bloo Filter (i.e., Line 5 in Fig. 4 and Lines 11 2 in Fig. 7). We can easily know that e i in the Tie Interval Bloo Filter is equal to i in the Tie Bloo Filter, where e i is defined in Line 4 of Fig. 7 and i is M[h i (x.tagid)] in the Tie Bloo Filter. Thus, by the Eq. (2) i = e i intersectinterval:endtie By ultiplying 1 in both sides i intersectinterval:endtie By adding x.tie in both sides, x:tie i x:tie intersectinterval:endtie Since x. Tie intersectinterval. EndTie τ by Eq. (1), x:tie i τ That is, all conditions in Line 1 of Fig. 4 are satisfied. Therefore, the Tie Bloo Filter reports that x is a duplicate Space optiization If the nuber of cells of the Tie Bloo Filter is equal to that of the Tie Interval Bloo Filter, the Tie Interval Bloo Filter requires twice as uch space as the Tie Bloo Filter. However, since only the tie interval within τ tie is useful, we can optiize space for the Tie Interval Bloo Filter. We keep StartTie and EndTie StartTie(DiffTie) instead of StartTie and EndTie as shown in Fig. 1. Since the space for DiffTie is less than that for EndTie, we can reduce the space. However, DiffTie ay be ore than τ while the algorith to eliinate duplicates using the Tie Interval Bloo Filter in Fig. 7 runs. Therefore, we revise the algorith in order to keep DiffTie less than or equal to τ. Using the revised algorith in Fig. 11, we can always keep DiffTie less than or equal to τ. We will explain it later. StartTie and EndTie require uch space in reality copared with DiffTie, because they record in ost cases onth, date, hour, inute, and second. Since τ is a sall value, we do not need uch space to store DiffTie.

11 18 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) M[] M[1] M[2] StartTie EndTie StartTie DiffTie M[] M[1] M[2] (a) Tie Interval Bloo Filter (b) Tie Interval Bloo Filter with space optiization Fig. 1. Space optiization in the Tie Interval Bloo Filter. The duplicate eliination algorith of the Tie Interval Bloo Filter with space optiization is shown in Fig. 11. All StartTie and DiffTie are initially set to and 1, respectively. Like the algorith TIBF(), the algorith TIBF_OPT() first checks whether x is a duplicate or not, then it drops x or sends x to the application (Line 1 1). In TIBF_OPT(), EndTie can be calculated easily fro StartTie+DiffTie (Line 4). To keep DiffTie less than or equal to τ, TIBF_OPT() update cells differently fro TIBF() (Lines 11 28). When StartTie= or x. Tie (StartTie+DiffTie) Nτ, we set StartTie to x. Tie and DiffTie to in order to initialize tie intervals in the Tie Interval Bloo Filter (Lines 13 17). Otherwise, we change StartTie and DiffTie. In the case that x. Tie StartTieNτ, the part of the tie interval is useless. To reove the useless part, we set StartTie to x.tie τ and DiffTie to τ (Line 22 23). If x.tie StartTie τ, we set only DiffTie to x. Tie StartTie (Lines 25 26). Fig. 11. Duplicate eliination algorith of the Tie Interval Bloo Filter with space optiization.

12 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) Probability of detection Major detection region Minor detection region Distance between tag and reader Fig. 12. The detection odel. 7. Paraeter setting The false positive probability of Bloo Filters is kn k and k is set to ðln2þ n to iniize false positive errors, where is the nuber of cells and n is the nuber of eleents in a set. Siilarly, we can find k for Tie Bloo Filters to iniize false positive errors. The false positive probability of Tie Bloo Filters is kn k, where n is the nuber of nonduplicate RFID data within τ tie. Therefore, when k = ðln2þ, the false positive probability is iniized. n In this paper, we do not provide the false positive probability for Tie Interval Bloo Filters. Instead, we provide Theore 2. By Theore 2, if the Tie Bloo Filter and the Tie Interval Bloo Filter have the sae nuber of cells and the sae configuration for hash functions, the Tie Interval Bloo Filter is better than the Tie Bloo Filter in ters of false positive errors. If we set k in the Tie Interval Bloo Filter to ðln2þ like the Tie Bloo Filter, the Tie Bloo Filter and the Tie Interval Bloo Filter n have the sae configuration for hash functions. Therefore, when we set k in the Tie Interval Bloo Filter to ðln2þ,we n guarantee that the Tie Interval Bloo Filter is better than the Tie Bloo Filter if the Tie Bloo Filter and the Tie Interval, where n is the nuber of non- Bloo Filter have the sae nuber of cells. Thus, for both Tie Bloo Filters and Tie Interval Bloo Filters, we set k to duplicate RFID data within τ tie. 8. Experients ðln2þ n In order to easure the effectiveness of our approaches, we conducted an experiental evaluation. There are soe works which deal with the duplicate RFID data eliination proble, however, to the best of our knowledge, our work is the first to systeatically deal with an approxiate duplicate eliination in RFID data streas. Therefore, we copared our two approaches, the Tie Bloo Filter and the Tie Interval Bloo Filter. Also, we included Bloo Filters to observe their behavior when applied to the RFID data strea. We experiented on a Pentiu 3GHz PC with 1 GB ain eory using Java Data sets Since there does not exist a well-known RFID data set, we ade a synthetic data generator siilar to [15]. To siulate the detection of an RFID tag, we used the detection odel in [15]. If the distance between a tag and an RFID reader is far enough, the tag is not detected. As the tag approaches the RFID reader, the tag is detected with the probability proportional to the distance. The region is called the inor detection region as shown in Fig. 12 [15]. When the tag and the RFID reader are in a close distance, the detection probability is constant. The region is called the ajor detection region. Tag generation at the departure Reader Tag reoval at the destination Tag Reader Fig. 13. The tag generation odel.

13 182 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) BF TBF TIBF processing rate (unit: 1 tuples/sec) The nuber of tuples (unit: 1,,) Fig. 14. The processing rate for SData according to the nuber of tuples. 3 BF TBF TIBF processing rate (unit: 1 tuples/sec) The nuber of tuples (unit: 1,,) Fig. 15. The processing rate for MData according to the nuber of tuples. Since our goal is to test whether duplicate data is eliinated effectively, we assue that tags ove in a straight line as shown in Fig. 13. Tags are generated at only the departure and then ove in a straight line with the velocity that is assigned randoly when tags are generated. We assign the sae velocity for tags generated at the sae tie since tags generally ove together in real applications. If a tag reaches the destination, the tag will disappear. Also, in a straight line, there are any detection locations and there ay exist ultiple readers at detection locations in order to iprove the detection ratio. We generate 1 7,2 1 7,3 1 7,4 1 7,5 1 7, and tuples both in the case where there exists a single RFID reader at a detection location and in the case where there exist three RFID readers at a detection location. We call the forer data SData and the latter data MData. MData has ore duplicate data than SData Experiental results We evaluate the Tie Bloo Filter and the Tie Interval Bloo Filter 4 with respect to a processing rate and an error rate. We show experiental results according to the nuber of tuples in Section and those according to the space size in Section The Tie Interval Bloo Filter in this section uses space optiization.

14 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) BF TBF TIBF.8 Error Rate The nuber of tuples (unit: 1,,) Fig. 16. The error rate for SData according to the nuber of tuples Experients according to the nuber of tuples Figs. 14 and 15 show the processing rate according to the nuber of tuples. A processing rate is the nuber of tuples that are processed in a unit tie. We fix the aount of allowable eory to bits. The Bloo Filter (BF) shows the best processing rate aong all approaches for SData as shown in Fig. 14 since it has the siplest structure. In the Bloo Filter, the processing rate increases when the nuber of tuples increases. k in the Bloo Filter decreases when the nuber of tuples increases because k is set to ðln2þ n.ifk is sall, we will set and scan a few cells in the Bloo Filter. Thus, in the Bloo Filter, the processing rate increases as the nuber of tuples increases. However, in the Tie Bloo Filter (TBF) and the Tie Interval Bloo Filter (TIBF), since k is not affected by the nuber of tuples, the processing rate is constant. In MData (Fig. 15), the processing rate shows a siilar result as SData (Fig. 14). The processing rates in the Tie Bloo Filter and the Tie Interval Bloo Filter are constant and that in the Bloo Filter increases as the nuber of tuples increases. Also, in the case of the Bloo Filter, the processing rate for MData is less than that for SData because k for MData is larger than k for SData. Note that k is set to ðln2þ n, where is the nuber of cells and n is the nuber of distinct eleents in a set. To validate the effectiveness of the Tie Bloo Filter and the Tie Interval Bloo Filter, we evaluate the error rate according to the the nuber of tuples. The error rate of the Bloo Filter is uch worse than those of other approaches for both SData (Fig. 16) and MData (Fig. 17) as we expected. The error rate of the Tie Interval Bloo Filter is the best aong all approaches and is less 1 BF TBF TIBF.8 Error Rate The nuber of tuples (unit: 1,,) Fig. 17. The error rate for MData according to the nuber of tuples.

15 184 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) BF TBF TIBF processing rate (unit: 1 tuples/sec) Space size (unit: 1,, bits) Fig. 18. The processing rate for SData according to the space size. than.7% in all cases. The error rate of the Tie Bloo Filter is a little worse than that of the Tie Interval Bloo Filter. The error rates of all approaches increase as the nuber of tuples increases Experients according to the space size We evaluate the processing rate and the error rate according to space size given the nuber of tuples is Figs. 18 and 19 show how the processing rate changes as the space size increases. In the Bloo Filter, the Tie Bloo Filter, and the Tie Interval Bloo Filter, since k is set proportional to the space size, the processing rate shows a tendency to decrease as the space size increases. While the processing rate of the Bloo Filter for SData (Fig. 18) is uch better than other approaches, the processing rate of the Bloo Filter for MData (Fig. 19) is a little better than the Tie Bloo Filter and the Tie Interval Bloo Filter. This is because k of the Bloo Filter for MData increases uch ore than that for SData copared to the Tie Bloo Filter and the Tie Interval Bloo Filter. The error rate according to the space size is shown in Figs. 2 and 21. In the Bloo Filter, the error rate stays alost constant as the space increases since ost of entries in the Bloo Filter are set to 1. However, the error rate of the Tie Bloo Filter, the Tie 5 BF TBF TIBF processing rate (unit: 1 tuples/sec) Space size (unit: 1,, bits) Fig. 19. The processing rate for MData according to the space size.

16 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) BF TBF TIBF.8 Error Rate Space size (unit: 1,, bits) Fig. 2. The error rate for SData according to the space size. Interval Bloo Filter decreases considerably as the space increases. As shown in Figs. 2 and 21, the error rate of the Tie Interval Bloo Filter is better than other approaches in any cases. Since the Bloo Filter cannot change the cell dynaically, the error rate is high even for the case of the sall nuber of tuples Optial k Like the Bloo Filter, k affects the error rate of the Tie Bloo Filter and the Tie Interval Bloo Filter. To investigate whether the setting of k proposed in Section 7 is valid, we evaluate the error rate as k changes fro 1 to 1. Figs. 22 and 23 show the error rate according to k given that space size is bits and the nuber of tuples is InFigs. 22 and 23, the thick circle eans the k value evaluated by the forula in Section 7 and the double circle eans the k value evaluated by applying the forula of Bloo Filters. In Figs. 22 and 23, all k values corresponding to the thick circles are considered good choices in ters of error rates. However, the error rates corresponding to the double circles are uch worse than the optial error rates. We evaluated the error rate according to k for various space sizes and data sizes. As a result, we conclude that the proposed k value in Section 7 is considered as a good choice in ost cases although it is not always optial. 1 BF TBF TIBF.8 Error Rate Space size (unit: 1,, bits) Fig. 21. The error rate for MData according to the space size.

17 186 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) TBF TIBF.25.2 Error Rate k Fig. 22. The error rate for SData according to k. 9. Conclusion Generally, RFID data streas include nuerous duplicate data. Since RFID data is produced in a strea, it is difficult to eliinate duplicate RFID data in one pass with a sall aount of eory. Thus, we propose approxiate duplicate eliination ethods, Tie Bloo Filters and Tie Interval Bloo Filters, based on Bloo Filters. In the Tie Bloo Filter, we replace a bit array in the Bloo Filter with an array of tie inforation since the definition of duplicates in RFID data is related to the detected.16 TBF TIBF.12 Error Rate k Fig. 23. The error rate for MData according to k.

C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) 17 187 187 tie. Also, to reduce false positives, the Tie Interval Bloo Filter uses a tie interval.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea governent (MEST) (No. 211-377).

18 C.-H. Lee, C.-W. Chung / Data & Knowledge Engineering 7 (211) tie. Also, to reduce false positives, the Tie Interval Bloo Filter uses a tie interval. Finally, experiental results show that the proposed approaches can reove duplicate RFID data in one pass with a sall error. Acknowledgents We would like to thank the editor and anonyous reviewers. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea governent (MEST) (No ). References [1] Manage data successfully with RFID anywhere edge processing, [2] Y. Bai, F. Wang, P. Liu, Efficiently filtering RFID data streas, VLDB Workshop on Clean Databases, 26. [3] C. Ban, B. Hong, D. Ki, Tie paraeterized interval R-tree for tracing tags in RFID systes, DEXA, 25, pp [4] B.H. Bloo, Space/tie trade-offs in hash coding with allowable errors, Counication of the ACM 13 (7) (197) [5] C. Bornhövd, T. Lin, S. Haller, J. Schaper, Integrating autoatic data acquisition with business processes experiences with SAP's Auto-ID infrastructure, VLDB, 24, pp [6] Z. Cao, C. Suttony, Y. Diao, P. Shenoy, Distributed inference and query processing for RFID tracking and onitoring, PVLDB 4 (5) (211) [7] B. Carbunar, M.K. Raanathan, M. Koyutürk, C. Hoffann, A. Graa, Redundant reader eliination in RFID systes, IEEE SECON, 25, pp [8] S.S. Chawathe, V. Krishnaurthy, S. Raachandran, S. Sara, Managing RFID data, VLDB, 24, pp [9] A. Chebotko, S. Lu, X. Fei, F. Fotouhi, RDFProv: a relational RDF store for querying and anaging scientific workflow provenance, Data & Knowledge Engineering 69 (8) (21) [1] F. Deng, D. Rafiei, Approxiately detecting duplicates for streaing data using stable bloo filters, SIGMOD, 26, pp [11] H. Garcia-Molina, J.D. Ullan, J. Wido, Database Syste Ipleentation, Prentice Hall International, Inc., Upper Saddle River, New Jersey, 2, p [12] H. Gonzalez, J. Han, X. Li, FlowCube: constructuing RFID FlowCubes for ulti-diensional analysis of coodity flows, VLDB, 26, pp [13] H. Gonzalez, J. Han, X. Li, D. Klabjan, Warehousing and Analyzing Massive RFID Data Sets, ICDE, 26. [14] Y. Hu, S. Sundara, T. Chora, J. Srinivasan, Supporting RFID-based ite tracking applications in Oracle DBMS using a bitap datatype, VLDB, 25, pp [15] S.R. Jeffery, M.N. Garofalakis, M.J. Franklin, Adaptive cleaning for RFID data streas, VLDB, 26, pp [16] C.-H. Lee, C.-W. Chung, Efficient storage schee and query processing for supply chain anageent using RFID, SIGMOD, 28, pp [17] C.-H. Lee, C.-W. Chung, RFID data processing in supply chain anageent using a path encoding schee, IEEE Transactions on Knowledge and Data Engineering (TKDE) 23 (5) (211) [18] J. Lu, X. Meng, T.W. Ling, Indexing and querying XML using extended Dewey labeling schee, Data & Knowledge Engineering 7 (1) (211) [19] H. Mahdin, J. Abawajy, An approach to filtering duplicate RFID data streas, Counications in Coputer and Inforation Science 124 (21) [2] A. Metwally, D. Agrawal, A.E. Abbadi, Duplicate detection in click streas, WWW, 25, pp [21] J. Rao, S. Doraisway, H. Thakkar, L.S. Colby, A deferred cleansing ethod for RFID data analytics, VLDB, 26, pp [22] F. Wang, P. Liu, Teporal anageent of RFID data, VLDB, 25, pp [23] X. Wang, Q. Zhang, Y. Jia, Efficiently filtering duplicates over distributed data streas, International Conference on Coputer Science and Software Engineering (CSSE), 28, pp Chun-Hee Lee received a PhD degree in coputer science fro the Korea Advanced Institute of Science and Technology (KAIST), Korea, in 21. His research interests include sensor network, strea data anageent, and graph databases. Chin-Wan Chung received a B.S. degree in electrical engineering fro Seoul National University, Korea, in 1973, and a Ph.D. degree in coputer engineering fro the University of Michigan, Ann Arbor, USA, in Fro 1983 to 1993, he was a Senior Research Scientist and a Staff Research Scientist in the Coputer Science Departent at the General Motors Research Laboratories (GMR). Since 1993, he has been a professor in the Departent of Coputer Science at the Korea Advanced Institute of Science and Technology (KAIST), Korea. He was in the progra coittees of ajor database and Web conferences including ACM SIGMOD, VLDB, IEEE ICDE, and WWW. He was an associate editor of ACM TOIT, and is an associate editor of WWW Journal. He will be the general chair of WWW214. His current research interests include the seantic Web, social networks, obile Web, sensor networks and strea data anageent, and ultiedia databases.

Real-Time Detection of Invisible Spreaders

Real-Time Detection of Invisible Spreaders Real-Tie Detection of Invisible Spreaders MyungKeun Yoon Shigang Chen Departent of Coputer & Inforation Science & Engineering University of Florida, Gainesville, FL 3, USA {yoon, sgchen}@cise.ufl.edu Abstract