HISTOGRAMS are an important statistic reflecting the

Size: px

Start display at page:

Download "HISTOGRAMS are an important statistic reflecting the"

Theodora Benson
5 years ago
Views:

1 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST D 2 HistoSketch: Disciminative and Dynamic Similaity-Peseving Sketching of Steaming Histogams Dingqi Yang, Bin Li, Laua Rettig, and Philippe Cudé-Mauoux Abstact Histogam-based similaity has been widely adopted in many machine leaning tasks. Howeve, measuing histogam similaity is a challenging task fo steaming histogams, whee the elements of a histogam ae obseved one afte the othe in an online manne. The eve-gowing cadinality of histogam elements ove the data steams makes any similaity computation inefficient in that case. To tackle this poblem, we popose in this pape D 2 HistoSketch, a similaity-peseving sketching method fo steaming histogams to efficiently appoximate thei Disciminative and Dynamic similaity. D 2 HistoSketch can fast and memoy-efficiently maintain a set of compact and fixed-size sketches of steaming histogams to appoximate the similaity between histogams. To povide high-quality similaity appoximations, D 2 HistoSketch consides both disciminative and gadual fogetting weights fo similaity measuement, and seamlessly incopoates them in the sketches. Based on both synthetic and eal-wold datasets, ou empiical evaluation shows that ou method is able to efficiently and effectively appoximate the similaity between steaming histogams while outpefoming state-of-the-at sketching methods. Compaed to full steaming histogams with both disciminative and gadual fogetting weights in paticula, D 2 HistoSketch is able to damatically educe the classification time (with a 7500x speedup) at the expense of a small loss in accuacy only (about 3.25%). Index Tems Similaity-Peseving Sketching, Histogams, Steaming Data, Concept Dift, Disciminative Weighting 1 INTRODUCTION HISTOGRAMS ae an impotant statistic eflecting the empiical distibution of data. They have been widely used not only as a popula data analysis and visualization tool, but also as an impotant featue fo measuing similaities between data instances, such as colo histogams fo images o wod histogams fo documents. As a esult, histogam-based similaity measues have been extensively exploited in many classification and clusteing tasks and fo vaious application domains, including image pocessing [1], document analysis [2], social netwok analysis [3], and business intelligence [4]. Despite its impotance in machine leaning, computing histogam-based similaities is often difficult in pactice, paticulaly fo data steams. In this study, we conside steaming histogams, whee the elements of a histogam ae obseved ove a data steam as shown in Fig. 1(a) in Section 3. Steaming histogams can be used fo a wide ange of applications, such as solving ange queies and similaity seach in a steaming database, change detection and classification ove data steams. In pactice, steaming histogams ae often seen when online o offline businesses obseve thei customes activity data. Fo example, a Point of Inteest (POI), such as a supemaket o a estauant, may obseve a continuous data steam of visits fom its customes and conside to analyze the histogam of its customes visits. By Dingqi Yang, Laua Rettig and Philippe Cudé-Mauoux ae with the Depatment of Infomatics at the Univesity of Fiboug, Switzeland, {dingqi.yang, laua.ettig, philippe.cude-mauoux}@unif.ch. Bin Li is with School of Compute Science, Fudan Univesity, Shanghai, China, libin@fudan.edu.cn. Manuscipt eceived xxx; evised xxx. measuing the similaity between two POIs based on such histogams, one can build vaious high-quality applications. Fo example, semantic place labeling [4] infes a POI s type based on the assumption that two POIs shaing simila histogams of thei customes pobably belong to the same type. Howeve, it is challenging to measue the similaity between such steaming histogams in pactice, due to the eve-inceasing cadinality of the histogam elements ove time. In the above example, this coesponds to the case of an eve-gowing numbe of customes. The monotonically inceasing size of the steaming histogams makes any similaity computation inefficient, which futhe makes leaning algoithms impactical. To solve this poblem, similaity-peseving data sketching (hashing) techniques [5] have been intensively studied in steam data pocessing [6], [7]. Thei key idea is to maintain a set of compact and fixed-size sketches fo the oiginal data to appoximate thei similaity unde a cetain measue. In the cuent liteatue, most existing data sketching techniques [8], [9], [10], [11] conside the case of steaming data instances, whee complete data instances ae eceived one by one fom a data steam (e.g., a steam of images whose colo histogam can be easily deived). In contast, a steaming histogam assumes that the elements of a histogam descibing an individual data instance ae continuously eceived in abitay ode fom a data steam (e.g., the histogam of customes visits to a POI), which depats fom classical techniques that focus on sketching complete data instances. Theefoe, these methods cannot be efficiently applied fo sketching steaming histogams. In this pape, we tackle the similaity-peseving data

2 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST sketching poblem fo steaming histogams. Specifically, an efficient similaity-peseving sketching method fo steaming histogams should allow fo fast and memoy-efficient maintenance of the sketches. Fast maintenance equies sketches of steaming histogams to be incementally updatable. In othe wods, the new sketch of a steaming histogam should be incementally computed fom the fome sketch and the newly aived element. Moeove, memoyefficient maintenance equies the sketching method to ceate a small and bounded memoy ovehead when computing the sketches, which diffes fom existing sketching methods that equie a lage set of andom vaiables as in-memoy paametes [11], [12], [13], [14], whee the size of these paametes is popotional to the cadinality of the histogam elements. In addition, to maintain high-quality similaitypeseving sketches, the following two issues should be consideed when measuing similaities. Fist, as histogam elements ae not all equally impotant when measuing histogam-based similaity, disciminative similaity should be consideed, which efes to the similaity that impoves the disciminative capability of some classification/clusteing methods [15]. Specifically, in the case of labeled histogams, a histogam element appeaing only in the histogams of a specific label has moe disciminative capability than one appeaing unifomly in all histogams. Taking the example of semantic place labeling whee we want to classify POIs accoding to thei customes visiting pattens, it means that visits fom uses having stonge pefeences on visiting a specific type of POIs ae moe disciminative. Fo static datasets, such a disciminative similaity can be easily computed using vaious featue weighting methods [16] to give a highe weight to moe disciminative histogam elements. Howeve, it is not staightfowad to incopoate such a disciminative similaity in sketching steaming histogams, whee disciminative weights have to be updated ove time, and moe impotantly, to be incopoated in the sketches. Second, as a common poblem in data steams, concept dift should also be taken into account fo steaming histogams, whee the undelying distibution of a steaming histogam changes ove time in unfoeseen ways. Taking the example of customes visits to POIs, the custome population of a estauant may change abuptly if the estauant changes its type (e.g., fom a Japanese estauant to a pizzeia), o gadually if it updates its menu. It is theefoe citical to conside this issue in sketching steaming histogams. The most common appoach to handle concept dift is fogetting the outdated data [17]. A typical solution is gadual fogetting, whee the steaming data ae associated with weights invesely popotional to thei age [18]. Taking exponential decay [19] as an example, the weight of a histogam element deceases by a weight decay facto evey time when a new element is eceived fom the data steam. Although such a weighting pocess is easy to implement when building a histogam fom its steaming elements, it is not staightfowad to incopoate such weights dynamically in sketching steaming histogams. This poblem becomes even moe challenging when futhe consideing the equiement of incementally updatable sketching. To addess the above challenges, we intoduce D 2 HistoSketch, a similaity-peseving sketching method fo steaming histogams to appoximate thei Disciminative and Dynamic similaity. D 2 HistoSketch is designed to efficiently maintain a set of compact and fixed-sized sketches ove steaming histogams to appoximate the similaity between the histogams. Specifically, to measue the similaity between histogams, ou method focuses on nomalized min-max similaity, which has been poven to be an effective similaity measue fo nonnegative data in vaious application domains [11]. To ceate a sketch fom a histogam, we boow the idea fom consistent weighted sampling [20] that was oiginally poposed fo appoximating minmax similaity fo complete data instances. In addition, we fomally deive a memoy-efficient sketching method with few in-memoy paametes. To efficiently maintain the sketch ove the steaming histogam elements, we fist adjust the oiginal sketch to seamlessly incopoate both disciminative weights and gadual fogetting weights, and then incementally compute the new sketch based on the adjusted sketch and the incoming histogam element. Ou main contibutions can be summaized as follows: To the best of ou knowledge, this is the fist wok consideing the disciminative and dynamic similaitypeseving sketching ove steaming histogams. We design an efficient similaity-peseving sketching method fo steaming histogams, D 2 HistoSketch, which allows fo fast and memoy-efficient maintenance of the sketches, whee the sketches can be incementally updated with a small and bounded memoy ovehead. To povide high-quality similaity-peseving sketches fo downsteam tasks, D 2 HistoSketch consides both disciminative similaity that impoves the disciminative capability of the sketches fo some classification/clusteing methods, and dynamic similaity that adapts to concept dift by gadually fogetting outdated histogam elements. We empiically evaluate ou method on multiple classification tasks using both synthetic and eal-wold datasets. Ou esults show that D 2 HistoSketch is able to efficiently appoximate disciminative and dynamic similaity fo steaming histogams. Compaed to full steaming histogams, D 2 HistoSketch is able to damatically educe the classification time (with a 7500x speedup) at the expense of a modest loss in accuacy (about 3.25%). Compaed to ou pevious wok HistoSketch [14], in this pape, we fist fomally deive a memoy-efficient sketching method with fewe in-memoy paametes, which also allows fo faste sketch maintenance. We then add disciminative similaity fo bette similaity measuement. In paticula, the new expeiments we pesent show that compaed to ou pevious wok ou newly poposed D 2 HistoSketch achieves both a highe classification accuacy (with a 2.78% impovement on aveage) as well as faste sketch maintenance (with a 5.9% impovement on pocessing speed). The est of the pape is oganized as follows. Section 2 pesents the elated wok. Section 3 fist pesents the peliminay of ou wok. Aftewads, we pesent the poposed D 2 HistoSketch method in Section 4. The expeimental evaluation is shown in Section 5. We conclude ou wok in Section 6.

3 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST RELATED WORK As a key statistical tool in empiical data analysis, histogams have been widely used not only as a popula visualization of empiical data distibution [21], but also as a featue to measue data similaity that is futhe exploited in many machine leaning tasks [1], [2], [3], [22], [23]. Although the histogam of a static dataset can often be easily computed, it is pactically difficult to compute histogams fo data steams with typically unknown cadinality and which thus equie an unbounded amount of memoy to maintain the histogam. In this context, count sketch [24] and count-min sketch [25] and othe online histogam building methods [26] wee poposed to appoximate the fequency table of elements (i.e., histogams) fom a data steam with a fixed-size data stuctue. Howeve, the esulting sketches do not peseve the similaity between diffeent data steams. This pape diffes fom the objective of the above sketching methods by addessing the poblem of similaity-peseving sketching of data steams. Similaity-peseving sketching [5] has been extensively studied to efficiently appoximate the similaity of high dimensional data, such as gaphs [27], images [28], [29] and videos [30], [31]. Its basic idea is to maintain a set of compact sketches of the oiginal high dimensional data to efficiently appoximate thei similaities, such as Jaccad [8], [9], cosine [10], and min-max [4], [11], [12], [13], [20], [32] similaities. These sketches can then enable many applications, paticulaly fo infomation etieval systems like image o document seach engines [28]. Howeve, most of the existing methods ae designed to sketch complete data instances, which ae fundamentally diffeent fom steaming histogams, whee histogams ae incementally built fom the steams of its elements. Moe impotantly, little attention has been given on studying high-quality similaitypeseving sketches, on which we put a paticula focus in this pape by consideing both disciminative and dynamic similaities. Disciminative similaity efes to the similaity that impoves the disciminative capability of some classification/clusteing methods [15]. It has been widely used in many machine leaning applications. Fo example, disciminative similaity has been shown to be able to significantly impove the pefomance of vaious compute vision and patten ecognition tasks [33]. In pactice, disciminative similaity can be computed using featue weighting techniques, whee a similaity function is paameteized with disciminative weights [16] (e.g., entopy weights [34]). Although it is easy to implement disciminative similaity fo static datasets, it is not staightfowad to apply it to similaity-peseving sketching of steaming histogams, as such disciminative weights have to be dynamically ecomputed ove time, and moe impotantly, to be incopoated in the sketches. Dynamic adaptation to concept dift is a common poblem in steaming data pocessing. It efes to the case whee the undelying statistical popeties of the steaming data change ove time (often in unknown ways), which futhe degades the pefomance of leaning algoithms [35]. Accoding to a ecent suvey on concept-dift, the most popula appoach to handling data steams with unknown dynamics is fogetting outdated data [17]. Existing solutions can be classified into two categoies. Fist, the abupt fogetting appoach selects a set of data fo leaning. A sliding window is often used to select ecent data. Although this appoach is effective against abupt difts (in tems of the data statistical popeties), it is less applicable to gadual difts [36] as it gives the same significance to all selected pieces of data while completely discading all othe data. Second, the gadual fogetting appoach assigns weights that ae invesely popotional to the age of the data [18], such as exponential decay weights [19]. In this study, we advocate the gadual fogetting appoach to tackle the concept-dift poblem in steaming histogams. The histogam is built with weighted elements fom the steams, with weights deceasing ove time. Diffeent fom existing methods using gadual fogetting against concept dift, we conside incopoating the gadual fogetting appoach in similaitypeseving sketching of steaming histogams. Compaed to ou pevious wok HistoSketch [14], this pape makes the following impovements. Fist, we fomally deive a memoy-efficient sketching method D 2 HistoSketch with few in-memoy paametes, while ou pevious wok HistoSketch equies a lage set of andom vaiables as in-memoy paametes fo sketching, whee the size of these paametes is popotional to the cadinality of the histogam elements. Second, we conside both disciminative and dynamic similaity in D 2 HistoSketch, while HistoSketch only consides the dynamic adaptation to concept dift. Finally, we conduct new expeiments to futhe validate D 2 HistoSketch; in paticula, compaed to ou pevious wok HistoSketch, ou newly poposed D 2 HistoSketch achieves both a highe classification accuacy (with a 2.78% impovement on aveage) as well as faste sketch maintenance (with a 5.9% impovement on pocessing speed). 3 PRELIMINARY AND PROBLEM FORMULATION In this section, we fist fomally intoduce steaming histogams and then min-max similaity, followed by ou poblem definition of similaity-peseving sketching. 3.1 Steaming Histogam We conside a steaming histogam computed ove a data steam of its elements x t whee t N indicates the ode of the obseved element in the steam. Element x t E ae obseved one by one, whee E efes to the set of all histogam elements obseved in any steaming histogams so fa, which expands ove time with new (unseen) histogam elements. In othe wods, E can be egaded as the union of histogam elements fom all histogams obseved so fa. Due to the steaming natue, the cadinality E is unknown and continuously inceases ove time. A classical histogam can then be epesented as a vecto V N E, whee each value encodes the cumulative count of the coesponding histogam elements i E, i.e., = t 1 x t=i, whee 1 cond is an indicato function which is equal to 1 when cond is tue and 0 othewise. Fig. 1(a) illustates an example of a classical steaming histogam. To measue the disciminative and dynamic similaity between steaming histogams, we build the histogam V

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4 fom its weighted elements.

1 Steaming Histogam with Disciminative Weights Each histogam element i is associated with a disciminative weight wi d.

Taking the histogam of customes visits to a POI as an example, a POI can be associated with a categoy (label l) fast food estauant, and L efes to all possible POI categoies hee.

(We note that ou appoach is not limited to any specific weighting function.

4 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST fom its weighted elements. Specifically, we assign a disciminative weight and a gadual fogetting weight to each steaming element to measue disciminative and dynamic similaity between steaming histogams, espectively Steaming Histogam with Disciminative Weights Each histogam element i is associated with a disciminative weight wi d. Specifically, we assume that each histogam is associated with a label l, whee l L and L efes to a set of labels fo histogams. Taking the histogam of customes visits to a POI as an example, a POI can be associated with a categoy (label l) fast food estauant, and L efes to all possible POI categoies hee. To compute the disciminative weight wi d fo each histogam element i, we leveage a widely used weighting function, i.e., entopy weighting [34]. (We note that ou appoach is not limited to any specific weighting function.) Specifically, we use entopy weighting to empiically measue the uncetainty of the labels when obseving individual histogam elements. Fo each histogam elements i in E, we compute its entopy weight wi d as follows: wi d l L P [l, i] log P [l, i] = 1 + (1) log L whee P [l, i] is the pobability that a histogam with element i is labeled as l, l L. Highe values of wi d imply highe degees of disciminability fo the coesponding histogam element i. To calculate P [l, i], we maintain a label fequency vecto F l of size E fo each label to ecod the cumulative fequency of each histogam element in E. With each incoming histogam element i, we update Fi l accodingly. Thus, we ae able to empiically compute F P [l, i] = l i at any time. Paticulaly, fo each incoming histogam element i, only the coesponding wi d needs l L F i l to be updated. In pactice, F l is maintained in a count-min sketch data stuctue fo memoy efficiency (we will discuss it in Section 4.4). Subsequently, we compute the histogam V R E >0 such that is associated with its weighted wi d, i.e., = t wd i 1 x t=i. Fig. 1(b) shows an example of a steaming histogam with disciminative weights Steaming Histogam with Gadual Fogetting Weights Each steaming element x t is associated with a gadual fogetting weight w g t, which is invesely popotional to its age. To compute w g t, we adopt the exponential decay weight [19], which is computed as follows: w g t = e λ(tn t) (2) whee t n is the ode of the latest histogam element eceived fom the steam and λ is the weight decay facto. Subsequently, we compute the histogam V R E >0 such that is the weighted cumulative count of the coesponding histogam elements i, i.e., = t wg t 1 xt=i. Fig. 1(c) shows an example of a steaming histogam with gadual fogetting weights. (a) Classical histogam (unweighted) = t 1 x t =i (b) Histogam with disciminative weights = t wd i 1 x t =i (c) Histogam with gadual fogetting weights = t wg t 1 x t =i (d) Histogam with both weights = t wd i wg t 1 x t =i Fig. 1. Illustation of the steaming histogams with diffeent weights. Left: The diffeent histogam elements ae assigned diffeent colos, and thei heights indicate the coesponding weights. Right: The same elements ae accumulated to build the coesponding histogam Steaming Histogam with Both Weights We combine both the disciminative and the gadual fogetting weights to compute the histogam, = t wd i w g t 1 xt=i. Fig. 1(d) shows an example of a steaming histogam with both weights. 3.2 Nomalized Min-Max Similaity To measue the similaity between steaming histogams, we esot to nomalized min-max similaity, which has been shown to be an effective similaity measue fo nonnegative data. Moe pecisely, Li [11] conducted an extensive study on min-max similaity (named as min-max kenel in [11]) by compaing fou kenels including linea kenel, min-max kenel, nomalized-min-max kenel and intesection kenel, on diffeent classification tasks ove a sizable collection of public datasets. The esults illustate the advantages of the (nomalized) min-max kenels. Fomally, given two steaming histogams V a and V b, the min-max similaity is defined as follows: i E min(v a Sim MM (V a, V b i ) =, b ) i E max( a, b) (3) As a histogam is often used to chaacteize the empiical data distibution, we apply the sum-to-one nomalization befoe computing the similaity: i E V a i = 1, i E V b i = 1. (4) In that way, Eq. 3 becomes the nomalized min-max similaity, denoted by Sim NMM.

5 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST Poblem Fomulation Fo steaming histogams, the eve-inceasing cadinality E makes the computation of the nomalized min-max similaity become inefficient. Theefoe, we popose to maintain two sketches S a and S b of size K (K E ) fo V a and V b, espectively, with the popety that thei collision pobability (i.e., P [Sj a = Sj b ], whee j = 1, 2,..., K) is exactly the nomalized min-max similaity between V a and V b : P [Sj a = Sj b ] = Sim NMM (V a, V b ) (5) Then, the nomalized min-max similaity between V a and V b can be appoximated by the Hamming similaity between S a and S b. The computation ove S, which is compact and of fixed size, is much moe efficient than the one ove the full histogam V, which is a lage, eve-gowing vecto. The poblem tackled by this pape is how to ceate and maintain the above similaity-peseving sketch S fo the steaming histogam V with both disciminative and gadual fogetting weights in a fast and memoy-efficient manne. 4 D 2 HISTOSKETCH Ou D 2 HistoSketch is designed to efficiently maintain a set of compact and fixed-size sketches fo steaming histogams with both disciminative and gadual fogetting weights, in ode to efficiently appoximate thei nomalized minmax similaities. In this section, we fist pesent consistent weighted sampling, which inspied ou method. We then descibe ou method fo sketch ceation, followed by the poposed incemental sketch update pocess. 4.1 Consistent Weighted Sampling Consistent weighted sampling was oiginally poposed to appoximate min-max similaity fo complete and high dimensional data (e.g., a vecto of lage size) [11], [12], [13], [20]. The basic idea is to geneate data samples such that the pobability of dawing identical samples fo a pai of vectos is equal to thei min-max similaity. A set of such samples can then be egaded as a sketch of the input vecto. The fist consistent weighted sampling method [20] was designed to handle intege vectos. Specifically, taking a classical histogam V N E as an example, it fist uses a andom hash function h j to geneate independent and unifom distibuted andom hash values h j (i, f) fo each (i, f), whee i E and f {1, 2,..., }, and then etuns (i j, f j ) = agmin i E,f {1,2,...,} h j (i, f) as one sample (i.e., one sketch element S j ). Note that the andom hash function h j depends only on (i, f), and maps (i, f) uniquely to h j (i, f). By applying K independent andom hash functions (j = 1, 2,..., K), we geneate sketch S (of size K) fom V (of abitay size). Following this pocess, the collision pobability between two sketch elements (i a j, f a j ) and (i b j, f b j ), which ae geneated fom V a and V b, espectively, is poven to be exactly the min-max similaity of the two vectos [20], [32]: P [(i a j, fj a ) = (i b j, fj b )] = Sim MM (V a, V b ) (6) To impove the efficiency of the above method and allow eal vectos as input, Ioffe [12] late poposed an impoved method. Its key idea is that, athe than geneating diffeent andom hash values (whee has to be an intege), it diectly geneates one hash value a i,j (and its coesponding f N, f ) fo each i by taking as the input of the andom hash value geneation pocess. In such a case, can be any positive eal numbe. Based on this method, Li [11] futhe poposed to simplify the sketch by only keeping i j athe than (i j, f j ), and empiically poved the following popety: P [i a j = i b j ] P [(i a j, fj a ) = (i b j, fj b )] (7) A shot desciption of the method poposed in [11] is pesented in the following. To geneate one sketch element S j (sample i j ), the method fist daws thee andom vaiables offline as in-memoy paametes: i,j Gamma(2, 1), c i,j Gamma(2, 1) and β i,j Unifom(0, 1), and then computes ( y i,j = exp ( i,j log V )) i + β i,j β i,j i,j (8) c i,j a i,j = y i,j exp( i,j ) (9) The sketch element is then etuned as S j = agmin i E a i,j. Please efe to [11], [12] fo moe details and fo the poof of Eq. 6 and 7. In this pape, we design a fast and memoy-efficient sketching pocess to handle steaming histogams with both disciminative and gadual fogetting weights, whee V R E >0. Specifically, we popose a new appoach to diectly compute a i,j, which has the following thee highly desiable popeties: 1) The geneated a i,j follows the exact same distibution as the one geneated using Eq. 9, which ensues the coectness of ou sketching method (i.e., Eq. 6 and Eq. 7 still hold). 2) The a i,j sampling pocess does not equie the lage set of in-memoy paametes i,j, c i,j, β i,j whee i E and j = 1, 2,..., K; this makes ou method memoyefficient. 3) The ceated sketch S is invaiant unde unifom scaling of V, which seves as a basis fo the fast incemental sketch update. In the following, we fist pesent ou sketch ceation method, and then the incemental sketch update pocess. 4.2 Sketch Ceation Ou sketch ceation method boows the idea of consistent weighted sampling with eal numbe inputs [11]. Diffeent fom the oiginal method, we popose a new appoach to compute a i,j. Specifically, the objective of the oiginal method is to sample y i,j such that log y i,j is unifomly distibuted on [log i,j, log ] conditioned on i,j. Among many possible fomulations that can fulfill this distibution equiement, the oiginal fomulation (Eq. 8) is specifically designed to also sample the coesponding f j to obtain the sketch element (i j, f j log Vi ) (whee f = i,j + β i,j in [12]). Howeve, as poved in [11], fj can be ignoed fom the

6 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST sketch (i j, f j ) (i.e., Eq. 7). In such a case, it is only necessay to sample y i,j satisfying its distibution equiement, and then compute the coesponding a i,j. In the following, we fist deive a simplified method to geneate y i,j, based on which we then deive a memoy-efficient method to diectly compute a i,j. Fist, we popose to compute y i,j as follows: y i,j = exp(log i,j β i,j ) (10) fo which the following poposition holds. Poposition 1. Eq. 10 geneates y i,j following the same distibution as geneated by Eq. 8, i.e., log y i,j follows a unifom distibution on [log i,j, log ] conditioned on i,j. Poof. Consideing the vaiable log y i,j, Eq. 8 can be deived as (we ignoe the subscipt (i, j) of and β in the following poof): log y i,j = ( log = log ) + β β (( log Vi + β ) log V ) (11) i + β log Vi log Vi whee ( + β) + β is the fac function of log + β, which etuns its factional pat [37]. Since both and ae known, this fac function can be consideed log Vi log Vi as fac(β + C), whee C = is a constant. Consideing β Unifom(0, 1), this function actually etuns the factional pat of a vaiable following Unifom(C, C + 1), which emains the same as Unifom(0, 1). In othe wods, it is a unifom mapping fom Unifom(0, 1) to itself. Subsequently, we have fac( + β) Unifom(0, 1), which can be eplaced by β. Theefoe, we obtain: log y i,j = log β (12) which is the same as in Eq. 10. Theefoe, y i,j geneated by Eq. 10 follows the same distibution as geneated by Eq. 8. We intoduce z = log y i,j and compute its Cumulative Distibution Function (CDF) as follows: P (Z < z) = P (log β < z) ( ) log Vi z = P < β Consideing β Unifom(0, 1), we obtain: P (Z < z) = 1 log z = z (log ) log (log ) (13) (14) which is the CDF of Unifom(log, log ). This completes the poof. Second, based on Poposition 1, we deive a memoyefficient method to diectly compute a i,j as follows: a i,j = log β (15) fo which the following poposition holds. Poposition 2. Eq. 15 geneates a i,j following the same distibution as geneated by Eq. 9. Poof. As suggested by Poposition 1, we now use Eq. 10 to sample y i,j. Combining Eq. 10 with Eq. 9 and knowing (1 β) and β both follow Unifom(0, 1), we obtain: a i,j = c exp( β) (16) Consideing a new vaiable m = β, we can compute its Pobability Distibution Function (PDF) as: f M (m) = β f B(β)f R ( m β )dβ (17) whee f B ( ) and f R ( ) ae the PDF of β Unifom(0, 1) and Gamma(2, 1), espectively. We futhe deive f M (m) as follows: f M (m) = β 1 m β exp( m β )dβ = exp( m) (18) We see that m = β follows the exponential distibution Exp(1), and can then be sampled as m = log β, β Unifom(0, 1). Subsequently, Eq. 16 can be simplified as: a i,j = cβ (19) It is woth noting that the numeato cβ follows the same distibution as β Exp(1), as they ae both the multiplication of a vaiable fom U nif om(0, 1) and a vaiable fom Gamma(2, 1). Theefoe, cβ can be futhe simplified as log β whee β Unifom(0, 1). We note that as β, β, β Unifom(0, 1), we keep using only β Unifom(0, 1) fo the sake of simplicity fo notations, and obtain: a i,j = log β (20) which is the same as Eq. 15. This completes the poof. Poposition 2 ensues the coectness of ou sketching method (i.e., Eqs. 6 and 7). Moe impotantly, ou method only equies the paametes β i,j Unifom(0, 1), which can be efficiently geneated using andom hash functions athe than being maintained as in-memoy paametes Memoy-Efficient Sketching Implementation To efficiently geneate β i,j in an online manne, we apply a andom hash function h j on i to obtain the coesponding hash value h j (i), which follows a unifom distibution ove (0, 1). We then use h j (i) to eplace β i,j. With K independent andom hash functions {h j j = 1, 2,..., K}, we can compute a i,j in a memoy-efficient manne as follows: a i,j = log h j(i) (21) In summay, ou memoy-efficient method maintains only K independent andom hash functions {h j j = 1, 2,..., K}, athe than the lage set of in-memoy paametes i,j, c i,j, β i,j whose size is popotional to the cadinality of the histogam elements E. Alg. 1 shows the sketch ceation pocess. Note that we keep both the sketch S and its hash values A (the latte will be used fo incemental sketch update). Fig. 2 shows the sketch ceation pocess fo one sketch element S j.

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7 Fig. 3.

Ceating one sketch element fom histogam V with cadinality E = 5 (i = 1, 2,..., 5).

Algoithm 1 Sketch ceation Input: Histogam V, Sketch length K, Independent andom hash functions {h j j = 1, 2,..., K} Output: Sketch S and the coesponding hash values A 1: fo j=1,2,.

7 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST Fig. 3. Illustation of deceasing gadual fogetting weights (exponentially) when a new histogam element x t+1 is eceived (fome weights ae epesented in gay) Fig. 2. Ceating one sketch element fom histogam V with cadinality E = 5 (i = 1, 2,..., 5). By computing the hash value a i,j fo each i, we select the histogam element whose hash value is minimal as the sketch element and also keep its coesponding hash value, i.e., (S j = 3, A j = 0.14). Algoithm 1 Sketch ceation Input: Histogam V, Sketch length K, Independent andom hash functions {h j j = 1, 2,..., K} Output: Sketch S and the coesponding hash values A 1: fo j=1,2,...,k do log hj(i) 2: Compute a i,j = 3: Set sketch element S j = agmin i E a i,j 4: Set the coesponding hash value A j = min i E a i,j 5: end fo 6: etun S and A β) Unifom(0, 1), the computation can be simplified to ĥ j (i) = 1 β. As we only cae about the odeing of those hash values, we can futhe simplify the computations via a monotonic tansfomation, and obtain ĥj(i) = log β. By eplacing β with a andom hash value h j (i) (h j is a andom hash function), we obtain: ĥ j (i) = log h j(i) (25) which is the same as Eq. 21. In othe wods, a i,j is equal to ĥ j (i) = min f=1,2,...,vi h j (i, f) fo intege V N E. Theefoe, ou method can be egaded as a genealization of the oiginal consistent weighted sampling method to measue the nomalized min-max similaity with eal V R E Intinsic Connection to Consistent Weighted Sampling with Intege Vectos We now discuss its intinsic connection to the oiginal consistent weighted sampling method with intege vecto inputs [20], and show that ou method is indeed a genealization of the oiginal method to eal vectos inputs. The oiginal consistent weighted sampling method [20] uses a andom hash function h j to geneate the hash values h j (i, f) fo each (i, f), whee i E and f {1, 2,..., }, and then etuns (i j, f j ) = agmin i E,f {1,2,...,} h j (i, f) as one sample (i.e., one sketch element S j ). As suggested by [11], the sketches can be simplified by keeping only i j and discading f j. By defining ĥj(i) = min f=1,2,...,vi h j (i, f), one sample is then etuned as i j = agmin i E ĥj(i). We now deive a new method to diectly compute ĥ j (i) without involving f. The key idea is to diectly compute ĥj(i) fom only one andom hash value and, athe than andom hash values. Specifically, since h j Unifom(0, 1), we compute the CDF of ĥj(i) as: P [ĥj(i) p] = 1 (1 p) Vi (22) fo p (0, 1). We assume a andom unifomly distibuted vaiable β U nif om(0, 1), and fomulate its CDF P [β q] = q, q (0, 1) as: P [β q] = P [β 1 (1 p) Vi ] = 1 (1 p) Vi (23) whee q = 1 (1 p) Vi is an invetible continuous inceasing function (p = 1 (1 q)) ove (0, 1). By applying the change-of-vaiable technique on Eq. 23, we obtain: P [1 (1 β) p] = 1 (1 p) Vi (24) Since Eqs. 24 and 22 show the same CDF, ĥ j (i) can be obtained by ĥj(i) = 1 (1 β). As (1 4.3 Incemental Sketch Update Incemental sketch update equies that a new sketch S(t + 1) can be fast computed based on the fome sketch S(t) (with its coesponding hash values A(t)) and the incoming histogam element x t+1. Specifically, when a new histogam element x t+1 = i is eceived fom the data steam, the gadual fogetting weights of all existing elements of the histogam evolve by a facto of e λ (as shown in Fig. 3), which esults in a unifom scaling of V. One of the key popeties of ou sketch is that it is invaiant unde unifom scaling of V, which allows us to pefom the scaling by quickly adjusting A only. Aftewads, as only the disciminative weight of histogam element i is updated, we only need to ecompute, and then its coesponding hash value a i,j. Finally, the new sketch S(t + 1) can be computed based on the adjusted sketch S(t) (with A(t)) and the incoming histogam element x t+1 = i (with a i,j). In the following, we fist pesent the unifom scaling invaiance popety of ou sketch, and then the incemental update pocess. Poposition 3. If sketch S (with its coesponding hash values A) is ceated fo V using Alg. 1, then fo any positive constant γ, S emains the sketch fo γv with the coesponding hash values 1 γ A. Poof. Alg. 1 computes a sketch element S j follows: fo γv as a i,j = log h j(i) = 1 γ γ a i,j (26) ( ) 1 S j = agmin i E γ a i,j = agmin a i,j = S j (27) i E A j = min i E This completes the poof. 1 γ a i,j = 1 γ min i E a i,j = 1 γ A j (28)

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8 Fig. 4. An example of incementally updating one sketch element. I).

By adding the incoming histogam element x t+1 = 2 (2 E) with weight wi d 1 to the scaled histogam, we ecompute only the hash value fo i = 2, i.e., a 2,j = 0.143. III).

8 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST Fig. 4. An example of incementally updating one sketch element. I). Accoding to the scaling of the histogam, we keep the sketch invaiant S j (t) = 3, and adjust its hash value fom A j (t) = 0.14 to A j (t) e λ = (λ = 0.05 in this example). II). By adding the incoming histogam element x t+1 = 2 (2 E) with weight wi d 1 to the scaled histogam, we ecompute only the hash value fo i = 2, i.e., a 2,j = III). By selecting the minimum hash value between A j (t) e λ = and a 2,j = 0.143, we update S j (t + 1) = 2 and A j (t + 1) = Poposition 3 seves as a basis fo ou incemental sketch update pocess, which woks as follows (fo one sketch element S j ): I. When a new histogam element x t+1 = i is eceived, we scale V (t) by a facto of e λ, and adjust the sketch accoding to Poposition 3, i.e., S j (t) and A j (t) e λ. II. We add the incoming histogam element i to the scaled histogam. Fist, we update the disciminative weight wi d accoding to the method descibed in Section Then, as the newest histogam element always has gadual fogetting weight 1 (w g t = e λ(tn tn) = 1), we add the incoming histogam element i with the oveall weight wi d 1 to the scaled histogam: (t+1) = (t) e λ +wi d 1 if i E. In case i / E, we add i to E and expand V to include (t+1) = wi d 1. Aftewads, we ecompute only the hash value fo i, i.e., a i,j. III. By compaing the new hash value a i,j with the adjusted hash value A j (t) e λ, we update the sketch S j (t + 1) = i and A j (t + 1) = a i,j if a i,j < A j (t) e λ, S j (t + 1) = S j (t) and A j (t + 1) = A j (t) e λ othewise. Fig. 4 illustates the incemental sketch update pocess following the pevious example shown in Fig Implementation Details Ou incemental sketch update pocess equies to access the fome histogam V (t) to compute V (t+1). To maintain such a steaming histogam V of an eve-inceasing size, we popose an extended count-min sketch model. The classical count-min sketch [25] is a fixed-sized pobabilistic data stuctue Q (d ows and g columns) seving as a fequency table of steaming elements. It uses d independent andom hash functions h l (l = 1, 2,..., d) to map steaming elements onto a ange of 1, 2,..., g (countes). Evey time a new element i is eceived, fo each ow l, its hash function h l is applied to i to detemine a coesponding column h l (i), and then the counte Q l,hl (i) is inceased by 1. To get the estimated fequency at time t, the coesponding hash function is applied to i to look up the coesponding counte fo each ow. The estimate is then etuned as the minimum of all the pobed countes acoss all ows, i.e., (t) = min l Q l,hl (i). The estimated fequency eo is guaanteed [38] to be at most 2 g with pobability 1 ( 1 2 )d. In ou case of steaming histogam with both disciminative and gadual fogetting weights, the cumulative fequency (weights) of all histoical histogam elements is scaled with a facto e λ (i.e., V (t) e λ ) evey time a new element is eceived fom the data steams. Theefoe, we extend the above count-min sketch method to conside such decay weights as follows. Fist, befoe adding a new element i fom the data steam, we unifomly scale all countes acoss all ows by that facto, i.e., Q l,hl (i)(t) e λ. In such a way, fo any histogam element i, its estimated weighted cumulative count (t) is also scaled to min l Q l,hl (i)(t) e λ = (t) e λ, which coesponds exactly to Step I in ou sketch update pocess. Aftewads, we add the new element with its coesponding weight wi d 1, i.e., Q l,hl (i)(t + 1) = (t) e λ + wi d 1, which coesponds to Step II in ou sketch update pocess. As the afoementioned estimated fequency eo of count-min sketch depends only on its stuctual paamete g and d, it is easy to see that the estimated eo does not change unde this staightfowad extension, as a unifom scaling does not affect the data stuctue itself. Fo the detailed poof of the estimated eo, please efe to [38]. In this study, we empiically set the default paametes d = 10, g = 50 to guaantee an eo of at most 4% with pobability (see Section fo moe details). In addition, we also use a classical count-min sketch to stoe the label fequency vecto F l, which is used to compute the disciminative weights as descibed in Section Fo the sake of simplicity, we keep the same paametes as the one we used fo V. 4.5 Analysis of D 2 HistoSketch Discussion on Eo Bound The oiginal consistent weighed sampling technique [32] have poven the coectness of Eq. 6, with its associated eo bound. Howeve, this method takes only intege vectos as input, and equies a sketch element (i j, f j ) which pohibits its application to steaming histogams. To handle steaming histogams, ou poposed method is based on the 0-bit consistent weighted sampling technique poposed by [11]. This technique is able to take eal vectos as input, and moe impotantly, the autho empiically poved Eq. 7, which seves as a fundamental element of ou method. Howeve, as mentioned by the autho in [11], a igoous poof of Eq. 7 tuns out to be a difficult pobability poblem, which is not given in the oiginal pape. In this study, we focus on designing a sketching method fo steaming histogams with both disciminative and gadual fogetting weights, and we have igoous poven the coectness of ou method by showing its equivalence to the 0-bit consistent weighed sampling technique (using Poposition 1 and 2). Then, ou method

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9 has the same eo bound as the 0-bit consistent weighed sampling technique [11].

9 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST has the same eo bound as the 0-bit consistent weighed sampling technique [11]. Howeve, the futhe poof about the 0-bit consistent weighed sampling technique and its associated eo bound emains a difficult poblem, and is outside the scope of ou pape Time and Space Complexity Analysis Time. Since ou sketches ae incementally maintained, we discuss the time complexity fo each incoming histogam element fom the data steams. Specifically, to update a sketch of length K, we pefom ou incemental sketch update pocess fo all K sketch elements, which takes O(K) time. In addition, we also need to etieve/update the coesponding count-min sketches fo both V and F l (with d ows), which takes O(d) time. The total time complexity fo updating one histogam element is hence O(K + d). Compaed to ou pevious wok HistoSketch [14], ou poposed method with few in-memoy paametes can compute sketches in a moe efficient way, showing a highe pocessing speed with an impovement of 12.2% (see Section fo moe detail). Space. Each steaming histogam is epesented by a sketch of length K taking O(K) space. Fo the incemental sketch update pupose, we stoe the aw steaming histogam n a count-min sketch data stuctue (d ows and g columns) taking O(dg) space. Subsequently, the total space complexity is O(K +dg) fo one steaming histogam. Consideing a set of n histogams with a set of L labels (the label fequency vecto F l stoed in a count-min sketch data stuctue with d ows and g columns), the total space complexity is O(n (K + dg) + L (dg)). Note that ou space complexity is linea w..t. the numbe of histogams n, and moe impotantly, that it is independent of the eve-gowing cadinality of steaming histogam elements E. Compaed to ou pevious wok HistoSketch [14] that equies in-memoy paametes i,j, c i,j, β i,j and thus has a space complexity O(n (K + dg) + L (dg) + K E ), ou poposed memoy-efficient method can damatically educe the memoy consumption in pactice, as E is usually lage and gowing ove time. Fo example, in ou expeiments on the NYC dataset (n = 3173, K = 100, L = 9, d = 10, g = 50 and E is about 2 million), the memoy consumption of ou poposed method is only 1% of the memoy equied by ou pevious wok HistoSketch. 5 EXPERIMENTAL EVALUATION In this section, we evaluate D 2 HistoSketch on multiple classification tasks using both synthetic and eal-wold datasets. In the following, we fist pesent ou expeimental setup, followed by the esults on both types of datasets. 5.1 Expeimental Setup To evaluate the pefomance of ou similaity-peseving sketches, we pefom classification tasks based on these sketches in diffeence scenaios, which is a common evaluation scheme fo similaity peseving sketching methods [4], [11], [13], [14]. Specifically, based on labeled steaming histogams, we ty to classify those histogam instances without labels. We use a KNN classifie [39] which can always take the most up-to-date taining data (sketches) fo classification without maintaining a built classification model. Such Fig. 5. Pobability of steaming histogam elements geneated fom its initial distibution fo the synthetic dataset. a popety fits ou case of classifying steaming histogams with continuously incoming histogam elements, whee the sketches ae continuously updated accodingly. In addition, using a simple KNN classifie, we can put moe focus on the distance measues athe than classifies. We empiically set KNN to conside the five neaest neighbos. We conside the following evaluation scenaios: Synthetic Dataset. Synthetic data is widely used in studying concept dift adaptation [19], [40]. The advantage is that we can simulate diffeent cases of concept dift in steaming histogams with contollable paametes. Typical methods of simulating data steams with concept dift often use a moving hypeplane to geneate a steam of complete data instances [40]. Howeve, it cannot be diectly adopted fo steaming histogams, as the elements of a histogam ae obseved in a steaming manne. Theefoe, we design ou own data simulation method. Specifically, we conside two Gaussian distibutions N (100, 20) and N (110, 20) epesenting two classes of histogams, espectively. The steaming histogam elements ae then geneated as the neaest integes of the andom numbes sampled fom those distibutions. Fo each class, we simulate 500 histogams with 1000 elements each. We then split the 500 histogams to 50%-50% fo taining and the testing, espectively. The histogam elements ae geneated in a andom ode. To simulate concept-dift issues, we conside both abupt and gadual dift cases [19] in testing data. Fo abupt dift, stating fom 25% of steaming histogam elements, the testing data of one distibution abuptly stats to eceive the histogam elements geneated fom the othe distibution, and also changes thei labels immediately. Fo gadual dift, fom 25% to 35% of steaming histogam elements, the testing data of one distibution gadually stats to eceive the histogam elements geneated fom the othe distibution with an inceasing pobability (fom 0 to 1). The labels of the testing data also change gadually fom one class to the othe, i.e., fom 0% to 100%. Fig. 5 shows the pobability of steaming histogam elements geneated fom its initial distibutions. Note that it is complementay to the pobability of histogam elements geneated fom the othe distibution. POI Dataset. Ou sketching method can be applied to solve the poblem of semantic place labeling [4], whee we want to infe a place s categoy (e.g., supemaket o ba) based on its customes visiting pattens (i.e., the steaming histogam of its customes visits). The basic intuition is that POIs of the diffeent type usually have diffeent tempoal visiting pattens, e.g., bas ae mostly visited duing the night while museums ae often visited duing the daytime. Pevious studies have shown that consideing usetime pais as histogam elements (i.e., fine-gained visiting

9,160 15,479 pattens) yields much highe accuacy than consideing only time (i.e., coase-gained visiting pattens) [4]. We thus conside fine-gained pattens in the following.

10 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST TABLE 1 POI dataset statistics Dataset New Yok City Tokyo Istanbul (NYC) (TKY) (IST) Numbe of check-ins 142, , ,771 Numbe of POIs 3,174 2,993 3,120 Numbe of uses 12,798 9,160 15,479 pattens) yields much highe accuacy than consideing only time (i.e., coase-gained visiting pattens) [4]. We thus conside fine-gained pattens in the following. Specifically, fo one use s visit to a POI, we fist map the visiting time onto one of the 168 hous in a week peiod (discetization of time), and then conside the use-time pai as a histogam element. With a lage and continuously inceasing numbe of uses ove time, the cadinality of the steaming histogam apidly inceases. Moe impotantly, the visiting patten of a POI may change both abuptly (e.g., caused by the change of POI type) and gadually (e.g., caused by the intoduction of new menu items in a estauant). To evaluate ou method using this task, we use a dataset fom Fousquae povided by [3], [23], which has been widely used to study the poblems of semantic place labeling [4], [14] and POI ecommendation [41], [42], [43]. The dataset contains use check-in data on POIs fo about two yeas (fom Apil 2012 to Mach 2014). Each check-in ecods one visit of a use to a POI (with the associated categoy) at a cetain time. We andomly select 20% of the POIs as unlabeled testing data and egad the est as taining data. The classification is pefomed at the end of each month on the second yea (in ode to avoid too few check-ins fo some POIs duing the fist yea). The categoies (labels) of POIs in the dataset ae classified by Fousquae into 9 oot categoies (i.e., Ats & Entetainment, College & Univesity, Food, Geat Outdoos, Nightlife Spot, Pofessional & Othe Places, Residence, Shop & Sevice, Tavel & Tanspot), which ae futhe classified into 291 sub-categoies. Without loss of geneality, we select thee big cities, New Yok City, Tokyo and Istanbul, fo ou expeiments. Table 1 summaizes the main chaacteistics of ou dataset. Movielens Dataset. Ou sketching method can also be applied to pedict movie gene based on uses tagging activity steams. Specifically, on a movie ecommendation platfom, a use can add tags to a movie. Subsequently, we chaacteize each movie using the histogam of use tagging activities, whee each use is a histogam element. Simila to the case of POI dataset, we ae then able to classify a movie in to diffeent genes based on its steaming histogam. We use the MovieLens 20M Dataset povided by [44]. This dataset contains movie tagging activity about 10 yeas. We select top 8 pimay genes (i.e., action, adventue, animation, childen, comedy, Cime, documentay, Dama ), and keep only the tagging data on the elated movies. Finally, we obtain 24,394 movies, 7,622 uses, 434,054 tagging activities. We andomly select 20% of the movies as unlabeled testing data and egad the est as taining data. The classification is pefomed at the end of 6th to 10th yeas. (a) Abupt dift Fig. 6. Pefomance on classification task. 5.2 Pefomance on Synthetic Dataset (b) Gadual dift In this section, we fist show the effectiveness of D 2 HistoSketch by compaing the classification pefomance using diffeent steaming histogams and thei coesponding sketches. Aftewads, we study the impact of the sketch length K, the weight decay facto λ, and the count-min sketch paametes in diffeent scenaios Pefomance on classification task To demonstate the advantages of consideing disciminative and gadual fogetting weights in steaming histogams and the similaity-peseving sketches, we compae the following fou methods that keep the full histogams in memoy, and also thei coesponding sketching methods. We set sketch length K = 100 fo all the sketching methods, use entopy weighting fo disciminative weights, and set λ = 0.02 fo gadual fogetting weights when applicable. Histogam-Classical: full histogams whee the histogam s elements ae unweighted. Histogam-Disciminative: full histogams whee the histogam s elements ae assigned with disciminative weights. Histogam-Fogetting: full histogams whee the histogam s elements ae assigned with gadual fogetting weights. D 2 Histogam: full histogams whee the histogam s elements ae assigned with both disciminative and gadual fogetting weights. Sketch-Classical: ou sketching method configued with unweighted histogam elements (appoximation of Histogam-Classical). Sketch-Disciminative: the sketching method with only disciminative weights (appoximation of Histogam- Disciminative), which is equivalent to POISketch [4]. Sketch-Fogetting: the sketching method with only gadual fogetting weights (appoximation of Histogam-Fogetting), which is equivalent to HistoSketch [14]. D 2 HistoSketch: ou poposed sketching method with both disciminative and gadual fogetting weights (appoximation of D 2 Histogam). Fig. 6 shows the classification accuacy ove time (numbe of steaming elements pe histogam) fo both abupt and gadual difts. Fist, compaing the accuacy of using full histogams, we obseve that ou D 2 Histogam achieves the best esults oveall. On one hand, consideing gadual fogetting weights can significantly educe the ecovey time fom concept dift. Fo example, in the case of abupt

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11 (a) Abupt dift Fig. 7. Impact of sketch length K. (b) Gadual dift (a) Abupt dift Fig. 8. Impact of weight decay facto λ.

On the othe hand, disciminative weights can effectively impove accuacy outside of the peiod fo concept dift adaptation (i.e., befoe concept dift and afte ecovey fom concept dift).

Second, among all the sketching methods, the poposed D 2 HistoSketch achieves the best esults, as it can efficiently appoximate the similaity between D 2 Histogam with both disciminative and gadual

9% compaed to Sketch-Fogetting. 5.2.2 Impact of sketch length K The sketch length K influences how well the sketch can appoximate the similaity with the oiginal data.

11 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST (a) Abupt dift Fig. 7. Impact of sketch length K. (b) Gadual dift (a) Abupt dift Fig. 8. Impact of weight decay facto λ. (b) Gadual dift dift (stating fom 250th histogam elements), accuacy ecoves afte eceiving 450 and 150 elements fo Histogam- Disciminative and D 2 Histogam, espectively. On the othe hand, disciminative weights can effectively impove accuacy outside of the peiod fo concept dift adaptation (i.e., befoe concept dift and afte ecovey fom concept dift). Fo example, fo both cases of abupt and gadual difts, D 2 Histogam shows a 2.6% impovement of accuacy fom Histogam-Fogetting. Second, among all the sketching methods, the poposed D 2 HistoSketch achieves the best esults, as it can efficiently appoximate the similaity between D 2 Histogam with both disciminative and gadual fogetting weights. Specifically, D 2 HistoSketch quickly adapts to concept dift (showing simila adaptation speed as that of D 2 Histogam), and shows a significant accuacy impovement of 7.9% compaed to Sketch-Fogetting Impact of sketch length K The sketch length K influences how well the sketch can appoximate the similaity with the oiginal data. In this expeiment, by fixing the gadual fogetting weight decay facto λ = 0.02, we vay the sketch length K within [20, 50, 100, 200, 500, 1000] to investigate the pefomance of ou method. We also plot the esults fom Histogam- Classical and D 2 Histogam as efeences. Fig. 7 shows the classification accuacy ove time in both cases of abupt and gadual dift. Fist, we obseve a positive impact of sketch length K on the classification accuacy, i.e., lage values of K imply a highe accuacy, as longe sketches can peseve moe infomation and thus bette appoximate the similaities of the D 2 Histogam. The accuacy flattens out afte K = 500 (D 2 HistoSketch with K 500 ae highly ovelapped with D 2 Histogam), indicating that a sketch of length 500 is sufficient fo accuate similaity appoximation. Second, we find that the sketch length K has no obvious impact on the adaptation speed, which is actually contolled by the gadual fogetting weight decay facto λ (we will discuss this in the next expeiment) Impact of weight decay facto λ The weight decay facto λ balances the tade-off between the concept dift adaptation speed and the similaity appoximation pefomance. In this expeiment, by fixing the sketch length K = 100, we vay the weight decay facto λ within [0, 0.005, 0.01, 0.02, 0.05, 0.1] to investigate the pefomance of ou method. We also compae ou method with the following two methods: Histogam-LatestK whee the histogam is built with the latest K histogam elements fom the steam, which is a typical method fo abupt fogetting (i.e., sliding window based concept-dift adaptation) [17]. It can also be egaded as a sketching method in the sense that the latest K histogam elements (unweighted) ae the sketches to epesent the histogam. Histogam-LatestK-Disciminative whee we futhe incopoate disciminative weights (i.e., entopy weights as fo ou method) into Histogam-LatestK. Fig. 8 shows the classification accuacy ove time in both cases of abupt and gadual dift. Fist, by compaing the esults of diffeent weight decay factos, we obseve clealy the tade-off between the concept dift adaptation speed and classification accuacy. On one hand, lage λ values imply faste adaptation to concept dift, as the algoithm quickly fogets outdated data (i.e., it puts lowe weights on the outdated data). On the othe hand, lage values of λ lead to lowe classification accuacy, as the algoithm uses less infomation fom fome histogam elements fo sketching, which leads to wose similaity appoximation. Second, compaed to the two additional baselines, we find that ou D 2 HistoSketch-λ-0.05 outpefoms Histogam-LatestK by achieving highe accuacy and faste adaptation speed at the same time. Howeve, we found that Histogam-LatestK- Disciminative shows compaable esults to ou method, i.e., its adaptation speed is faste than D 2 HistoSketch-λ and slowe than D 2 HistoSketch-λ-0.05 while its accuacy is lowe than D 2 HistoSketch-λ-0.02 and highe than D 2 HistoSketch-λ Despite such simila esults, thee ae two obvious advantages of ou methods: 1) D 2 HistoSketch is able to balance the adaptation speed and the accuacy unde fixed-size sketches while Histogam-LatestK needs to vay sketch length K to tune such tade-off; 2) ou method is much faste in similaity computation than Histogam- LatestK-Disciminative, as the latte equies set opeations while D 2 HistoSketch only elies on Hamming distance. Fo example, to classify one histogam using ou testing PC 1, ou method needs only 13ms while Histogam-LatestK takes 152ms, which shows a 12x speedup Impact of count-min sketch paametes In this expeiment, we study the impact of the count-min sketch paametes on the classification pefomance. Specifically, we study the classification accuacy of ou method by vaying the numbe of ows d and the numbe of column g within [5, 10, 20, 50, 100]. 1. Intel Coe i7-4770hq@2.20ghz, 16GB RAM, Mac OS X, implementation using MATLAB v2014b.

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12 (a) Abupt dift Fig. 9. Impact of count-min sketch paametes. (b) Gadual dift Fig.

We obseve that lage d and g can both lead to a highe accuacy, as they can bette appoximate the fequency table of histogam V and the label fequency vecto F l.

Theefoe, as the accuacy flattens out afte d = 10, g = 50

12 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST (a) Abupt dift Fig. 9. Impact of count-min sketch paametes. (b) Gadual dift Fig. 9 shows the aveage classification accuacy outside of the peiod fo concept dift adaptation. We obseve that lage d and g can both lead to a highe accuacy, as they can bette appoximate the fequency table of histogam V and the label fequency vecto F l. Howeve, lage d and g will also lead to highe time and space complexity as discussed in Section Theefoe, as the accuacy flattens out afte d = 10, g = 50, we empiically set the count-min sketch paametes d = 10, g = 50 fo all the expeiments. 5.3 Pefomance on Real-Wold Datasets To evaluate ou method using eal-wold datasets, we fist compae it with state-of-the-at appoaches, and then show its classification accuacy ove time, followed by its untime pefomance Compaison with othe methods We compae D 2 HistoSketch with all afoementioned baselines. The sketch length K is empiically set to 100 fo all elated methods. Fig. 10 plots the aveage classification accuacy. Fist, we obseve that D 2 Histogam yields the highest accuacy, showing the effectiveness of consideing both disciminative and gadual fogetting weights on steaming histogam elements. Thid, ou D 2 HistoSketch also outpefoms othe sketching methods by efficiently appoximating similaity between D 2 Histogam. In paticula, D 2 HistoSketch can efficiently appoximate the similaity with only a small loss of classification accuacy (e.g., about 3.25% fo oot POI categoies on the NYC dataset) compaed to D 2 Histogam. Finally, D 2 HistoSketch also outpefoms Histogam-LatestK/Histogam-LatestK- Disciminative, showing the effectiveness of gadual (athe than abupt) fogetting of histoical data. In addition, compaing the esults between the POI dataset and the Movielens dataset, we find that the impovement by consideing gadual fogetting (eithe in histogams and sketches) on Movielens dataset is smalle than on POI dataset. It is due to the fact that concept-dift is aely obseved in the Movielens dataset. In othe wods, compaed to POIs, a movie aely changes its gene Classification accuacy ove time In this expeiment, we study the classification accuacy of ou sketching method ove time. We focus on the NYC dataset with the 9 oot levels of POI categoies (Expeiments on the othe datasets show simila esults). In addition to the andomly selection stategies of testing POIs, we futhe conside only the POIs with categoy changes as testing data, as the change of POI categoies will likely lead to concept dift (paticulaly of the abupt kind). We keep the same paamete setting as in the pevious expeiment. Fig. 10. Compaison with othe methods (a) Random testing POIs Fig. 11. Classification pefomance ove time (b) Categoy-changing POIs Fig. 11(a) shows the classification accuacy fo each of the 12 testing months. We obseve that the accuacy of all sketching methods slightly inceases ove time with the accumulated histogam elements, as obseving moe histogam elements leads to moe accuate similaity measuement of steaming histogams. In paticula, compaed to othe sketching methods, ou D 2 HistoSketch achieves consistently highe accuacy by consideing both disciminative and gadual fogetting weights. Fig. 11(b) shows the same esults on the testing POIs with categoy changes only. We obseve a lage impovement of consideing gadual fogetting weights than that in Fig. 11(a). Fo example, the accuacy gain of ou method ove Sketch-Disciminative is 3.03% when testing on categoy-changing POIs, while it is 2.13% when testing on andom POIs. This obsevation futhe shows the effectiveness of ou method at handling concept dift. Note that as opposed to the synthetic dataset, we do not obseve any sudden dop of accuacy, as sets of POIs aely change thei types simultaneously Runtime pefomance fo classification In this expeiment, we investigate the untime pefomance of sketch-based classification. Moe pecisely, we evaluate the classification time w..t. sketch length K. We also compae with Histogam-LatestK which can be egaded as a sketching method as it keeps only the latest K histogam elements. Fig. 12(a) plots the KNN classification time (log scale) on ou test PC 3. We obseve that compaed to the full histogams (D 2 Histogam), using D 2 HistoSketch damatically educes the classification time (with a 7500x speedup), since the Hamming distance between sketches of size K (e.g., K=100 in pevious expeiments) can be much moe efficiently computed than the nomalized min-max similaity between the full histogams of much lage size E (e.g., E is about 2 million fo the NYC dataset). We also note that all the sketching methods have simila classification time, as they all compute the Hamming distance between vectos

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13 (a) Classification time Fig. 12. Runtime pefomance (b) Pocessing speed of size K.

4 Runtime pefomance fo sketch maintenance In this expeiment, we investigate the untime pefomance of sketch maintenance. We evaluate the steaming histogam pocessing speed w..t. sketch length K (as the time complexity of maintaining D 2 HistoSketch mainly depends on the sketch length K).

13 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST (a) Classification time Fig. 12. Runtime pefomance (b) Pocessing speed of size K. Compaed to Histogam-LatestK, ou method also shows a 15x speedup, as Histogam-LatestK equies set opeations fo similaity computation Runtime pefomance fo sketch maintenance In this expeiment, we investigate the untime pefomance of sketch maintenance. We evaluate the steaming histogam pocessing speed w..t. sketch length K (as the time complexity of maintaining D 2 HistoSketch mainly depends on the sketch length K). Hee we compae ou method with the following two methods: 1) HistoSketch-Oiginal [14] (consideing gadual fogetting weights only) whose sampling pocess is implemented with Eq 10 and 9 and equies the lage set of in-memoy paametes i,j, c i,j, β i,j ; 2) HistoSketch-Fast which is implemented with ou efficient sampling method as show in Eq. 21 and consideing gadual fogetting weights only (fo a fai compaison with HistoSketch-Oiginal). Fig. 12(b) shows the pocessing speed of steaming histogam elements. Fist, compaed to HistoSketch-Oiginal, HistoSketch-Fast shows a highe pocessing speed (with a 12.2% impovement ove HistoSketch-Oiginal). Note that these two methods ae equivalent in peseving similaities of steaming histogams with gadual fogetting weight only, and thus show exactly the same classification pefomance. Theefoe, ou poposed sampling method is moe untime-efficient than the sampling method in HistoSketch-Oiginal. Second, by adding the disciminative weights, D 2 HistoSketch yields a slightly slowe pocessing speed than HistoSketch-Fast, but with highe classification pefomance. Finally, compaed to HistoSketch-Oiginal, D 2 HistoSketch can achieve a faste pocessing speed (with a 5.9% impovement on aveage) and a highe classification accuacy (with a 2.78% impovement on aveage) at the same time. We also find that the pocessing speed slightly deceases with inceasing sketch lengths fo all the methods, as longe sketches equied a little moe time to update. Moeove, we believe that the pocessing speed of ou method (about 2200 pe second) is able to handle most eal-wold use cases. Fo example, Fousquae check-in steams achieved a peak-day ecod of 7 million check-ins/day in 2015 (about 81 check-ins/sec on aveage). In addition, ou method can be easily paallelized w..t. the numbe of histogams (i.e., numbe of POIs), as sketches of steaming histogams ae independently maintained fom each othe. 6 CONCLUSION AND FUTURE WORK This pape intoduces D 2 HistoSketch, a similaitypeseving sketching method fo steaming histogams to efficiently appoximate thei Disciminative and Dynamic similaity. D 2 HistoSketch is designed to fast and memoy-efficiently maintain a set of compact and fixedsized sketches of steaming histogams to appoximate thei nomalized min-max similaity. To povide high-quality similaity appoximations, D 2 HistoSketch consides both disciminative and gadual fogetting weights fo similaity measuement, and seamlessly incopoates them in the sketches. Based on both synthetic and eal-wold datasets, ou empiical evaluation shows that ou method is able to efficiently and effectively appoximate the similaity between steaming histogams while outpefoming stateof-the-at sketching methods. Compaed to full steaming histogams with both disciminative and gadual fogetting weights in paticula, D 2 HistoSketch is able to damatically educe the classification time (with a 7500x speedup) at the expense of a small loss in accuacy only (about 3.25%). As futue wok, we will exploe the poblem of sketching two-dimensional (bivaiate) steaming histogams, and apply ou method to othe application domains. ACKNOWLEDGMENTS This poject has eceived funding fom the Euopean Reseach Council (ERC) unde the Euopean Union s Hoizon 2020 eseach and innovation pogamme (gant ageement /GaphInt), and Fudan Univesity Statup Reseach Gant (SXH ). REFERENCES [1] O. Chapelle, P. Haffne, and V. N. Vapnik, Suppot vecto machines fo histogam-based image classification, IEEE Tans. on Neual Netwoks, vol. 10, no. 5, pp , [2] L. D. Bake and A. K. McCallum, Distibutional clusteing of wods fo text classification, in Poc. of SIGIR, 1998, pp [3] D. Yang, D. Zhang, L. Chen, and B. Qu, Nationtelescope: Monitoing and visualizing lage-scale collective behavio in lbsns, Jounal of Netwok and Compute Applications, vol. 55, pp , [4] D. Yang, B. Li, and P. Cudé-Mauoux, Poisketch: Semantic place labeling ove use activity steams, in Poc. of IJCAI, 2016, pp [5] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, A suvey on leaning to hash, IEEE Tansactions on Patten Analysis and Machine Intelligence, vol. 40, no. 4, pp , [6] C. C. Aggawal and P. S. Yu, On classification of high-cadinality data steams, in Poc. of SDM, 2010, pp [7] Y. Bachach, E. Poat, and J. S. Rosenschein, Sketching techniques fo collaboative filteing, in Poc. of IJCAI, 2009, pp [8] A. Z. Bode, M. Chaika, A. M. Fieze, and M. Mitzenmache, Min-wise independent pemutations, in Poc. of STOC, 1998, pp [9] M. Mitzenmache, R. Pagh, and N. Pham, Efficient estimation fo high similaities using odd sketches, in Poc. of WWW, 2014, pp [10] K. Kutzkov, M. Ahmed, and S. Nikitaki, Weighted similaity estimation in data steams, in Poc. of CIKM, 2015, pp [11] P. Li, 0-bit consistent weighted sampling, in Poc. of KDD, 2015, pp [12] S. Ioffe, Impoved consistent sampling, weighted minhash and l1 sketching, in Poc. of ICDM, 2010, pp [13] W. Wu, B. Li, L. Chen, and C. Zhang, Consistent weighted sampling made moe pactical, in Poc. of WWW, 2017, pp [14] D. Yang, B. Li, L. Rettig, and P. Cudé-Mauoux, Histosketch: Fast similaity-peseving sketching of steaming histogams with concept dift, in Poc. of ICDM, New Oleans, USA, 2017.

JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 14 [15] Y. Yang, F. Liang, N. Jojic, S. Yan, J. Feng, and T. S. Huang, Disciminative similaity fo clusteing and semi-supevised leaning, axiv:1709.

Mohi, A eview and empiical evaluation of featue weighting methods fo a class of lazy leaning algoithms, Atif. Intell. Rev., vol. 11, no. 1-5, pp. 273 314, 1997. [17] J. Gama, I. Žliobaitė, A.

of ECAI Wokshop, 2000, pp. 101 107. [19] R. Klinkenbeg, Leaning difting concepts: Example selection vs. example weighting, Intelligent Data Analysis, vol. 8, no. 3, pp. 281 300, 2004. [20] M.

14 JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST [15] Y. Yang, F. Liang, N. Jojic, S. Yan, J. Feng, and T. S. Huang, Disciminative similaity fo clusteing and semi-supevised leaning, axiv: [stat.ml], [Online]. Available: [16] D. Wettscheeck, D. W. Aha, and T. Mohi, A eview and empiical evaluation of featue weighting methods fo a class of lazy leaning algoithms, Atif. Intell. Rev., vol. 11, no. 1-5, pp , [17] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, A suvey on concept dift adaptation, ACM Computing Suveys, vol. 46, no. 4, p. 44, [18] I. Koychev, Gadual fogetting fo adaptation to concept dift. Poc. of ECAI Wokshop, 2000, pp [19] R. Klinkenbeg, Leaning difting concepts: Example selection vs. example weighting, Intelligent Data Analysis, vol. 8, no. 3, pp , [20] M. Manasse, F. McShey, and K. Talwa, Consistent weighted sampling, Technical Repot MSR-TR , [21] M. O. Wad, G. Ginstein, and D. Keim, Inteactive data visualization: foundations, techniques, and applications. CRC Pess, [22] D. Yang, D. Zhang, V. W. Zheng, and Z. Yu, Modeling use activity pefeence by leveaging use spatial tempoal chaacteistics in lbsns, IEEE Tansactions on Systems, Man, and Cybenetics: Systems, vol. 45, no. 1, pp , [23] D. Yang, D. Zhang, and B. Qu, Paticipatoy cultual mapping based on collective behavio data in location-based social netwoks, ACM Tans. on Intelligent Systems and Technology, vol. 7, no. 3, p. 30, [24] M. Chaika, K. Chen, and M. Faach-Colton, Finding fequent items in data steams, in Intenational Colloquium on Automata, Languages, and Pogamming. Spinge, 2002, pp [25] G. Comode and S. Muthukishnan, An impoved data steam summay: the count-min sketch and its applications, J. Algoithms, vol. 55, no. 1, pp , [26] Y. Ben-Haim and E. Tom-Tov, A steaming paallel decision tee algoithm, Jounal of Machine Leaning Reseach, vol. 11, no. Feb, pp , [27] B. Li, X. Zhu, L. Chi, and C. Zhang, Nested subtee hash kenels fo lage-scale gaph classification ove steams, in Poc. of ICDM, 2012, pp [28] A. Gionis, P. Indyk, R. Motwani et al., Similaity seach in high dimensions via hashing, in Poc. of VLDB, vol. 99, no. 6, 1999, pp [29] J. Song, T. He, L. Gao, X. Xu, A. Hanjalic, and H. Shen, Binay geneative advesaial netwoks fo image etieval, in Poceedings of the AAAI Confeence on Atificial Intelligence, 2018, pp [30] J. Song, H. Zhang, X. Li, L. Gao, M. Wang, and R. Hong, Selfsupevised video hashing with hieachical binay auto-encode, IEEE Tansactions on Image Pocessing, vol. 27, no. 7, pp , [31] J. Song, L. Gao, L. Liu, X. Zhu, and N. Sebe, Quantizationbased hashing: a geneal famewok fo scalable image and video etieval, Patten Recognition, vol. 75, pp , [32] B. Haeuple, M. Manasse, and K. Talwa, Consistent weighted sampling made fast, small, and easy, axiv: [cs.ds], [Online]. Available: [33] C. Yang, R. Duaiswami, and L. Davis, Efficient mean-shift tacking via a new similaity measue, in Poc. of CVPR, vol. 1. IEEE, 2005, pp [34] P. Nakov, A. Popova, and P. Mateev, Weight functions impact on lsa pefomance, EuoConfeence RANLP, pp , [35] A. Tsymbal, The poblem of concept dift: definitions and elated wok, Technical Repot TCD-CS Compute Science Depatment, [36] Y. Koen, Collaboative filteing with tempoal dynamics, Communications of the ACM, vol. 53, no. 4, pp , [37] R. L. Gaham, Concete mathematics: a foundation fo compute science. Peason Education India, [38] G. Comode and S. Muthukishnan, Appoximating data with the count-min data stuctue, IEEE Softwae, [39] T. M. Mitchell, Machine leaning. pp. I XVII, [40] H. Wang, W. Fan, P. S. Yu, and J. Han, Mining concept-difting data steams using ensemble classifies, in Poc. of KDD, 2003, pp [41] D. Yang, D. Zhang, Z. Yu, and Z. Wang, A sentiment-enhanced pesonalized location ecommendation system, in Poce. of HT, 2013, pp [42] D. Yang, D. Zhang, Z. Yu, and Z. Yu, Fine-gained pefeenceawae location seach leveaging cowdsouced digital footpints fom lbsns, in Poceedings of the 2013 ACM intenational joint confeence on Pevasive and ubiquitous computing. ACM, 2013, pp [43] Z. Yu, H. Xu, Z. Yang, and B. Guo, Pesonalized tavel package with multi-point-of-inteest ecommendation based on cowdsouced use footpints, IEEE Tansactions on Human-Machine Systems, vol. 46, no. 1, pp , [44] F. M. Hape and J. A. Konstan, The movielens datasets: Histoy and context, ACM Tansactions on Inteactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, pediction. Dingqi Yang is cuently a senio eseache at the Univesity of Fiboug in Switzeland. He eceived the Ph.D. degee in compute science fom Piee and Maie Cuie Univesity and Institut Mines-TELECOM/TELECOM SudPais, whee he won both the CNRS SAMOVAR Doctoate Awad and the Institut Mines-TELECOM Pess Mention in His eseach inteests include big social media data analytics, ubiquitous computing, and smat city applications. Bin Li eceived the Ph.D. degee in compute science fom Fudan Univesity, Shanghai, China. He was a Senio Reseach Scientist with Data61 (fomely NICTA), CSIRO, Eveleigh, NSW, Austalia and a Lectue with the Univesity of Technology Sydney, Boadway, NSW, Austalia. He is cuently an Associate Pofesso with the School of Compute Science, Fudan Univesity, Shanghai, China. His cuent eseach inteests include machine leaning and data analytics, paticulaly in complex data epesentation, modeling, and Laua Rettig is a Ph.D. student at the Univesity of Fiboug, Switzeland unde the supevision of Philippe Cudé-Mauoux. He aeas of inteest ae big data infastuctues, semantic data, entity linking, data steams, and deep leaning. She eceived he Maste s degee fom the Univesity of Fiboug, with he thesis witten on big data steaming using eal-wold telecommunication data duing an intenship at Swisscom, fo whch she won two best thesis awads at the Univesity of Fiboug. Philippe Cude-Mauoux is a Full Pofesso and the diecto of the exascale Infolab at the Univesity of Fiboug in Switzeland. He eceived his Ph.D. fom the Swiss Fedeal Institute of Technology EPFL, whee he won both the Doctoate Awad and the EPFL Pess Mention. Befoe joining the Univesity of Fiboug he woked on infomation management infastuctues fo IBM Watson Reseach, Micosoft Reseach Asia, and MIT. His eseach inteests ae in next-geneation, Big Data management infastuctues fo non-elational data. Webpage:

Journal of World s Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012

Journal of World s Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012 2011, Scienceline Publication www.science-line.com Jounal of Wold s Electical Engineeing and Technology J. Wold. Elect. Eng. Tech. 1(1): 12-16, 2012 JWEET An Efficient Algoithm fo Lip Segmentation in Colo