Multi-modal Clustering for Multimedia Collections

Size: px

Start display at page:

Download "Multi-modal Clustering for Multimedia Collections"

Peter Houston
6 years ago
Views:

1 Multi-modal Clustering for Multimedia Colletions Ron Bekkerman and Jiwoon Jeon Center for Intelligent Information Retrieval University of Massahusetts at Amherst, USA {ronb Abstrat Most of the online multimedia olletions, suh as piture galleries or video arhives, are ategorized in a fully manual proess, whih is very expensive and may soon be infeasible with the rapid growth of multimedia repositories. In this paper, we present an effetive method for automating this proess within the unsupervised learning framework. We exploit the truly multi-modal nature of multimedia olletions they have multiple views, or modalities, eah of whih ontributes its own perspetive to the olletion s organization. For example, in piture galleries, image aptions are often provided that form a separate view on the olletion. Color histograms (or any other set of global features) form another view. Additional views are blobs, interest points and other sets of loal features. Our model, alled Comraf* (pronouned Comraf-Star), effiiently inorporates various views in multi-modal lustering, by whih it allows great modeling flexibility. Comraf* is a light-weight version of the reently introdued ombinatorial Markov random field (Comraf). We show how to translate an arbitrary Comraf into a series of Comraf* models, and give an empirial evidene for omparable effetiveness of the two. Comraf* demonstrates exellent results on two real-world image galleries: it obtains times higher auray ompared with a uni-modal k-means. 1. Introdution Clustering is the ore omponent in many data management systems. Reent explosion of multimedia information on the WWW demands new lustering tehniques that an seamlessly handle omplex strutures of multimedia data. Multimedia information is intrinsially multi-modal. The word modality an be interpreted in many different ways in this paper, modality means the type of input. For example, image aptions and olor histograms are different types of input to an image proessing system and therefore we onsider them as two separate modalities. Eah modality gives a different aspet of the data, and eah modality has its own dependeny relationship with other modalities. For example, image aptions tend to desribe events aptured on the image, while annotations usually list salient objets in the image. Color histograms and texture features onvey visual information to the system. In this paper, we develop an effetive multimedia information lustering system based on the reently introdued ombinatorial Markov random field (Comraf) [4]. Comraf is a speial type of a Markov random field (MRF), designed to be a onvenient framework for multi-modal learning in general, and for multi-modal lustering in partiular. In Comrafs, eah data modality is represented by a single node, orresponding to a random variable of rih struture (alled a ombinatorial random variable); undireted edges are drawn between modalities that stay in a statistial interation with eah other. We fous on multi-modal lustering of image olletions, when multiple views on the olletion are available. Image lustering an be a useful omponent in a retrieval system [7], it an also be a stand-alone appliation, for example, for onstruting semanti groups of image retrieval results [19], or for browsing image olletions [1]. Being fully unsupervised, existing lustering methods often demonstrate poor performane. We show that by employing the multi-modal learning paradigm we an signifiantly improve lustering results. Multiple modalities are a heap form of supervision: while it is expensive to reate a large dataset, eah element of whih is labeled with its semanti ategory, it is usually straightforward to obtain another, orthogonal type of labels, over the data modalities. The idea of lustering images using both low-level image features and surrounding text (i.e. grouping together visually similar and semantially related images) has attrated the lose attention of researh ommunity. Barnard et al. [1] propose a generative hierarhial model for image lustering, in whih every node generates words and blobs based on the given probability distributions for that node. Higher level nodes generate more general terms and lower level nodes generate more speifi terms. The EM algorithm is used to fit the model. This approah an handle /07/$ IEEE

2 only two feature types (words, blobs); to handle more types, the model and the learning proedure must be revised. Cai et al. [6] luster Web image searh results using visual, textual and link analysis. They extrat text relevant to the image using a vision-based page segmentation algorithm. First, only text and hyperlink data is used to luster images. The resulting lusters are lustered again using low-level image features. Loeff et al. [16] apply a similar approah: they alulate a histogram of gradient magnitude of the pixel values from every interest point and then luster images using these loal features with global olor histograms and surrounding text. Both Cai et al. and Loeff et al. use spetral lustering methods, where the affinity sores for every pair of images and every modality must be alulated, resulting in an unrealistially heavy representation. Bipartite spetral graph partitioning [8] is useful for olustering two modalities suh as douments and words. Gao et al. [11] extend this method to handle one more modality. In their tripartite graph model, nodes are arranged in three layers: words, images and image features. To handle more modalities, Gao et al.[12] propose another method that is most losely related to our work: they organize modalities in a star struture of interrelationships, where a entral modality is onneted to all the others. They treat this problem as fusion of multiple pairwise o-lustering problems. Eah sub-problem is solved using the bipartite graph partitioning method. Our approah has a few advantages over the others. First, our method has no pratial limitation in the number of modalities as long as the pairwise interation data is available addition of a modality inreases the omputational omplexity only linearly. Seond, our model an luster multiple modalities while taking into aount other modalities, whih do not have to be lustered. Third, our information-theoreti lustering method does not rely on hard-to-obtain affinity matries of individual modalities. Instead, easily omputable ontingeny tables of interating modalities are used. Overall, our paper proposes a general framework for lustering multimedia olletions, whih an be straightforwardly applied to video data, sound traks, hypertext et. as well as to any of their ombinations. 2. Combinatorial Markov Random Fields Definition 1 A ombinatorial random variable X is a disrete random variable defined over a ombinatorial set (i.e. a set of all subsets, partitionings, partial orderings et. of a given finite set). In this paper, we fous on the task of hard lustering, or partitioning, of a given set. For this task, we define a ombinatorial random variable over all the possible partitionings of the set. Note that a ombinatorial random variable, while being an ordinary disrete random variable with a finite domain, has a unique property: in most real-world ases, the event spae of X is so large that the distribution P (X ) annot be expliitly speified. For example, onsider a disrete random variable X with three values {red, green, blue} seleted aording to a probability mass funtion { 1 2, 1 3, 1 6 }. A ombinatorial random variable X an then take five values: {{red, green, blue}}, {{red, green}, {blue}}, {{red, blue}, {green}}, {{green, blue}, {red}}, and {{red}, {green}, {blue}}. Assume that the underlying probability mass funtion of X is { 1 5, , 1 5, 1 5, }, whih is unknown to the pratitioner. If the task is to find the most probable partitioning, a long sampling proess should be applied. This proess will be infeasibly long if the event spae of X onsists of, say, one hundred values. For eah partitioning x, we find it useful (see Setion 2.2) to define a disrete random variable X over lusters in x. Continuing our example above, for x = {{red, blue}, {green}} the variable X takes two values: {red, blue} and {green}. Note that the probability of a luster is a sum of probabilities of its elements, so that the probability mass funtion of X here is { , 1 3 }. Definition 2 A ombinatorial Markov random field (Comraf) is an MRF, at least one node of whih is a ombinatorial random variable. In this paper we onsider only Comraf models, eah node of whih is a ombinatorial random variable. Beause of the uniqueness of ombinatorial random variables, it is impossible to apply existing MRF inferene tehniques to Comrafs, therefore a speially designed inferene method is used, whih is based on ombinatorial optimization. One of the plausible harateristis of Comraf models is ompatness: a useful Comraf model an onsist of just a handful of nodes. Compatness makes the model analysis muh easier; also, hoosing the best Comraf model (for a partiular problem) is a manageable task, as the number of possible variations is relatively small. In our future work, we will apply model learning, whih is feasible in Comrafs Comraf* model for multi-modal lustering Multi-modal (hard) lustering is a problem of simultaneously onstruting m partitionings of m data modalities, e.g. of images, their olors, their interest points, words in their aptions et. By lustering modalities simultaneously, one would overome statistial sparseness of the data representation, leading to a dense, smooth joint distribution of the modalities, whih would result in potentially more aurate lusterings than the ones obtained separately. Note that in most real-world ases, a pratitioner is interested in lustering only one modality (images, in our ase), whih we all a target modality. This implies that not every

3 modality has to be lustered: if a representation of a modality is dense enough, lustering it may ause an underestimation of the joint (an effet known as oversmoothing), whih may hurt lustering results of the target modality. For example, if images are distributed over 256 olors, it makes no sense to simultaneously luster images and olors beause the distributions are already dense enough. In this paper, we present a speial ase of Comraf models, in whih only the target modality is lustered, while the representations of all the other modalities are assumed to be dense enough. Eah unlustered modality is represented by an observed ombinatorial random variable. An observed random variable is a variable whose value is preset and fixed (traditionally, suh a variable is shaded on an MRF graph). Reall that a ombinatorial random variable is defined over all the possible lusterings of a given set. In ase of unlustered modalities, the observed value of a orresponding ombinatorial random variable is a lustering of all singleton lusters. For example, given a set {red, green, blue}, the observed value of a orresponding ombinatorial random variable is {{red}, {green}, {blue}}. Eah observed ombinatorial random variable of an unlustered modality is onneted by an edge with a hidden ombinatorial random variable of the target modality. Observed nodes are not onneted to eah other beause they are statistially independent by definition. Hene, the resulting topology of the Comraf model is an asterisk with the target modality in the enter. We all suh a model Comraf*. Examples of Comraf* graphs are given in Figure 1. Despite that only one modality is lustered in Comraf*, it is still a model for multi-modal lustering, as multiple modalities are involved in the lustering proess. The general Comraf model, however, takes are of any number of dense and sparse random variables. In Setion 3.2 we present a Comraf model for simultaneously lustering images and their loal features, while inorporating other (unlustered) modalities. Sine the simultaneous lustering an be omputationally hard, we show how to redue the omputational burden by translating suh a Comraf model into a number of Comraf* models, eah of whih is then optimized sequentially Inferene in Comraf and Comraf* Given a Comraf model over m ombinatorial random variables X = {X0,X 1,...,X m 1}, where X0 orresponds to the target modality, the task is to find the most probable instantiation of X (this task is ommonly referred to as the Most Probable Explanation, ormpe):x MPE = arg max x P (x ). Aording to the Hammersley-Clifford theorem [5], the joint distribution P (x ) is a Gibbs distribution: P (x )= 1 Z f exp j f j(x ), where f j (x ) are arbitrary potential funtions defined over liques in the Comraf graph, and Z f is a normalization fator alled a partition funtion. Let us onsider only the smallest liques, i.e. edges E. If the potential funtions are preset and fixed for eah edge, then the partition funtion beomes a onstant and thus the MPE problem is solved with: x MPE = arg max x e ij E f ij (x i,x j). (1) The simpliity of this model allows to hoose omplex, theoretially justified potential funtions. Here we adopt the idea presented in [3] in a similar setting: as a potential funtion f j, we hoose (weighted) mutual information between variables X i and X j, whih are defined over x i and x j respetively (see the beginning of Setion 2). Thus, our objetive funtion is: (x 0) MPE = arg max x 0 e ij E w ij I( X i ; X j ), (2) subjet to X i = k i and X j = k j. Mutual information between a lustering and another, interating random variable has a long history of being used in various unsupervised settings, starting with the Information Bottlenek [18], and inluding image lustering [13]. For a short review, see [4]. Weights w ij are by default set to 1; non-unity weights an be used to bring widely ranging mutual information terms to the same sale (see an example in Setion 3.2). In Comraf*, where all the edges are attahed to X0 and all the leaves are observed ombinatorial random variables, Equation (1) is transformed into: (x 0) MPE = m 1 m 1 = arg max f j (x 0,x j) = arg max w j I( X 0 ; X j ), x 0 x 0 j=1 j=1 (3) sine X j = X j for the unlustered modalities Comraf* optimization proedure To ompute the weighted sum of pairwise mutual information from Equation (3), the following proedure is used. The input of the proedure is an (empirial) joint distribution P (X 0,X j ) of the underlying data of eah interating pair (X0,X j ). For a given partitioning x 0, the distribution P ( X 0,X j ) is omputed using the umulative rule P ( x 0 ; x j ) = x 0 x 0 P (x 0,x j ). Marginals P ( X 0 ) and P (X j ) are obtained through the marginalization P ( x 0 )= x j P ( x 0,x j ) and P (x j ) = x 0 P ( x 0,x j ). Now we have all the ingredients to alulate the mutual information: I( X 0 ; X j )= x P ( x 0,x j ) log P ( x 0,x j ) P ( x 0 )P (x j ). 0,x j In most real-world ases, it is infeasible to find the global maximum in Equation (3) beause the number of possible

4 G W C G W B C G W C B G? W (a) (b) () (d) Figure 1. Comraf* models: (a) for images G and words W from their aptions; (b) for images, words and olors C ; () for images, words, olors and blobs B ; (d) straightforward generalization to any number of modalities. partitionings x 0 is exponentially large in the size of the X 0 s domain. We apply a loal optimization proedure, in whih we start with some lustering and greedily optimize the objetive while exploring a loal neighborhood of the initial onfiguration. The loal searh is performed by iteratively transferring eah value of X 0 from its urrent luster into a luster suh that the objetive is maximized. Note that we an apply Equation (3) only subjet to X 0 = k 0, i.e. only when the number of lusters is fixed, otherwise suh an optimization an result in a degenerative lustering of all singleton lusters. This does not imply, however, that hierarhial lustering is not appliable to our ase. Clusters an be arbitrarily merged or split, after whih the number of lusters should be fixed and then it is safe to optimize Equation (3). When the greedy optimization onverges to a loal maximum, we may again merge or split lusters, after whih we proeed with the next iteration of the optimization routine et. Thus, we an employ both top-down and bottom-up lustering proedures. In the top-down proedure, we start with one luster that ontains all the values of X 0 and split it until the required number of lusters is obtained (while interleaving with the optimization routine). In bottom-up lustering, we start with all singleton lusters and merge them until, again, reahing the required number of lusters. Computational omplexity of the top-down algorithm is O(l X 0 m 1 j=1 X j ), and of the bottom-up algorithm O(l X 0 2 m 1 j=1 X j ), where l is a (fixed) number of lustering iterations. Note that an arbitrary number of leaves (unlustered modalities) an be inorporated into the Comraf* model, while adding new modalities inreases the omplexity only linearly. 3. Modalities In this work, along with images, we onsider three other modalities. The first one is words from image aptions. We remove stopwords and apply simple s -stemming (removal of plural suffixes). A joint probability of an image g and awordw is P (g, w) = Nw g W, where N w g is the number of ourrenes of w in g s aption, W is the total number of words. Another modality is olors appearing in images. The joint probability distribution of olors and images is obtained from olor histograms, as a number of pixels of olor in image g divided by the total number of pixels in all images. The third modality is blobs, as desribed below Retangular blobs Blobs (or visual terms) are a speial type of image ontent representation based on a fixed voabulary. To generate blobs, images are first segmented into regions, whih are then lustered aross all images. Blobs are the resulting region lusters. Eah image is mapped onto the set of blobs whih leads to in a representation analogous to the bag-of-words (BOW) in text proessing. Barnard and Forsyth [2] and Duygulu et al. [9] segment images into semantially oherent regions using Blobworld and Normalized-Cuts algorithms. Unfortunately, these algorithms do not always produe segmentations aurate enough for further use. Jeon and Manmatha [15] and Feng et al. [10] use a retangular grid to segment images and report better results on an image retrieval task. We apply the same set of blobs as in [10], built using the following proedure. Images are first segmented to regions using a 6 by 4 grid. Then, for eah region, a feature vetor is onstruted that ontains texture and olor information: Gabor texture filters with 4 orientations and 3 sales are used to onstrut 12 dimensional texture features; the mean, standard deviation and skewness of RGB and LAB omponents are omputed to build 18 dimensional olor features. The resulting 30 dimensional feature vetors are lustered using k-means Blobs onstruted by Comraf models As disussed in Setion 3.1 above, a lustering proess is involved in onstruting blobs from retangular regions, represented by olor and texture features. Naturally, sine Comrafs are models for multi-modal lustering, an intrinsi Comraf model an be used for simultaneously lustering images and their regions. Co-lustering of images and features has been reently desribed in literature [17], however, Comrafs have an additional power over o-lustering methods: Comrafs an inorporate multiple modalities, both sparse (that are to be lustered) and dense (that are not). Figure 2 (left) shows a Comraf model for lustering images G simultaneously with their regions R, taking into aount olor C and texture T information of the regions, as well as the olors and aption words W of the images. Obviously, more edges and nodes an be added to the model, depending on the data availability. In Setion 2.3 we mentioned that the input of a Comraf inferene proedure is a set of pairwise probability tables P (X i,x j ) for eah edge in the Comraf graph. An interesting ase is the (G,R ) edge between image and region

5 R T C G W 1. R T C 2. R C G W Figure 2. (left) A Comraf model for simultaneously lustering images G and their retangular regions R, while taking into aount words W from image aptions, olors C and texture data T ; (right) a translation of this model into a two-step Comraf*: the first Comraf* is for lustering regions into blobs, whereas the seond Comraf* is for lustering images based on these blobs. ombinatorial random variables in Figure 2 (left). Unlike olors and aption words, eah region is unique, so for eah region r and eah image g, their joint probability is: P (r, g) = { 1 R, if r g 0, otherwise, where R is the total number of regions in the dataset. Suh a probability mass funtion is useless for lustering regions, beause only regions that belong to the same image an be lustered together. A possible way to resolve this problem would be to estimate this probability by giving a portion of its mass to P (r, g) even if r/ g. Suh an estimation an be made based on omputing similarities between regions of various images, whih is omputationally hard: O( R 2 r ), where r is the size of any region. Comrafs offer an elegant solution to this problem: sine regions are lustered not only based on images, but also based on olors and texture, neither of whih has this problem, we still an use Equation (4). As long as images are lustered in parallel with regions, Equation (4) allows grouping together regions that belong to the same image luster, as desired. Therefore, we apply the Comraf model from Figure 2 (left) as it is. We hoose to luster images bottom-up and regions top-down. In our objetive funtion (2), we ope with the fat that I( R; G) is two orders of magnitude larger than I( R; C) and I( R; T ), by setting the weights of the latter two terms to 100. The simultaneous lustering of images and regions is a time onsuming proess: its omplexity is O( R G ( C + T + W )). We propose a light-weight version of this model, in whih inferene is done in two steps: first, regions are lustered based on their olor and texture features, and then images are lustered based on olors, aption words and region lusters. Suh a model is equivalent to a set of two Comraf* models presented in Figure 2 (right). This model s omplexity is plausible: O( R ( C + T )+ G ( C + R + W )), where R is the number of region lusters. Generalizing this setting, it is easy to see that any Comraf an be translated into a series of Comraf* models. (4) Category # of images Category # of images Birds 152 Christianity 191 Desert 172 Islam 96 Flowers 165 Judaism 187 Trees 190 Personalities 188 Food 187 Symbols 130 Housing 165 OVERALL: 1823 Table 1. Categories (and their sizes) of the IsraelImages dataset. 4. Experimentation We experiment with a variety of partiular Comraf* models (see examples in Figure 1), as well as with the general Comraf models from Figure 2. The experiments are onduted using our open-soure Comraf lustering tool. 1 In all our models, images are lustered agglomeratively. All our results are averaged over 10 independent runs, with the standard error reported. As a baseline, we use the k-means algorithm (SimpleKMeans implementation of WEKA 2 ), where images are represented as BOW of their aptions. Also, our 2-node Comraf* model is equivalent to the hardlustering version of Information Bottlenek (IB) [18] (see [4] for disussion), hene we use it as our baseline as well. For evaluation of our lustering results, we use image lustering auray. LetC be the set of ground truth ategories. For eah image luster g, letµ C ( g) be the maximal number of elements of g that belong to one ategory. Then, the preision of g with respet to C, is defined as Pre( g, C) = µ C ( g)/ g. The miro-averaged preision of the entire lustering g is: Pre(g, C) = g µ C( g)/ g g, whih is the portion of douments that belong to the dominant ategories. For all our experiments, we fix the number of lusters to be equal to the number of ategories, thus Pre(g, C) equals lustering auray Datasets We demonstrate the performane of our lustering methods on two datasets: a subset of the benhmark Corel dataset and a new multimedia dataset, whih we refer to as IsraelImages, olleted by us espeially for this work. The Corel subset 3 has already been used in various previous researh projets [9, 14, 10]. The dataset onsists of 5,000 images from 50 Corel Stok Photo CDs. Eah CD ontains 100 images on the same topi, suh as Sunrises and Sunsets, Mountains of Ameria and Wild Animals. Every image has a aption and an annotation. The aption is a brief desription of the sene and the annotation is a list of objets that appear in the image. An example of an image aption is Man And Boy Fishing Mountain 1 omraf.soureforge.net 2 s.waikato.a.nz/ml/weka 3 kobus.a/researh/data/ev_2002

6 Method Auray k-means: images over aption words 22.0% IB: images/aption words 44.2 ± 1.0% IB: images/olors 24.4 ± 0.2% Comraf*: images/words/olors 54.2 ± 0.9% Two-step Comraf*: Figure 2 (right) 69.0 ± 0.6% General Comraf: Figure 2 (left) 68.6 ± 1.0% Table 2. Clustering results on the IsraelImages dataset. All IB/Comraf results are averaged over 10 independent runs with the standard error of the mean reported after the ± sign. Lake, while Tree People Mountain Water is an annotation for this image. Overall 371 words are used to annotate the olletion. The original dataset has 4,500 training images and 500 test images. Sine our model does not require training, we use 4,500 training images for our experiments and save the remaining 500 images for future use. The seond dataset onsists of 1823 images downloaded from IsraelImages.om. The images reflet main aspets of Israel senery/soiety and are grouped into 11 ategories (see Table 1). Eah image is 375 by 250 pixels and has a 1 to 18 words long aption. This dataset is available to the researh ommunity Results and disussion Our results on the IsraelImages dataset are reported in Table 2. Adding the olor modality to the aption BOW improves the lustering result by 10% (on an absolute sale), whereas adding the regions (in a 2-step Comraf* sheme) leads to an additional 15% improvement. These findings demonstrate the value of multi-modal setting in image lustering. The general Comraf model from Figure 2 (left) is not able to outperform the 2-step Comraf*. This is probably due to the fat that olor and texture information is more important for lustering regions than the orrespondene between regions and image lusters. We also experiment with various levels of olor granularity in a 3-node Comraf* setting (from Figure 1b) the results are presented in Figure 3 (left). As an be seen, if the olor information is detailed enough (above 216 olors), the differene in the results is statistially insignifiant. Figure 3 (enter) shows the results of the 2-step Comraf* over various numbers of olors for lustering regions. Generally, less olors are needed for lustering regions than for lustering images: 216 olors appear to be too many. A summary of our results on the Corel dataset is presented in Table 3. It shows surprisingly similar trends as for IsraelImages. On a 3-node setup with aption words and blobs we obtain 59.4% auray, whih is espeially impressive given that a random assignment of images into ronb/image_lustering.html Method Auray k-means: images over aption words 22.0% IB: images/aption words 46.6 ± 0.5% IB: images/olors 22.5 ± 0.2% IB: images/blobs (see Setion 3.1) 24.7 ± 0.3% Comraf*: images/words/olors 55.3 ± 0.5% Comraf*: images/words/blobs 59.4 ± 0.5% Comraf*: images/words/olors/blobs 60.1 ± 0.3% Two-step Comraf*: Figure 2 (right) 61.2 ± 0.4% IB: images/annotation words 58.6 ± 0.3% Table 3. Clustering results on the Corel dataset. All IB/Comraf results are averaged over 10 independent runs with the standard error of the mean reported after the ± sign. lusters would lead to 2% auray (our result is 30 times above random). Adding the olor modality improves this result only insignifiantly (as expeted, sine blobs already inorporate the olor information, among with texture). The suess of 3-node and 4-node Comraf* lustering models is also supported by the fat that they outperform a 2-node supervised lustering model, in whih images are lustered with respet to their annotations assigned by human experts. The 2-step Comraf* shows some further (minor) improvement over the 1-step Comraf* models. Here, in ontrast to IsraelImages, 8 olors are enough for lustering regions, and adding more olors auses a signifiant drop in the performane. We suspet that the Corel dataset is too simple : it ontains many images that are almost idential to eah other, therefore more advaned lustering models lead to no (or minor) gain over the simpler ones. Analogously to our IsraelImages experiment with various sizes of olor sets, we test various numbers of blobs on Corel. In previous work [9, 14], the number of blobs is set to 500, to (roughly) orrespond to the number of annotation keywords. Here we show that 500 blobs are not enough for lustering: when moving from 1000 to 2000 blobs, a signifiant boost in the system s performane an be seen. Figures 4 and 5 are illustrations of the quality of multimodal setup: unrelated groups of images are mixed together when the lustering is based only on aption words, whereas they are niely separated when a visual modality is added. 5. Conlusion In this paper, we have introdued the powerful Comraf framework to lustering multimedia olletions. We have also proposed a family of lightweight Comraf models alled Comraf*, whih demonstrate exellent performane on lustering two real-world data olletions. To further improve the results, a semi-supervised Comraf setting [4]may be used, in whih a few labeled examples are taken into aount in the lustering proess. We plan to experiment with this setting in our future work.

7 Clustering auray on IsraelImages images/words images/words/olors Clustering auray on IsraelImages Clustering auray on Corel images/words images/words/blobs auray auray images/words 2 step Comraf* auray number of olors for image lustering number of olors for region lustering number of blobs Figure 3. Experimentation with various numbers of: (left) olors on IsraelImages in a 3-node images/words/olors Comraf*; (enter) olors for lustering regions in the 2-step Comraf* on IsraelImages; (right) blobs on Corel in a 3-node images/words/blobs Comraf*. Our baseline is the 2-node images/words lustering result. Left and right graphs show the same trend: after reahing a ertain number of olors (256) or blobs (2000), the results vary only insignifiantly. The entral graph, however, shows that too many olors for lustering regions an hurt. Designing general Comraf models for image lustering (in flavor of the model shown in Figure 2 left) is an ongoing proess. Various setups should still be tested, various modalities should be inorporated. For example, interest points of images is an important modality not to be ignored. Extensive experimentation will lead to disovering the optimal Comraf setting for lustering multimedia olletions. Aknowledgements This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advaned Researh Projets Ageny (DARPA) under ontrat number HR C Referenes [1] K. Barnard, P. Duygulu, and D. A. Forsyth. Clustering art. In Proeedings of CVPR, pages , [2] K. Barnard and D. A. Forsyth. Learning the semantis of words and pitures. In Proeedings of ICCV-8, pages , [3] R. Bekkerman, R. El-Yaniv, and A. MCallum. Multi-way distributional lustering via pairwise interations. In Proeedings of ICML-22, pages 41 48, [4] R. Bekkerman, M. Sahami, and E. Learned-Miller. Combinatorial Markov Random Fields. In Proeedings of ECML- 17, , 3, 5, 6 [5] J. Besag. Spatial interation and statistial analysis of lattie systems. Journal of the Royal Statistial Soiety, 36(2): , [6] D. Cai, X. He, Z. Li, W.-Y. Ma, and J.-R. Wen. Hierarhial lustering of www image searh results using visual, textual and link information. In Proeedings of the 12th international onferene on Multimedia, pages , [7] Y. Chen, J. Z. Wang, and R. Krovetz. Clue: Cluster-based retrieval of images by unsupervised learning. IEEE Transations on Image Proessing, 14(8): , [8] I. S. Dhillon. Co-lustering douments and words using bipartite spetral graph partitioning. In Proeedings of SIGKDD-7, pages , [9] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Objet reognition as mahine translation: Learning a lexion for a fixed image voabulary. In Proeedings of ECCV-7, , 5, 6 [10] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple bernoulli relevane models for image and video annotation. In Proeedings of CVPR, pages , , 5 [11] B. Gao, T.-Y. Liu, T. Qin, X. Zheng, Q.-S. Cheng, and W.-Y. Ma. Web image lustering by onsistent utilization of visual features and surrounding texts. In Proeedings of the 13th international onferene on Multimedia, [12] B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma. Consistent bipartite graph o-partitioning for star-strutured high-order heterogeneous data o-lustering. In Proeeding of SIGKDD-11, pages 41 50, [13] J. Goldberger, S. Gordon, and H. Greenspan. Unsupervised image-set lustering using an information theoreti framework. Transations on Image Proessing, 15, [14] J. Jeon, V. Lavrenko, and R. Manmatha. Automati image annotation and retrieval using ross-media relevane models. In Proeedings of SIGIR-26, pages , , 6 [15] J. Jeon and R. Manmatha. Using maximum entropy for automati image annotation. In Proeedings of the 5th International Conferene on Image and Video Retrieval, pages 24 32, [16] N. Loeff, C. O. Alm, and D. A. Forsyth. Disriminating image senses by lustering with multimodal features. In Proeedings of COLING/ACL, pages , [17] G. Qiu. Image and feature o-lustering. In Proeedings of ICPR-17, pages , [18] N. Tishby, F. Pereira, and W. Bialek. The information bottlenek method, Invited paper to the 37th Allerton Conferene on Communiation, Control, and Computing. 3, 5 [19] Y. Uematsu, R. Kataoka, and H. Takeno. Clustering presentation of web image retrieval results using textual information and image features. In Proeedings of the 24th IASTED international onferene on Internet and multimedia systems and appliations, pages ,

(a) Clustering results using only aption words, Corel dataset (b) Clustering results using words and blobs, Corel dataset Figure 4. Corel dataset. The first row shows lustering results using only words.

The seond and the third rows show lustering results using both words and blobs.

(a) Clustering results using only aption words, IsraelImages dataset (b) Clustering results using words and olor histograms, IsraelImages dataset Figure 5.

8 (a) Clustering results using only aption words, Corel dataset (b) Clustering results using words and blobs, Corel dataset Figure 4. Corel dataset. The first row shows lustering results using only words. Swimmers and swimming tigers are lustered together beause they share ommon terms like water and swim. The seond and the third rows show lustering results using both words and blobs. The swimmers and the swimming tigers are now in two different lusters with other similar images. (a) Clustering results using only aption words, IsraelImages dataset (b) Clustering results using words and olor histograms, IsraelImages dataset Figure 5. IsraelImages dataset. People portraits and pitures of the menorah monument are lustered together using aption words beause they have a word Knesset (the Israeli parliament) in ommon: the individuals are Knesset members, while the menorah monument is plaed in front of the Knesset building. The problem is resolved after the olor modality is added.

Learning Convention Propagation in BeerAdvocate Reviews from a etwork Perspective. Abstract

CS 9 Projet Final Report: Learning Convention Propagation in BeerAdvoate Reviews from a etwork Perspetive Abstrat We look at the way onventions propagate between reviews on the BeerAdvoate dataset, and