Visual Dictionary: Towards a Higher-level Visual Representation for Object Categorization. CHUA, Tat-Seng School of Computing


1 Visual Dictionary: Towards a Higher-level Visual Representation for Object Categorization
CHUA, Tat-Seng, School of Computing
Outline of Talk: Introduction; Current Approaches; Towards Higher Level Visual Representation; Current Developments; Conclusions

2 The New Information Age
The Internet has revolutionized the way information is created, disseminated and consumed. The mixture of info available has changed from purely text to include multimedia data and live media. Emergence of a huge amount of end-user generated data and managed systems like Wikipedia, and more recently Wikimedia. Greater connectivity leads to a huge amount of live info. The WWW has also rapidly gone mobile, permitting access from anywhere.
How Big is the Internet? Estimates suggest ~22.34 billion indexed pages (Sep. 2007). Studies claimed that the deep (unindexed) web is ~500 times larger than the indexed web. Multimedia contents are increasing at an exponential rate: 31 million hours of video were produced each year (2006), and over 65% of Web traffic is on multimedia contents (2007). Like the deep Web, the amount of info available in the live Web is unknown.

3 Information Bottleneck?
Users retrieve text and multimedia info on a routine basis, but other than text, it is hard to find non-textual media. Semantic gaps: non-textual media are hard to process. How do we efficiently find these images when there is no text annotation? What about live sensor media? Can we process an array of sensors to find out if a missing girl in a blue dress with a red bag has passed by the area in the last few days?
Visual Content Analysis (the focus of this talk): we need to automatically analyze image/video contents to extract semantic information, useful for processing both traditional and live media. A statement of the problem: given an image (or video), what concepts are present? Car, Building, Road, City, Outdoor? Cow, Truck, Road, Village, Outdoor? Pool, Landscape, Indoor?

4 Visual Content Analysis -2
CHALLENGES: How to ensure correct and complete annotation? How to make use of multi-faceted knowledge (like socially annotated keywords, associated text from Web pages, etc.)?
Auto extraction of concepts has many applications: concept annotation and propagation (assign new or additional concepts to an image/video set); image/video retrieval. A simulation study shows that an average concept detection accuracy above 0.1 is useful for general image retrieval.
Outline of Talk: Introduction; Current Approaches; Towards Higher Level Visual Representation; Current Developments; Conclusions

5 Visual Content Representation
Current approaches employ a combination of low-level and high-level visual features.
Low-level global features: color histogram; color correlogram; grid-based color moments; edge/direction histogram; texture.
Low-level local features: local-region descriptors such as SIFT; Bag of Words; motion.
Higher-level features: face detection (and recognition); text (user annotated or extracted from the Web); simple high-level features.
Low-Level Global Features: the most widely used are the color histogram, color correlogram, grid-based color moments, edge/direction histogram and texture. They are provided in most publicly available test datasets; the NUS-WIDE dataset, comprising 270,000 images, pre-computes all of the above features for every image. Advantages: easy and efficient to extract and use, and reasonably effective. Retrieval example based on the TRECVID dataset.
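A minimal sketch of one of these global features, the color histogram (Python with NumPy; the 4-bins-per-channel joint layout is an assumption, chosen to yield the 64-D histogram that NUS-WIDE provides):

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """64-D RGB color histogram: quantize each channel into 4 bins and
    count joint (r, g, b) bin occupancy, L1-normalized."""
    # Map each 0-255 channel value to a bin index in [0, bins_per_channel)
    q = (image.astype(np.int64) * bins_per_channel) // 256
    # Joint bin index over the three channels
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()

# A toy 8x8 "image": top half pure red, bottom half pure blue
img = np.zeros((8, 8, 3), dtype=np.uint8)
img[:4, :, 0] = 255   # red channel
img[4:, :, 2] = 255   # blue channel
h = color_histogram(img)
print(h.shape, round(float(h.max()), 2))  # (64,) 0.5
```

Whether to bin channels jointly (as here) or independently is a design choice; joint binning captures color combinations at the cost of a larger, sparser vector.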

6 Concept Annotation in TRECVID
TRECVID: large-scale video annotation and retrieval evaluation hosted by NIST. o TRECVID-07: 100-hour documentary video corpus of ~40K keyframes. o Annotate 20 concepts.
Concept Annotation in TRECVID -2
Global visual features used: o color histogram (CH) o color moments (CM) o wavelet texture (WT) o texture co-occurrence (TC)
Machine learning techniques: o use SVM for uni-modal training o SVM and averaging for fusion
Performance: o a good baseline MAP, with fusion of CM, CH, WT and TC outperforming the individual features o could achieve further gains using OWA fusion
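The two fusion schemes mentioned above, plain averaging and OWA (Ordered Weighted Averaging), can be illustrated on toy per-modality detector scores (the score values and OWA weights below are invented for the example):

```python
import numpy as np

def average_fusion(scores):
    # scores: (n_features, n_samples) detector outputs in [0, 1]
    return scores.mean(axis=0)

def owa_fusion(scores, weights):
    # Ordered Weighted Averaging: per sample, sort the per-feature
    # scores in descending order, then take a weighted sum.
    ordered = np.sort(scores, axis=0)[::-1]
    return np.tensordot(weights, ordered, axes=1)

# Toy scores for 3 keyframes from 4 feature-specific SVM detectors
scores = np.array([
    [0.9, 0.2, 0.6],   # color moments
    [0.8, 0.1, 0.5],   # color histogram
    [0.7, 0.3, 0.4],   # wavelet texture
    [0.6, 0.2, 0.3],   # texture co-occurrence
])
w = np.array([0.4, 0.3, 0.2, 0.1])  # emphasize the most confident detectors
print(average_fusion(scores))   # approx [0.75, 0.2, 0.45]
print(owa_fusion(scores, w))    # approx [0.80, 0.23, 0.50]
```

OWA rewards keyframes on which at least some detectors fire strongly, whereas plain averaging treats all detectors identically.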

7 Low-Level Local Features -1
SIFT (Scale-Invariant Feature Transform): a popular local-region descriptor; it is effective and robust.
Bag of visual words (or alphabets): a quantized (clustered) vector of SIFT features, on the order of 500-2,000 words; represent the image as a histogram of visual words over a visual codebook.
Low-Level Local Features -2: matching examples (an un-filtered example and filtered examples).
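Once a codebook has been clustered, the bag-of-visual-words histogram described above reduces to nearest-centroid assignment; a minimal sketch (the tiny hand-written 2-D codebook stands in for k-means-clustered 128-D SIFT centroids):

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codebook centre
    (a visual word), then count assignments into a normalized histogram."""
    # Squared Euclidean distance from every descriptor to every visual word
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy codebook: 8 visual words in 2-D (real SIFT words live in 128-D)
codebook = np.array([[0, 0], [5, 0], [0, 5], [5, 5],
                     [10, 0], [0, 10], [10, 10], [5, 10]], dtype=float)
descriptors = codebook[[0, 0, 3, 5]] + 0.1   # 4 "keypoints" near known words
h = bovw_histogram(descriptors, codebook)
print(h)   # word 0 counted twice: h[0] == 0.5, h[3] == h[5] == 0.25
```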

8 Concept Annotation in TRECVID
Testing on the TRECVID 2007 dataset with the use of bag-of-words (SIFT) features o shows vast improvement in MAP over the use of global features (CM, CH, WT, TC vs. bag of words and fusion).
TRECVID Annotation Examples: accuracy is still not sufficient (MAP below 0.15). Success and failure examples: sports, computer screen, boat.

9 Higher Level Features
Face: a commonly used high-level feature. Frontal face detectors are very accurate; multi-view face detectors are reasonable. {face, body, body matching} for people identification: useful for consumer and home photos, where people are key. (Examples: detected face, face/body mask, body.)
Outline of Talk: Introduction; Current Approaches; Towards Higher Level Visual Representation; Current Developments; Conclusions

10 Problems with Current Visual Representation
Low-level local features based on bags of visual words are effective to a certain extent, but they are still too low-level. An ideal visual representation is: discriminative among object classes (large inter-class distance, for categorization); invariant within the class (small intra-class distance, for generalization). The visual word is neither discriminative nor invariant: topological proximity in visual word space ≠ semantic relevance, as objects can have arbitrarily different appearances (the semantic gap).
One visual word may have different semantic meanings in different contexts: o the visual word is a result of vector quantization o objects of different classes may share similar local appearances — a polysemy phenomenon (visual word, patch, descriptor).
Multiple visual words may have the same semantic meaning: o diversity of objects of the same class o consequence: large intra-class variations — a synonymy phenomenon.

11 Towards Higher Level Visual Representation
Several recent efforts aim towards higher level visual representation: o 1st category: explore deeper understanding of how the brain perceives the world to improve computer vision (Ullman, 2007; Serre, 2007; Karklin & Lewicki, 2008) o 2nd category: extend computer vision approaches to extract more distinctive higher level feature configurations (Yuan, 2007; Quack et al., 2007) o 3rd category: borrow ideas from NLP to extract higher level representations (Zheng et al., 2008) o The availability of large scale knowledge bases such as Wikimedia and on-line visual dictionaries helps.
1st Category: Brain-Media Approach -1
Observations: there are still huge gaps between how the human brain and computers view objects; semantic hierarchies have been shown to be useful for object recognition and concept annotation. Ullman et al. (2007) built a feature hierarchy for object recognition based on informative fragments, demonstrated for a few specific classes. (Informative fragment examples; hierarchy of abstract features.)

12 1st Category: Brain-Media Approach -2
To understand which features are extracted and represented by the visual cortex, Serre et al. (2007) provide a hierarchical feature representation with cortex-like mechanisms. o The model consists of four layers (input image → S1 → C1 → S2 → C2) that replicate the way neurons process input and output stimuli: the S layers correspond to simple cells in primary visual cortex (V1), the C layers to cortical complex cells. Selectivity: Gaussian-like function for tuning and specificity. Tolerance: maximum-like operation for invariance over positions and scales. The C1 and C2 standard model features (SMF) are extracted for object recognition and scene understanding.
1st Category: Brain-Media Approach -3
Karklin et al. (2007) presented a computational model by learning how our brain sees natural scenes. o The model neurons encode the common statistical structure most consistent with a given image. o Neuronal activities show a diverse range of properties observed in cortical cells and have strong discriminative power. Linear features are overlapping; their statistical patterns distinguish them.

13 1st Category: Brain-Media Approach -4
Karklin et al. (2007), cont.: activation patterns for model neurons; images are better separated; distribution patterns in image regions; a distribution coding model.
2nd Category: Vision-based Approach -1
Yuan et al. (2007) proposed a method to discover high level visual phrases according to spatial co-occurrence patterns. o Visual phrases are discovered by using frequent itemset mining. o Each visual phrase is composed of meaningful co-located visual words. (Examples of meaningful itemsets from the car category, red rectangles.)

14 2nd Category: Vision-based Approach -2
Quack et al. (2007) apply data-mining techniques to extract discriminant higher-level feature configurations. o Find spatial configurations of local features occurring frequently in a given object class, and rarely on the background. o Features in the background are pruned away. o The meaningful feature configurations often correspond to semantic object parts, such as the motorbike wheels. (Mine frequent configurations; class-specific feature selection; neighborhood based image description; examples of discriminant frequent spatial configurations.)
3rd Category: Language Approach -1, Overview
Recall that BoVW is not discriminative and invariant because: o objects of different classes may share similar local appearances (a polysemy phenomenon) o multiple visual words may have the same semantic meaning (a synonymy phenomenon).
Borrow ideas from linguistics: o Alphabets → Words → Phrases → Synonyms → Word Sense → parse tree, dependency and semantic parsing, etc.

15 3rd Category: Language Approach -2, Visual Phrase
Leverage spatial and contextual info to build more distinctive patterns. Visual phrase: spatially co-occurring visual words in a support region. o Could formulate visual phrase discovery as a task of Frequent Itemset Mining (FIM). o Discover groups of co-occurring visual words in a spatial neighborhood. o Example: visual phrases BC and AB, built from visual words A and B.
3rd Category: Language Approach -3, Delta Visual Phrase
Weakness of the visual phrase: it contains only spatial co-occurrence info. Design the delta visual phrase to leverage contextual info.
Definitions: let R1, R2, ..., Rk, Rk+1, ... be a set of regions with the same centroid and increasing sizes. Minimal Support Region: region Rk is called the minimal support region of visual phrase π if any smaller region Rk-1 is not large enough to discover the visual phrase π.
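The FIM-style phrase discovery and the minimal-support-region idea above can be sketched on toy keypoints (the coordinates, word labels and thresholds are invented; a real system mines far larger transaction sets with a proper FIM algorithm):

```python
from collections import Counter
from itertools import combinations

def frequent_phrases(points, radius, min_support):
    """One transaction per keypoint: the set of visual-word labels within
    `radius` of it. Word pairs occurring in >= min_support transactions
    are kept as visual phrases (a tiny stand-in for full FIM)."""
    counts = Counter()
    for (x, y, _) in points:
        words = {w for (px, py, w) in points
                 if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2}
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return {p for p, c in counts.items() if c >= min_support}

def delta_phrases(points, r_small, r_large, min_support):
    # Phrases whose minimal support region is the larger radius:
    # discovered at r_large but not already present at r_small.
    return (frequent_phrases(points, r_large, min_support)
            - frequent_phrases(points, r_small, min_support))

# Toy keypoints as (x, y, visual word): A and B co-occur at distance 3
pts = [(0, 0, 'A'), (3, 0, 'B'), (10, 0, 'A'), (13, 0, 'B'), (20, 0, 'C')]
print(delta_phrases(pts, r_small=2, r_large=4, min_support=2))
# {('A', 'B')} -- the AB phrase needs the larger support region
```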

16 3rd Category: Language Approach -4, Delta Visual Phrase
Definitions: the delta visual phrases of region Rk are the newly discovered visual phrases when the support region just grows from Rk-1 to Rk, i.e. the visual phrases that have Rk as their minimal support region. Delta visual phrases are mined by changing the size of the support region. As the support region Rk increases: o its delta visual phrases cross over larger regions o they capture both the spatial co-occurrence and the contextual info of their visual words. o Example: AB will be a newly discovered visual phrase as we increase from R1 to R2.
3rd Category: Language Approach -5, Visual Synset: a more invariant visual unit
Though more distinctive, the delta visual phrase suffers from: o topological proximity in feature space = visual similarity ≠ semantic relevance o large intra-class variance.
Visual synset: a higher level visual unit. o Synset (synonymy set): a set of words with similar semantics. o Define the probabilistic semantics of a visual word/phrase w as its class probability distribution P(c|w), its contribution to the classification of the image it belongs to.

17 3rd Category: Language Approach -6, Visual Synset: a more invariant visual unit
Rationale of the visual synset: o many visual words are intrinsic and indicative of certain classes o these visual words tend to share a similar probability distribution P(c|w), which peaks around their classes. Explore the Information Bottleneck principle to guide the clustering process, to optimize data compression in clustering the probabilistic semantics.
3rd Category: Language Approach -7, Information Bottleneck (IB) Principles
Input: joint distribution P(w, c) of visual words/phrases w and image classes c. Goal: construct the optimal compact representation of w, namely visual synset clusters s, such that s preserves as much information as possible about c. Solution: Lagrange optimization of
L[p(s|w)] = I(S; C) − λ · I(W; S)
where I(S; C) is the mutual info of synsets and classes, and I(W; S) reflects the info loss in clustering visual words/phrases into synsets. IB implementation: sequential IB clustering.
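A toy sketch of IB-style clustering of visual words by their class distributions P(c|w). Note this uses a greedy agglomerative merge driven by Jensen-Shannon divergence (the information lost about C by a merge) rather than the sequential IB algorithm named above, and all distributions and priors are invented:

```python
import numpy as np

def js_divergence(p, q, wp, wq):
    """Prior-weighted Jensen-Shannon divergence between two class
    distributions: the information about C lost by merging them."""
    m = (wp * p + wq * q) / (wp + wq)
    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())
    return (wp * kl(p, m) + wq * kl(q, m)) / (wp + wq)

def agglomerative_ib(pcw, priors, n_synsets):
    """Greedily merge the pair of clusters whose merge loses the least
    mutual information I(S; C), until n_synsets clusters remain."""
    clusters = [([i], pcw[i].copy(), priors[i]) for i in range(len(pcw))]
    while len(clusters) > n_synsets:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: js_divergence(clusters[ab[0]][1], clusters[ab[1]][1],
                                         clusters[ab[0]][2], clusters[ab[1]][2]))
        (mi, pi, wi), (mj, pj, wj) = clusters[i], clusters[j]
        merged = (mi + mj, (wi * pi + wj * pj) / (wi + wj), wi + wj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c[0]) for c in clusters]

# Toy P(c|w): words 0,1 peak on class 0; words 2,3 peak on class 1
pcw = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
priors = np.full(4, 0.25)
print(agglomerative_ib(pcw, priors, 2))  # [[0, 1], [2, 3]]
```

Words with near-identical class distributions collapse into one synset, which is exactly the synonymy-merging behaviour the slide describes.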

18 3rd Category: Language Approach -8, Experiments
Testing dataset: Caltech-101 (102 categories, 9,233 images). Experimental setup: o for each class, 30 randomly selected images for training, the rest for testing. Evaluation criterion: o image classification, mean classification accuracy. Visual word generation: o keypoints: Difference of Gaussian (DoG) and Hessian-Laplacian o descriptors: SIFT and Spin; 2x2 image grids o vector quantization: k-means, 1,100 clusters. Classifier: SVM with RBF kernel.
3rd Category: Language Approach -9, Experiments
Experiment 1: uses only visual words o accuracy = 57.2%. Experiment 2: incorporate delta visual phrases o setting support region sizes to 4, 8 and 12 o varying the number of delta phrases over 1100, 1200, 1300, 1400, 1500, 1700, 1800 and 2000 o at 1400 delta visual phrases, accuracy = 60.2%. (Examples of delta visual phrases.)

19 3rd Category: Language Approach -10, Experiments
Experiment 3: incorporate visual synsets o generate visual synsets from the codebook of 1400 delta visual phrases o set the cardinality of visual synsets to 50, 100, 200, 400, 600, 800, 1000 and 1200 o best accuracy: 62.6%. (Examples of visual synsets.)
3rd Category: Language Approach -11, Experiments: Benchmark
Runs compared by accuracy (%): visual words, delta visual phrases, visual synsets. Comparison with reported systems: [Grauman 05], [Berg 05], [Zhang 05], [Lazebnik 06], [Bosch 07]. NOTE: [Bosch 07] exploited more features with complex kernel matrix learning.

20 3rd Category: Language Approach -12, Observations
The visual synset can give superior accuracy o 600 visual synsets: 62.6%. The visual synset can give a compact representation o 50 visual synsets: 55.2%. Visual synsets fuse semantically consistent visual words/phrases together, which o reduces the intra-class variations and o renders the image distribution more coherent and manageable. The visual synset is a result of supervised dimensionality reduction o properly reduced dimensionality partially resolves the statistical sparseness problem.
Outline of Talk: Introduction; Current Approaches; Towards Higher Level Visual Representation; Current Developments; Conclusions

21 Moving Forward
What's next for our approach? o Recall our starting point: the major causes of the low representative power are the polysemy phenomenon and the synonymy phenomenon. Borrow ideas from linguistics: o Alphabets → Words → Phrases → Synonyms → Word Sense → parse tree, dependency and semantic parsing, etc. The natural next phase is to explore: o Word Sense, parse tree, dependency and semantic parsing issues o building basic visual units or a visual vocabulary to describe image/video contents.
Modeling Visual Context
As images become easier to obtain, the problem of visual diversity of objects (concepts) becomes worse. o For example, there are 79,695 categories of buildings in Wikimedia Commons, and each category shows unique visual properties. How to deal with this problem? o Need more discriminative feature representations o strong generalization or large training data o context is the key.

22 Modeling Visual Context -2
The foundation of both generative models and classification based methods is the estimation of similarity between images. Two images (e.g. of sky, mountain, sea) may show little visual similarity yet express similar semantics: there is inconsistency between visual and semantic similarity. How to deal with this problem? Context is the KEY.
Modeling Visual Context -3
Training data from, say, Wikimedia: o every image is assigned to a category in a hierarchy o images in the same category share similar semantics, and most also share visual similarity. Advantages: o large scale, with people involved o the images in a category are relatively pure (e.g. blue sky, cloudy sky, sunset sky).

23 Modeling Visual Context -4
Examine visual characteristics at the object level. o Traditional visual features are at the image level. o Alternative: utilize multiple segmentations of images to perform visual categorization at the image region level. o Inference of image semantics does not require accurate boundaries of objects.
Model the visual hierarchy and context of different visual classes/themes. o Rationale: though images from different classes can share the same set of visual themes (parts), the distribution and combination of visual themes tend to differ across image classes. o One possible approach: utilize hierarchical Dirichlet processes to model the distribution and context of visual themes.
Large Datasets Available
Fortunately, large scale datasets such as Wikimedia are available: o clean concept hierarchy o large amount of high quality sample images o explicit visual component relationships. Wikimedia example: the Transport hierarchy. The visual ontology can be used to support object recognition and concept annotation.

24 (Wikimedia Transport hierarchy, excerpt)
Transport → Transport by country; Road transport; Cable transport; Aviation; Bicycle transport; Child transport; Patient transport.
Road transport → Bus transport; Objects on roads; Road accidents; Icons for road; Road vehicles.
Road vehicles → Carriages; Automobiles; Bookmobiles; Buses; Motorcycles.
Automobiles → Car company logos; Classic cars; Automobile parts (Tires, Car seats, Automobile lights, Automobile grills, Car door handles, Vehicle mirrors, Automobile gauges).

25 (Wikimedia Transport hierarchy, zoomed in)
Road vehicles → Carriages; Automobiles; Bookmobiles; Buses; Motorcycles.
Automobiles → Car company logos; Classic cars; Automobile parts.
Automobile parts → Tires; Car seats; Automobile lights; Automobile grills; Car door handles; Vehicle mirrors; Automobile gauges.

26 The built visual ontology can be used to support object recognition and concept annotation (e.g. detecting car, street and motorcycle from road-vehicle objects plus visual features).
NUS-WIDE: A Real World Web Image Dataset
o > 300k images and associated tags crawled from Flickr o preserve 269,648 images and their tags o contain 5,018 unique tags o provide 6 sets of visual features o offer ground truth for 81 concepts.

27 NUS-WIDE: Web Image Dataset -1
(Figures: the frequency distribution of tags; the number of tags per image.)
NUS-WIDE: Web Image Dataset -2
Most frequent tags after noise removal: sky, light, nature, sunset, water, sea, blue, white, clouds, people, bravo, night, landscape, beach, green, architecture, red, art, explore, travel.

28 NUS-WIDE: Web Image Dataset -3
List of visual features extracted. Global features: o 64-D color histogram o 144-D color correlogram o 75-D edge direction histogram o 128-D wavelet texture. Grid-based features: o 225-D block-wise color moments, extracted over 5x5 fixed grid partitions. Bag of visual words: o 500-D bag of words based on SIFT descriptors at keypoints in the images.
NUS-WIDE: Web Image Dataset -4
Ground truth for 81 concepts for evaluation: o they are consistent with the concepts described in other literature, such as COREL, CALTECH101 and LSCOM o they mostly correspond to frequent tags in Flickr o they include both general concepts such as animal and specific concepts such as dog and flowers o they belong to different categories including object, scene, event, people, etc. The overall effort for manual annotation: about 3,000 man-hours.

29 NUS-WIDE: Web Image Dataset -6
Exemplars for some concepts: Airport, Birds, Beach, Bridge, Car, Cityscape, Dancing, Fire, Explosion, Glacier, Map, Mountain, Police, Railroad, Rainbow, Reflection, Sign, Sports, Temple, Waterfall, Wedding.

30 NUS-WIDE: Web Image Dataset -7
(Figure: statistics of relevant images.)
NUS-WIDE: Web Image Dataset -8
Benchmark annotation by learning from the tags with kNN. MAP: according to CMU's simulations, this MAP is effective in helping general image retrieval. Thus effective models can be learned to improve web image annotation and retrieval. Web Link:
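The tag-based kNN benchmark above amounts to voting over the tags of visually similar training images; a minimal sketch with invented features and two hypothetical concepts ('sky', 'car'):

```python
import numpy as np

def knn_tag_scores(query, train_feats, train_tags, k=3):
    """Score each concept for a query image as the fraction of its k
    visually nearest training images (Euclidean distance) carrying it."""
    d = ((train_feats - query) ** 2).sum(axis=1)
    nearest = np.argsort(d)[:k]
    return train_tags[nearest].mean(axis=0)

# Toy setup: 6 images with 4-D features, forming two visual clusters;
# binary tag matrix over the two hypothetical concepts
train_feats = np.array([[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0],
                        [0.0, 0.1, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0],
                        [5.1, 5.0, 5.0, 5.0], [5.0, 5.1, 5.0, 5.0]])
train_tags = np.array([[1, 0]] * 3 + [[0, 1]] * 3)
scores = knn_tag_scores(np.zeros(4), train_feats, train_tags, k=3)
print(scores)   # all 3 neighbours tagged 'sky': scores == [1.0, 0.0]
```

Because Flickr tags are noisy, thresholding or weighting these neighbour votes by distance is a natural refinement.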

31 Outline of Talk: Introduction; Current Approaches; Towards Higher Level Visual Representation; Current Developments; Conclusions
Looking Into the Future
Recent research: o towards building higher-level representations for visual contents o various vision-oriented approaches o potential in understanding human visual cognition o DON'T forget about text (tags) and related knowledge.
Great opportunity: o the availability of clean and large evolving datasets makes large scale efforts feasible.
Acceleration of efforts towards a visual ontology and vocabulary: o extend to other media types o the idea of a multimedia dictionary and vocabulary o towards making analysis and retrieval of non-text contents as simple as text.

32 THANK YOU


More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

NUS-WIDE: A Real-World Web Image Database from National University of Singapore

NUS-WIDE: A Real-World Web Image Database from National University of Singapore NUS-WIDE: A Real-World Web Image Database from National University of Singapore Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, Yantao Zheng National University of Singapore Computing

More information

PRISM: Concept-preserving Social Image Search Results Summarization

PRISM: Concept-preserving Social Image Search Results Summarization PRISM: Concept-preserving Social Image Search Results Summarization Boon-Siew Seah Sourav S Bhowmick Aixin Sun Nanyang Technological University Singapore Outline 1 Introduction 2 Related studies 3 Search

More information

Local Features and Kernels for Classifcation of Texture and Object Categories: A Comprehensive Study

Local Features and Kernels for Classifcation of Texture and Object Categories: A Comprehensive Study Local Features and Kernels for Classifcation of Texture and Object Categories: A Comprehensive Study J. Zhang 1 M. Marszałek 1 S. Lazebnik 2 C. Schmid 1 1 INRIA Rhône-Alpes, LEAR - GRAVIR Montbonnot, France

More information

Probabilistic Generative Models for Machine Vision

Probabilistic Generative Models for Machine Vision Probabilistic Generative Models for Machine Vision bacciu@di.unipi.it Dipartimento di Informatica Università di Pisa Corso di Intelligenza Artificiale Prof. Alessandro Sperduti Università degli Studi di

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Tag Recommendation for Photos

Tag Recommendation for Photos Tag Recommendation for Photos Gowtham Kumar Ramani, Rahul Batra, Tripti Assudani December 10, 2009 Abstract. We present a real-time recommendation system for photo annotation that can be used in Flickr.

More information

CS6670: Computer Vision

CS6670: Computer Vision CS6670: Computer Vision Noah Snavely Lecture 16: Bag-of-words models Object Bag of words Announcements Project 3: Eigenfaces due Wednesday, November 11 at 11:59pm solo project Final project presentations:

More information

Using the Forest to See the Trees: Context-based Object Recognition

Using the Forest to See the Trees: Context-based Object Recognition Using the Forest to See the Trees: Context-based Object Recognition Bill Freeman Joint work with Antonio Torralba and Kevin Murphy Computer Science and Artificial Intelligence Laboratory MIT A computer

More information

CS229: Action Recognition in Tennis

CS229: Action Recognition in Tennis CS229: Action Recognition in Tennis Aman Sikka Stanford University Stanford, CA 94305 Rajbir Kataria Stanford University Stanford, CA 94305 asikka@stanford.edu rkataria@stanford.edu 1. Motivation As active

More information

Patch-Based Image Classification Using Image Epitomes

Patch-Based Image Classification Using Image Epitomes Patch-Based Image Classification Using Image Epitomes David Andrzejewski CS 766 - Final Project December 19, 2005 Abstract Automatic image classification has many practical applications, including photo

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Observe novel applicability of DL techniques in Big Data Analytics. Applications of DL techniques for common Big Data Analytics problems. Semantic indexing

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

Local Features and Bag of Words Models

Local Features and Bag of Words Models 10/14/11 Local Features and Bag of Words Models Computer Vision CS 143, Brown James Hays Slides from Svetlana Lazebnik, Derek Hoiem, Antonio Torralba, David Lowe, Fei Fei Li and others Computer Engineering

More information

Computer Vision. Exercise Session 10 Image Categorization

Computer Vision. Exercise Session 10 Image Categorization Computer Vision Exercise Session 10 Image Categorization Object Categorization Task Description Given a small number of training images of a category, recognize a-priori unknown instances of that category

More information

Grounded Compositional Semantics for Finding and Describing Images with Sentences

Grounded Compositional Semantics for Finding and Describing Images with Sentences Grounded Compositional Semantics for Finding and Describing Images with Sentences R. Socher, A. Karpathy, V. Le,D. Manning, A Y. Ng - 2013 Ali Gharaee 1 Alireza Keshavarzi 2 1 Department of Computational

More information

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS

AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS AUTOMATIC VISUAL CONCEPT DETECTION IN VIDEOS Nilam B. Lonkar 1, Dinesh B. Hanchate 2 Student of Computer Engineering, Pune University VPKBIET, Baramati, India Computer Engineering, Pune University VPKBIET,

More information

Contextual priming for artificial visual perception

Contextual priming for artificial visual perception Contextual priming for artificial visual perception Hervé Guillaume 1, Nathalie Denquive 1, Philippe Tarroux 1,2 1 LIMSI-CNRS BP 133 F-91403 Orsay cedex France 2 ENS 45 rue d Ulm F-75230 Paris cedex 05

More information

arxiv: v3 [cs.cv] 3 Oct 2012

arxiv: v3 [cs.cv] 3 Oct 2012 Combined Descriptors in Spatial Pyramid Domain for Image Classification Junlin Hu and Ping Guo arxiv:1210.0386v3 [cs.cv] 3 Oct 2012 Image Processing and Pattern Recognition Laboratory Beijing Normal University,

More information

Exploiting noisy web data for largescale visual recognition

Exploiting noisy web data for largescale visual recognition Exploiting noisy web data for largescale visual recognition Lamberto Ballan University of Padova, Italy CVPRW WebVision - Jul 26, 2017 Datasets drive computer vision progress ImageNet Slide credit: O.

More information

Comparison of Local Feature Descriptors

Comparison of Local Feature Descriptors Department of EECS, University of California, Berkeley. December 13, 26 1 Local Features 2 Mikolajczyk s Dataset Caltech 11 Dataset 3 Evaluation of Feature Detectors Evaluation of Feature Deriptors 4 Applications

More information

Learning Representations for Visual Object Class Recognition

Learning Representations for Visual Object Class Recognition Learning Representations for Visual Object Class Recognition Marcin Marszałek Cordelia Schmid Hedi Harzallah Joost van de Weijer LEAR, INRIA Grenoble, Rhône-Alpes, France October 15th, 2007 Bag-of-Features

More information

VK Multimedia Information Systems

VK Multimedia Information Systems VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at Dienstags, 16.oo Uhr This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Agenda Evaluations

More information

Landmark Recognition: State-of-the-Art Methods in a Large-Scale Scenario

Landmark Recognition: State-of-the-Art Methods in a Large-Scale Scenario Landmark Recognition: State-of-the-Art Methods in a Large-Scale Scenario Magdalena Rischka and Stefan Conrad Institute of Computer Science Heinrich-Heine-University Duesseldorf D-40225 Duesseldorf, Germany

More information

An Introduction to Content Based Image Retrieval

An Introduction to Content Based Image Retrieval CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and

More information

Action recognition in videos

Action recognition in videos Action recognition in videos Cordelia Schmid INRIA Grenoble Joint work with V. Ferrari, A. Gaidon, Z. Harchaoui, A. Klaeser, A. Prest, H. Wang Action recognition - goal Short actions, i.e. drinking, sit

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

Content-Based Image Classification: A Non-Parametric Approach

Content-Based Image Classification: A Non-Parametric Approach 1 Content-Based Image Classification: A Non-Parametric Approach Paulo M. Ferreira, Mário A.T. Figueiredo, Pedro M. Q. Aguiar Abstract The rise of the amount imagery on the Internet, as well as in multimedia

More information

Multimodal Medical Image Retrieval based on Latent Topic Modeling

Multimodal Medical Image Retrieval based on Latent Topic Modeling Multimodal Medical Image Retrieval based on Latent Topic Modeling Mandikal Vikram 15it217.vikram@nitk.edu.in Suhas BS 15it110.suhas@nitk.edu.in Aditya Anantharaman 15it201.aditya.a@nitk.edu.in Sowmya Kamath

More information

A THREE LAYERED MODEL TO PERFORM CHARACTER RECOGNITION FOR NOISY IMAGES

A THREE LAYERED MODEL TO PERFORM CHARACTER RECOGNITION FOR NOISY IMAGES INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONSAND ROBOTICS ISSN 2320-7345 A THREE LAYERED MODEL TO PERFORM CHARACTER RECOGNITION FOR NOISY IMAGES 1 Neha, 2 Anil Saroliya, 3 Varun Sharma 1,

More information

Generic object recognition using graph embedding into a vector space

Generic object recognition using graph embedding into a vector space American Journal of Software Engineering and Applications 2013 ; 2(1) : 13-18 Published online February 20, 2013 (http://www.sciencepublishinggroup.com/j/ajsea) doi: 10.11648/j. ajsea.20130201.13 Generic

More information

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Adding spatial information Forming vocabularies from pairs of nearby features doublets

More information

Multimodal Information Spaces for Content-based Image Retrieval

Multimodal Information Spaces for Content-based Image Retrieval Research Proposal Multimodal Information Spaces for Content-based Image Retrieval Abstract Currently, image retrieval by content is a research problem of great interest in academia and the industry, due

More information

Aggregating Descriptors with Local Gaussian Metrics

Aggregating Descriptors with Local Gaussian Metrics Aggregating Descriptors with Local Gaussian Metrics Hideki Nakayama Grad. School of Information Science and Technology The University of Tokyo Tokyo, JAPAN nakayama@ci.i.u-tokyo.ac.jp Abstract Recently,

More information

Object Classification Problem

Object Classification Problem HIERARCHICAL OBJECT CATEGORIZATION" Gregory Griffin and Pietro Perona. Learning and Using Taxonomies For Fast Visual Categorization. CVPR 2008 Marcin Marszalek and Cordelia Schmid. Constructing Category

More information

Experimenting VIREO-374: Bag-of-Visual-Words and Visual-Based Ontology for Semantic Video Indexing and Search

Experimenting VIREO-374: Bag-of-Visual-Words and Visual-Based Ontology for Semantic Video Indexing and Search Experimenting VIREO-374: Bag-of-Visual-Words and Visual-Based Ontology for Semantic Video Indexing and Search Chong-Wah Ngo, Yu-Gang Jiang, Xiaoyong Wei Feng Wang, Wanlei Zhao, Hung-Khoon Tan and Xiao

More information

By Suren Manvelyan,

By Suren Manvelyan, By Suren Manvelyan, http://www.surenmanvelyan.com/gallery/7116 By Suren Manvelyan, http://www.surenmanvelyan.com/gallery/7116 By Suren Manvelyan, http://www.surenmanvelyan.com/gallery/7116 By Suren Manvelyan,

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

A Comparison of l 1 Norm and l 2 Norm Multiple Kernel SVMs in Image and Video Classification

A Comparison of l 1 Norm and l 2 Norm Multiple Kernel SVMs in Image and Video Classification A Comparison of l 1 Norm and l 2 Norm Multiple Kernel SVMs in Image and Video Classification Fei Yan Krystian Mikolajczyk Josef Kittler Muhammad Tahir Centre for Vision, Speech and Signal Processing University

More information

Supervised learning. y = f(x) function

Supervised learning. y = f(x) function Supervised learning y = f(x) output prediction function Image feature Training: given a training set of labeled examples {(x 1,y 1 ),, (x N,y N )}, estimate the prediction function f by minimizing the

More information

Discriminative classifiers for image recognition

Discriminative classifiers for image recognition Discriminative classifiers for image recognition May 26 th, 2015 Yong Jae Lee UC Davis Outline Last time: window-based generic object detection basic pipeline face detection with boosting as case study

More information

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Tetsu Matsukawa Koji Suzuki Takio Kurita :University of Tsukuba :National Institute of Advanced Industrial Science and

More information

Learning Visual Semantics: Models, Massive Computation, and Innovative Applications

Learning Visual Semantics: Models, Massive Computation, and Innovative Applications Learning Visual Semantics: Models, Massive Computation, and Innovative Applications Part II: Visual Features and Representations Liangliang Cao, IBM Watson Research Center Evolvement of Visual Features

More information

Rushes Video Segmentation Using Semantic Features

Rushes Video Segmentation Using Semantic Features Rushes Video Segmentation Using Semantic Features Athina Pappa, Vasileios Chasanis, and Antonis Ioannidis Department of Computer Science and Engineering, University of Ioannina, GR 45110, Ioannina, Greece

More information

Automatic Categorization of Image Regions using Dominant Color based Vector Quantization

Automatic Categorization of Image Regions using Dominant Color based Vector Quantization Automatic Categorization of Image Regions using Dominant Color based Vector Quantization Md Monirul Islam, Dengsheng Zhang, Guojun Lu Gippsland School of Information Technology, Monash University Churchill

More information

Content-Based Multimedia Information Retrieval

Content-Based Multimedia Information Retrieval Content-Based Multimedia Information Retrieval Ishwar K. Sethi Intelligent Information Engineering Laboratory Oakland University Rochester, MI 48309 Email: isethi@oakland.edu URL: www.cse.secs.oakland.edu/isethi

More information

Scene Recognition using Bag-of-Words

Scene Recognition using Bag-of-Words Scene Recognition using Bag-of-Words Sarthak Ahuja B.Tech Computer Science Indraprastha Institute of Information Technology Okhla, Delhi 110020 Email: sarthak12088@iiitd.ac.in Anchita Goel B.Tech Computer

More information

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies

Deep Learning. Deep Learning. Practical Application Automatically Adding Sounds To Silent Movies http://blog.csdn.net/zouxy09/article/details/8775360 Automatic Colorization of Black and White Images Automatically Adding Sounds To Silent Movies Traditionally this was done by hand with human effort

More information

Large scale object/scene recognition

Large scale object/scene recognition Large scale object/scene recognition Image dataset: > 1 million images query Image search system ranked image list Each image described by approximately 2000 descriptors 2 10 9 descriptors to index! Database

More information

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it What is an Event? Dictionary.com definition: something that occurs in a certain place during a particular

More information

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce

Object Recognition. Computer Vision. Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce Object Recognition Computer Vision Slides from Lana Lazebnik, Fei-Fei Li, Rob Fergus, Antonio Torralba, and Jean Ponce How many visual object categories are there? Biederman 1987 ANIMALS PLANTS OBJECTS

More information

arxiv: v1 [cs.mm] 12 Jan 2016

arxiv: v1 [cs.mm] 12 Jan 2016 Learning Subclass Representations for Visually-varied Image Classification Xinchao Li, Peng Xu, Yue Shi, Martha Larson, Alan Hanjalic Multimedia Information Retrieval Lab, Delft University of Technology

More information

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi hrazvi@stanford.edu 1 Introduction: We present a method for discovering visual hierarchy in a set of images. Automatically grouping

More information

Comparing Local Feature Descriptors in plsa-based Image Models

Comparing Local Feature Descriptors in plsa-based Image Models Comparing Local Feature Descriptors in plsa-based Image Models Eva Hörster 1,ThomasGreif 1, Rainer Lienhart 1, and Malcolm Slaney 2 1 Multimedia Computing Lab, University of Augsburg, Germany {hoerster,lienhart}@informatik.uni-augsburg.de

More information

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Introduction to object recognition Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Overview Basic recognition tasks A statistical learning approach Traditional or shallow recognition

More information

Structured Models in. Dan Huttenlocher. June 2010

Structured Models in. Dan Huttenlocher. June 2010 Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies

More information

Consistent Line Clusters for Building Recognition in CBIR

Consistent Line Clusters for Building Recognition in CBIR Consistent Line Clusters for Building Recognition in CBIR Yi Li and Linda G. Shapiro Department of Computer Science and Engineering University of Washington Seattle, WA 98195-250 shapiro,yi @cs.washington.edu

More information

Class 5: Attributes and Semantic Features

Class 5: Attributes and Semantic Features Class 5: Attributes and Semantic Features Rogerio Feris, Feb 21, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Project

More information

Efficient Kernels for Identifying Unbounded-Order Spatial Features

Efficient Kernels for Identifying Unbounded-Order Spatial Features Efficient Kernels for Identifying Unbounded-Order Spatial Features Yimeng Zhang Carnegie Mellon University yimengz@andrew.cmu.edu Tsuhan Chen Cornell University tsuhan@ece.cornell.edu Abstract Higher order

More information

Visual Query Suggestion

Visual Query Suggestion Visual Query Suggestion Zheng-Jun Zha, Linjun Yang, Tao Mei, Meng Wang, Zengfu Wang University of Science and Technology of China Textual Visual Query Suggestion Microsoft Research Asia Motivation Framework

More information

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Akitsugu Noguchi and Keiji Yanai Department of Computer Science, The University of Electro-Communications, 1-5-1 Chofugaoka,

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

CLASSIFICATION Experiments

CLASSIFICATION Experiments CLASSIFICATION Experiments January 27,2015 CS3710: Visual Recognition Bhavin Modi Bag of features Object Bag of words 1. Extract features 2. Learn visual vocabulary Bag of features: outline 3. Quantize

More information

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14

Announcements. Recognition. Recognition. Recognition. Recognition. Homework 3 is due May 18, 11:59 PM Reading: Computer Vision I CSE 152 Lecture 14 Announcements Computer Vision I CSE 152 Lecture 14 Homework 3 is due May 18, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying Images Chapter 17: Detecting Objects in Images Given

More information