Xian-Sheng Hua ( 华先胜 )

Size: px

Start display at page:

Download "Xian-Sheng Hua ( 华先胜 )"

Moses Brice Hart
5 years ago
Views:

1 Machine Learning and Applications Workshop 2009 Xian-Sheng Hua ( 华先胜 ) Lead Researcher, Media Computing Group, Microsoft Research Asia Nanjing China Nov 7-8, 2009

2 Introduce research problems in multimedia search area that are related to machine learning Introduce exemplary research efforts of my team along this direction (how we find/analyze/formulate/solve the problems using machine learning in this area) 2

3 Introduction Internet Multimedia Search Machine Learning in Internet Multimedia Search and Mining Learning in internet multimedia indexing Learning in internet multimedia ranking Learning in internet multimedia mining Discussion Microsoft Confidential 3

4 How many images on the Internet? And how about videos? 4

5 4 billion (June 2009) 128 years to view all of them (1s per image) ~4000 uploads/minute 2% Internet users visit Daily time on site: 4.7 minutes 200 million (July 2009 ) 1000 years to see all of them ~20 hours uploaded/minute 20% Internet users visit Daily time on site: 23 minutes 2007 bandwidth = entire Internet in 2000 March 2008: bandwidth cost US$1M a day 12% copyright issue 15 billion (April 2009 ) 480 years to view all of them (1s per image) ~22000 uploads/minute 24% Internet users visit Daily time on site: 30 minutes 5

6 Search Navigation Sharing Authoring/Editing Copy Detection Recommendation Tagging Mining Streaming Summarization Visualization Advertising Categorization Forensics Media on Mobiles 6

7 Query Interface Ranking Indexing User Query Index Data Results Presentation 7

8 Indexing Understand and describe the content, and then build index Ranking Order relevant results according to the relevance to the query/intent Query Interface How to effectively express the query input/intent Result Presentation How to effectively present the search results 8

9 Introduction Internet Multimedia Search Machine Learning in Internet Multimedia Search and Mining Learning in internet multimedia indexing Learning in internet multimedia ranking Learning in internet multimedia mining Discussion Microsoft Confidential 9

10 Q.-J. Qi, X.-S Hua, Y. Rui, et al. Correlative Multi-Label Video Annotation. ACM Multimedia 2007 (Best Paper Award). 10

11 Automatic 90 Pure Content Based (QBE) Learning-Based Recently Automated Annotation Manual Manual Labeling Year 11

12 A typical strategy Individual Concept Detection Annotate multiple concepts separately Low-Level Features Outdoor Face Person People- Marching Road Walking- Running -1 / 1-1 / 1-1 / 1-1 / 1-1 / 1-1 / 1 12

13 Person Street Building Beach Mountain Crowd Outdoor Walking/Running? Marching 13

14 Another typical strategy Fusion-Based Context Based Concept fusion (CBCF) Low-Level Features Outdoor Face Person Road People- Marching Walking- Running Score Score Score Score Score Score -1 / 1-1 / 1-1 / 1-1 / 1-1 / 1-1 / 1 Concept Model Vector Concept Fusion -1 / 1-1 / 1-1 / 1-1 / 1-1 / 1-1 / 1 S. Godbole and S. Sarawagi. Discriminative methods for multi-labeled classification. PAKDD,

15 Low-Level Features Outdoor Face Person People- Marching Road Walking- Running Score Score Score Score Score Score Concept Model Vector Concept Fusion -1 / 1-1 / 1-1 / 1-1 / 1-1 / 1-1 / 1 15

16 A better strategy Integrated Concept Detection Correlative Multi-Label Learning (CML) Low-Level Features Outdoor Face Person People- Marching Road Walking- Running -1 / 1-1 / 1-1 / 1-1 / 1-1 / 1-1 / 1 Guo-Jun Qi, Xian-Sheng Hua, et al. Correlative Multi-Label Video Annotation. ACM Multimedia

17 A better strategy Integrated Concept Detection Correlative Multi-Label Learning (CML) Guo-Jun Qi, Xian-Sheng Hua, et al. Correlative Multi-Label Video Annotation. ACM Multimedia

18 How to model concepts and the correlations among concept in a single step Our Strategy Converting correlations into features. Constructing a new feature vector that captures both The characteristics of concepts, and The correlations among concepts 18

19 Notations 19

20 Modeling concept and correlations 2K D K 1 20

21 Learning the classifier Misclassification Error Loss function Empirical risk Regularization Introduce slack variables Lagrange dual Find solution by SMO 21

22 Connection to Gibbs Random Field Define a random field consists of all adjacent sites, that is, this RF is fully connected is a random field Define energy function Define GRF Rewrite the classifier 22

23 Connection to Gibbs Random Field Define a random field consists of all adjacent sites, that is, this RF is fully connected is a random field Define energy function Define GRF Rewrite the classifier Intuitive explanation of CML 23

24 Experiments TRECVID 2005 dataset (170 hours) 39 concepts (LSCOM-Lite) Training (65%), Validation (16%), Testing (19%) 24

25 Experiments TRECVID 2005 dataset (170 hours) 39 concepts (LSCOM-Lite) Training (65%), Validation (16%), Testing (19%) CML (MAP=0.290) improves IndSVM (MAP=0.246) 17% and CBCF (MAP=0.253) 14% SVM CML 17 CBCF CML 14 SVM CBCF CM 25

26 Experiments TRECVID 2005 dataset (170 hours) 39 concepts (LSCOM-Lite) Training (65%), Validation (16%), Testing (19%) CML (MAP=0.290) improves IndSVM (MAP=0.246) 17% and CBCF (MAP=0.253) 14% SVM CML 131% CBCF CML 128% SVM CBCF CML 26

27 Experiments TRECVID 2005 dataset (170 hours) 39 concepts (LSCOM-Lite) Training (65%), Validation (16%), Testing (19%) CML (MAP=0.290) improves IndSVM (MAP=0.246) 17% and CBCF (MAP=0.253) 14% SVM CBCF CML SVM CBCF CML SVM CBCF CML 27

28 Experiments TRECVID 2005 dataset (170 hours) 39 concepts (LSCOM-Lite) Training (65%), Validation (16%), Testing (19%) CML (MAP=0.290) improves IndSVM (MAP=0.246) 17% and CBCF (MAP=0.253) 14% 28

29 Multi-Instance Multi-Label Learning Zhengjun Zha, Xian-Sheng Hua, et al. Joint Multi-Label Multi-Instance Learning for Image Classification. CVPR Semi-Supervised Multi-Label Learning Jingdong Wang, Yinghai Zhao, Xiuqing Wu, Xian-Sheng Hua: Transductive multi-label learning for video concept detection. Multimedia Information Retrieval 2008 Multi-Label Active Learning Guo-Jun Qi, Xian-Sheng Hua, et al. Two-Dimensional Active Learning for Image Classification. CVPR Online Multi-Label Active Learning Xian-Sheng Hua, Guo-Jun Qi. Online Multi-Label Active Annotation. ACM Multimedia Guo-Jun Qi, Xian-Sheng Hua, et al. Two-Dimensional Multi-Label Active Learning with An Efficient Online Adaptation Model for Image Classification. T-PAMI

30 Sky Water Mountain Sands Scenery Zhengjun Zha, Xian-Sheng Hua, et al. Joint Multi-Label Multi-Instance Learning for Image Classification. CVPR

31 Multi-Label Learning Multi-Instance Learning Multi-Label Multi-Instance Learning Zhengjun Zha, Xian-Sheng Hua, et al. Joint Multi-Label Multi-Instance Learning for Image Classification. CVPR

32 Notations input pattern image label region label 32

33 Formulation: Hidden Conditional Random Field Association between a region and its label Spatial relation between region labels ( is neighbor or not) Coherence between region and image labels (MIL) Correlations of image labels (MLL) Learning: EM Inference: Maximum Posterior Marginals (MPM) Zhengjun Zha, Xian-Sheng Hua, et al. Joint Multi-Label Multi-Instance Learning for Image Classification. CVPR

34 34

35 Outdoor Water Sea People Crowd Sky Cloud Y Y Y N N Y Y Y N N N N Y N Y N N Y Y N N N N N Y N N N (Single-Label Active Learning for Multi-Label Problems) 35

36 Outdoor Water Sea People Crowd Sky Cloud Y N Y N N Y N Y N Y N N N Y N Guo-Jun Qi, Xian-Sheng Hua, et al. Two-Dimensional Active Learning for Image Classification. CVPR

37 Sample dimension Label dimension x 1 x 2 x 3 x 4 +???????-? +???? To minimize multi-labeled Bayesian classification error. Sample-label selection criterion: m (, ) arg max (, ) ( ;, ) xs ys H ys yl( xs) xs MI yi ys yl( xs) xs xs P, ys U( xs ) i 1, i s Self entropy Mutual information between labels Guo-Jun Qi, Xian-Sheng Hua, et al. Two-Dimensional Active Learning for Image Classification. CVPR

Guo-Jun Qi. Online Multi-Label Active Annotation. ACM Multimedia 2008. Guo-Jun Qi, Xian-Sheng Hua, et al.

38 Online Learner Preserve existing knowledge Comply with the new coming examples Old Multi-Label Model Preserve existing knowledge New Multi-Label Model Comply with the new examples New coming examples Xian-Sheng Hua, Guo-Jun Qi. Online Multi-Label Active Annotation. ACM Multimedia Guo-Jun Qi, Xian-Sheng Hua, et al. Two-Dimensional Multi-Label Active Learning with An Efficient Online Adaptation Model for Image Classification. T-PAMI

39 Minimize KLD between old model and new one ˆ 1 1 ( y x) arg min KL( ( y x) ( y x)) 1 P P D P p P Comply with multi-label constraints s. t. y y,1 i m i 1 i i P P y y y y,1 i j m 1 i j P i j P ij y x y x 1,1 i m,1 l d i l P i l P il y P 1 ( y x) 1 C n n n i ij il i 2 i j 2 i, l 2 Xian-Sheng Hua, Guo-Jun Qi. Online Multi-Label Active Annotation. ACM Multimedia Guo-Jun Qi, Xian-Sheng Hua, et al. Two-Dimensional Multi-Label Active Learning with An Efficient Online Adaptation Model for Image Classification. T-PAMI

40 Online vs. Offline On multi-label scene dataset: 2407 images, 6 labels Performance is very close (F1 score differences are less than 0.001) TRECVID Data 40

41 Single-Label vs. Multi-Label Active Learning On TRECVID dataset: 2006 Dev set, shots, 39 concepts Initial labeling: 10000, each step sample-label pairs 41

42 Adding new labels On TRECVID dataset: 2006 Dev set, shots, 39 concepts Initial labeling: 10000, each step sample-label pairs Building Explosion-Fire 42

43 Introduction Internet Multimedia Search Machine Learning in Internet Multimedia Search and Mining Learning in internet multimedia indexing Learning in internet multimedia ranking Learning in internet multimedia mining Discussion Microsoft Confidential 43

44 Content-Aware Ranking Content-Aware Reranking Text based Ranking 44

45 X. Tian, L. Yang, J. Wang, Y. Yang, X. Wu, X.-S. Hua. Bayesian Video Search Reranking. ACM Multimedia

46 The difficulty of text based search Incorrect surrounding text Ambiguity of words 46

47 Reorder the ranking list from visual consistency Based on initial text-based search results To exploit the intrinsic visual pattern 47

48 Existing methods Pseudo relevance feedback [Yan,03][Liu, 08] Selecting training samples from initial result Training a ranking model Random walk [Hsu, 07][Liu, 07] Propagating the scores Problems: Key problems not well studied How to use the initial text-based search result How to mine the visual pattern 48

49 Objective To maximize the posterior probability of the ranking list Images/video shots in the rank list initial rank list. The prior The likelihood 49

50 Visually similar samples have close ranks min E( r) r 50

51 Point-Wise Distance Dist(r 1,r 0 ) = 0.63 Dist(r 3, r 0 ) =

52 Pair-Wise Distance Simple method: Count the number of inconsistent pairs Dist(r 1,r 0 ) = 10 Dist(r 3, r 0 ) = 0 52

53 Absolute Rank Difference The difference of two ranks Absolute-Difference Distance Reranking Objective function Solved using quadratic programming 53

54 Relative Rank Difference Relative-Difference Distance Reranking Objective function Solved by Matrix Analysis 54

55 Method TRECVID 2006 TRECVID 2007 MAP Gain MAP Gain Text Baseline Random Walk % % Ranking SVM PRF % % GRF [Zhu, 03] % % LGC [Zhou, 03] % % AD Reranking % % RD Reranking % % Dataset TRECVID 2006 & 2007 Text Baseline Visual Feature Measurement Okapi BM-25 5*5 ColorMoment Average Precision 55

56 Initial Reranked 56

57 Initial Reranked 57

58 Bo Geng, Linjun Yang, Xian-Sheng Hua. Learning to Rank with Graph Consistency. MSR Technical Report,

59 Assumptions Visual Consistency The ranks among visually similar images should be consistent Ranking Consistency The reranked results and the initial text-based one should not be too different Two-step solution: text-based search and then content-based reranking 59

60 Why Content-Aware Ranking Textual information is often noisy Reranking suffer from error propagation and depends severely on initial results. Solution Introduce the visual consistency into the ranking model Bo Geng, Linjun Yang, Xian-Sheng Hua. Learning to Rank with Graph Consistency. MSR Technical Report, 2009.

61 Solution Based on Structural SVM framework Cutting-plane algorithm to solve the optimization problem Results

62 Introduction Internet Multimedia Search Machine Learning in Internet Multimedia Search and Mining Learning in internet multimedia indexing Learning in internet multimedia ranking Learning in internet multimedia mining Discussion Microsoft Confidential 62

63 L. Wu, X.-S. Hua, N. Yu, W.-Y. Ma, S. Li. Flickr Distance. ACM Multimedia 2008 (Best Paper Candidate).

64 Multimedia Information Recommendation Annotation Clustering Indexing Ranking Retrieval 64

65 Indexing Annotation Recommendation Image Similarity/ Multimedia Distance Information Retrieval Concept Similarity/ Distance Ranking Clustering 65

66 Image Similarity/Distance Image Similarity/ Distance Concept Similarity/ Distance 66

67 Image Similarity/Distance Numerous efforts have been made. Concept Similarity/Distance Concept Similarity/ Distance 67

68 Image Similarity/Distance Numerous efforts have been made. Concept Similarity/Distance Olympic Cat Sports Paw Tiger More and more used, but not well studied. 68

69 WordNet Distance Google Distance Tag Concurrence Distance 69

70 WordNet 150,000 words WordNet Distance Quite a few methods to get it in WordNet Basic idea is to measure the length of the path between two words Pros and Cons Pros: Built by human experts, so close to Cons: human perception Coverage is limited and difficult to extend 70

to get and huge coverage Only reflects concurrency in textual

71 Normalized Google Distance (NGD) Reflects the concurrency of two words in Web documents Defined as Pros and Cons Pros: Cons: Easy to get and huge coverage Only reflects concurrency in textual documents. Not really concept distance (semantic relationship) 71

72 Concept Pairs Google Distance Airplane Dog Football Soccer Horse Donkey Airplane Airport Car Wheel

Mostly is sparse (> 95% are zero in the similarity matrix) Pros and Cons Pros: Cons: Images are taken

73 Image Tag Concurrence Distance (Qi, Hua, et al. ACMMM07) Reflects the frequency of two tags occur in the same images Based on the same idea of NGD Mostly is sparse (> 95% are zero in the similarity matrix) Pros and Cons Pros: Cons: Images are taken into account a) Tags are sparse so visual concurrency is not well reflected b) Training data is difficult to get similarity matrix: 500 tags similarity matrix: 50 tags 73

Concept Pairs GoogleTag Distance Concurrence Distance Airplane Dog 0.2562 Football Soccer 0.

74 Concept Pairs GoogleTag Distance Concurrence Distance Airplane Dog Football Soccer Horse Donkey Airplane Airport Car Wheel

Visually Similar similar things or things of same type

75 table tennis pingpong horse donkey car wheel airplane airport Synonymy different words but the same meaning Visually Similar similar things or things of same type Meronymy part and the whole Concurrency exist at the same scene/place 75

76 Can we mine concept distance from image content? 76

Semantic concept distance is based on human

80% of human cognition comes from visual

There are around 4 billion photos on Flickr

Flickr image has around 8 tags bear, fur,

77 Semantic concept distance is based on human s cognition To mine concept distance from a 80% of human cognition comes from visual information large tagged image collection There are around 4 billion photos on Flickr based on image content In average each Flickr image has around 8 tags bear, fur, grass, tree polar bear, water, sea polar bear, fighting, usa 77

78 Concept A: Airplane Concept B: Airport Concept Model A Concept Model B Flickr Distance (A, B) 78

FlickrDistance 0.5151 0.0315 0.4231 0.0576 0.

79 FlickrDistance Flickr Distance is able to cover the four different semantic relationships Synonymy, Visually Similar, Meronymy, and Concurrency 79

80 R1: A Good Image Collection Large High coverage, especially on real-life world With tags Real-life photos reflect human s perception Based on collective intelligence from grassroots 80

81 R2: A Good Concept Representation or Model Based on image content Can cover wider concept relationships Can handle large-concept set Concept Models Discriminative SVM, Boosting, Generative Global Feature Local Feature w/o Spatial Relation w/ Spatial Relation Bag-of-Words (plsa, LDA), 2D HMM, MRF, 81

82 VLM Visual Language Model Spatial-relation sensitive Efficient Concept Models Can handle object variations Discriminative Generative Global Feature Local Feature SVM, Boosting, w/o Spatial Relation w/ Spatial Relation Bag-of-Words, 2D HMM, MRF, 82

83 I am talking about statistical language model. Unigram Model Bigram Model Trigram Model P P P wx w w 1 2 wn P wx w w w w p w w x 1 2 n x x 1 w w w w P w w w x 1 2 n x x 1 x 2 83

84 Visual Word Generation Patch Gradient Texture Histogram Hashing Visual Word Image Patch Unigram Model Bigram Model Trigram Model P w w w w P w xy mn w w w w P w w P xy mn xy x 1, y P w w w w p w w w xy mn xy x 1, y x, y 1 xy Trigram VLM is estimated by directly counting from sufficient samples of each category. To avoid the bias in the sampling, back-off smoothing method is adopted. 84

85 Why Latent-Topic Latent-Topic VLM Visual variations of concept are taken as latent topics 85

86 Comparison on Image Categorization Caltech 8 categories / 5097 images Accuracy (%) plsa (BOW) LDA (BOW) 2D MHMM SVM VLM LT-VLM Training Time (sec/image) plsa (BOW) LDA (BOW) 2D MHMM SVM VLM LT-VLM 86

87 Kullback Leibler (KL) divergence Good, but not symmetric topic distance Jensen Shannon (JS) divergence Better, as it is symmetric And, square root of JS divergence is a metric, so is Flickr Distance topic distance concept distance 87

88 Concept A: Airplane Tag search in Flickr Concept B: Airport Concept Model A LT-VLM Concept Model B Flickr Distance (A, B) Jensen-Shannon Divergence 88

89 Concept clustering Image annotation Tag recommendation 89

90 Concept Clustering 23 concepts; 3 groups (1) outer space, (2) animal and (3) sports Normalized Google Distance Tag Concurrence Distance Flickr Distance Group1 Group2 Group3 Group 1 Group2 Group3 Group1 Group2 Group3 bears horses moon space bowling dolphin donkey Saturn sharks snake softball spiders turtle Venus whale wolf baseball basketball football golf soccer tennis volleyball moon space Venus whale baseball donkey softball wolf basketball bears bowling dolphin football golf horses Saturn sharks soccer spiders tennis turtle volleyball moon Saturn space Venus bears dolphin donkey golf horses sharks spiders tennis whale wolf baseball basketball football snake soccer bowling softball volleyball 90

91 Total number of correct keywords Based on an approach using concept relation Dual Cross-Media Relevance Model (DCMRM, J. Liu et al. ACMMM 2007) On 79 concepts / 79,000 images 1200 NGD-DCMRM TCD-DCMRM FD-DCMRM NGD-DCMRM TCD-DCMRM FD-DCMRM The number of correctly annotated keywords at the first N words 91

92 To Improve Tagging Quality Eliminating tag incompletion, noises, and ambiguity 500 images / 10 recommended tags per image NGD Tag Concurrent Distance Flickr Distance 10 92

93 Why VLM divergence can estimate concept distance? Why FD works well even tags are not complete? Computer VLM: distribution of uni/bi/trigrams room patterns computer patterns other patterns TV room patterns TV patterns other patterns Office room patterns screen patterns other patterns 93

94 If we find similar patterns in the images associated with different concepts, the corresponding concept relationships can be discovered. Computer Office 94

95 95

96 Intention/Semantic Gap User Query Index Data Intention Gap Semantic Gap 96

97 Data Data+User+ Model Large-Scale Users Model User 97

98 Basic Assumptions of all automatic and semi-automatic approaches (model driven / data driven approaches) Data Tag Images have correlations Tags have correlations Tags have correlation with image content Semantics

99 Data Tag Data Tag Semantics Semantics For Model-able Tags For Un-model-able Tags 99

100 This slide is borrowed from Lei Zhang (MSR Asia) 100

101 This slide is borrowed from Lei Zhang (MSR Asia)

102 Data Tag Data Tag Semantics Semantics For Model-able Tags For Un-model-able Tags 102

103 We need Large amounts of tagged data Large amounts of users to do labeling Can be regarded as a combination of Data Tag Manual labeling Model based annotation Data driven approach Semantics 103

104 Research Challenges: Large-scale content-based indexing Large-scale active learning Large-scale online learning Data Tag Training data acquisition/analysis o Labeling psychology and incentive o Labeling quality estimation/evaluation o Label quality improvement/filtering o Labeling Interface o Anti-spam/cheating in labeling o Semantics 104

105 Machine learning has been one of the most effective approaches to enable content-aware multimedia search, including indexing, ranking, query interface and search result navigation Still difficult to bridge the semantic gap and intention gap through machine learning Combining data, user and model may be one possible way out 105

106 106

Columbia University High-Level Feature Detection: Parts-based Concept Detectors

TRECVID 2005 Workshop Columbia University High-Level Feature Detection: Parts-based Concept Detectors Dong-Qing Zhang, Shih-Fu Chang, Winston Hsu, Lexin Xie, Eric Zavesky Digital Video and Multimedia Lab