From Pixels to Information: Recent Advances in Visual Search

1 From Pixels to Information: Recent Advances in Visual Search. Bernd Girod, Stanford University

3 Augmented Reality

4 Augmented Reality

5 Future: Smart Contact Lenses. Sight: Contact Lenses with Augmented Reality [E. May-raz and D. Lazo, 2012]

6 Recognizing What the User Sees. The Touring Machine [Feiner et al., 1997]

7 Stanford Landmark Recognition (2007). G. Takacs et al., ACM MIR

8 Recognizing Objects

9 Who's who? Thomas Hill, Jane Stanford, Leland Stanford, Jr.

11 Image-based Retrieval. Levon Helm: Dirt Farmer, $13.85

12 Outline. Review: computer vision for image-based retrieval (invariant local image features (SIFT); matching feature descriptors). MPEG CDVS Standard: Compact Descriptors for Visual Search (CDVS framework and pipeline; Fisher vectors as global descriptors). Current research directions (query-by-image video retrieval; interframe compression of local and global descriptors).

13 Standing on the Shoulders of: intelligent signal processing; system architecture; coding & communication; human interface

14 Local Image Features. Vectors that describe local patterns in a way that is both distinctive and invariant to brightness changes, contrast changes, shift in x,y, scale change, rotation, and (affine distortion). Scale Invariant Feature Transform (SIFT) [Lowe, 1999, 2004]

15 Local Features: Keypoint Detection. Grayscale or color image → DoG filter (σ = scale) → response in scale space (axes: x, y, scale)

16 Local Features: Keypoint Detection. Response in scale space (axes: x, y, scale) → detect extrema → extrema in scale space → oriented feature keypoints
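
The detection step on slides 15 and 16 can be sketched in a few lines. This is a simplified illustration, not the SIFT or CDVS implementation: it works on a 1-D scanline instead of a 2-D image, and the function names, the 3σ kernel radius, and the scale values are assumptions made for the example.

```python
import math

def gaussian_kernel(sigma):
    """Sampled, normalized 1-D Gaussian truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, sigma):
    """Convolve with a Gaussian, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def dog_extrema(signal, sigmas):
    """Return (position, scale_index) pairs that are strict extrema of the
    difference-of-Gaussians response among their neighbours in x and scale."""
    blurred = [blur(signal, s) for s in sigmas]
    dog = [[b - a for a, b in zip(blurred[i], blurred[i + 1])]
           for i in range(len(sigmas) - 1)]
    keypoints = []
    for s in range(1, len(dog) - 1):          # interior scales only
        for x in range(1, len(signal) - 1):   # interior positions only
            centre = dog[s][x]
            neighbours = [dog[s + ds][x + dx]
                          for ds in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (ds, dx) != (0, 0)]
            if centre > max(neighbours) or centre < min(neighbours):
                keypoints.append((x, s))
    return keypoints

# A bright 5-sample blob centred at position 30 on a dark scanline.
scanline = [0.0] * 61
for i in range(28, 33):
    scanline[i] = 1.0
print(dog_extrema(scanline, [1.0, 2.0, 4.0, 8.0]))
```

The blob centre shows up as an extremum at the scale that best matches the blob width, which is the property that makes the detected keypoints scale-invariant.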

17 Local Features: Descriptor Computation. Oriented feature keypoints → canonical image patches → image gradients → gradient orientation histograms

18 Matching Local Feature Descriptors. Examples: Numerical Recipes in C; SF City Hall, 400 Van Ness Ave., (415)
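
A minimal sketch of descriptor matching follows. The slide does not say which matching rule is used; the nearest-neighbour search with a distance-ratio test shown here follows Lowe's SIFT paper cited on slide 14, and the function names, the brute-force search, and the 0.8 threshold are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(query, database, max_ratio=0.8):
    """Return (query_index, database_index) pairs whose nearest neighbour
    is clearly closer than the second-nearest one (ratio test)."""
    matches = []
    for qi, q in enumerate(query):
        ranked = sorted((euclidean(q, d), di) for di, d in enumerate(database))
        if len(ranked) < 2:
            continue  # the ratio test needs two candidates
        (best, best_idx), (second, _) = ranked[0], ranked[1]
        if best < max_ratio * second:
            matches.append((qi, best_idx))
    return matches

# Toy 2-D "descriptors"; real SIFT descriptors have 128 dimensions.
database = [[0.1, 0.0], [10.0, 10.0], [4.0, 6.0]]
query = [[0.0, 0.0], [5.0, 5.0], [7.5, 7.5]]
print(match_descriptors(query, database))  # [(0, 0), (1, 2)]
```

The third query descriptor lies almost equally far from two database descriptors, so the ratio test rejects it as ambiguous. Surviving matches would normally still go through geometric verification.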

19 Mobile Augmented Reality. Diagram labels: Client (send query frame; track camera pose; compensate camera pose; display ID and draw boundary), Network, Server (extract features; query VocTree; check geometry; send ID and geometry); time axis with high-motion and low-motion phases. Example: John Mayer, Inside Wants Out

20 Media Cover Recognition (Nokia N95 smartphone)

21 Recognizing Books on a Shelf (Motorola Droid smartphone)

22 Architecture A: Send Image. Client: camera; the image is sent over the wireless network (20 kbps → 20 sec). Server: feature extraction, feature matching; information is returned to the client. Examples: Numerical Recipes in C; SF City Hall, 400 Van Ness Ave., (415) 554-6079
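
The "20 kbps → 20 sec" figure implies an image payload of about 50 KB (20 kbit/s × 20 s = 400 kbit). The short calculation below reproduces that number and, for comparison, applies the same link rate to the descriptor sizes listed on the CDVS pipeline slide (512 bytes to 16K bytes). The 50 KB image size is inferred from the slide's numbers rather than stated on it, and 1K is taken as 1024 bytes.

```python
def transmission_time_seconds(payload_bytes, link_kbps):
    """Time to send payload_bytes over a link of link_kbps (1 kbps = 1000 bit/s)."""
    return payload_bytes * 8 / (link_kbps * 1000)

# Image upload as in Architecture A (payload size inferred, see above).
print(transmission_time_seconds(50_000, 20))  # 20.0

# Descriptor upload at the same link rate.
for size in (512, 1024, 2048, 4096, 8192, 16384):
    print(size, "bytes:", round(transmission_time_seconds(size, 20), 2), "s")
```

At the same uplink rate, even the largest descriptor takes about a third of the time needed for the image, and the smallest takes a fraction of a second.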

23 Architecture B: Send Features. Client: camera, feature extraction, feature coding; features are sent over the wireless network. Server: feature matching; information is returned to the client. Examples: Numerical Recipes in C; SF City Hall, 400 Van Ness Ave., (415)

24 Architecture C: Features on Mobile Device. Client: camera, feature extraction, feature matching; features and information are exchanged over the wireless network. Examples: Numerical Recipes in C; SF City Hall, 400 Van Ness Ave., (415)

25 CDVS Standardization. The Moving Picture Experts Group (MPEG, ISO/IEC JTC1 SC29 WG11) initiated the Compact Descriptors for Visual Search (CDVS) standard activity at the 91st MPEG meeting (Kyoto, Jan. 2010). Final Draft of International Standard (FDIS)

26 CDVS Evaluation Framework. Image categories: graphics, paintings, video frames, landmarks, common objects

27 1M Distractor Images

28 CDVS Pipeline. Keypoints: LoG peaks. Feature selection: statistically optimized based on peak response, scale, location. Descriptor: SIFT descriptor, followed by a non-orthogonal transform + quantization. The xy-location is needed for object location (and geometric verification). Byte counts shown: 304, 384, 404, 1117, 1117, 1117 bytes; query sizes: 512, 1K, 2K, 4K, 8K, 16K bytes

29 Local Feature Descriptor Aggregation. Nearest-neighbor matching of variable-size sets of local features is costly. Instead, compare images based on a global binary signature of constant size ("hash"). Naïve: VQ of feature vectors to generate a histogram, then compare non-empty histogram bins ("bag of features", "bag of visual words"). Better: binarize the gradient of the log-likelihood w.r.t. the parameter vector ("Fisher vector")
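
The "naïve" aggregation on this slide can be sketched as follows. The vocabulary, the toy 2-D descriptors, and the similarity measure (counting visual words present in both images) are assumptions chosen to keep the example small; a real system would train a much larger vocabulary on 128-dimensional descriptors.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry (visual word) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: sum((x - c) ** 2
                                 for x, c in zip(descriptor, vocabulary[k])))

def bag_of_words(descriptors, vocabulary):
    """Vector-quantize each descriptor and count occurrences per visual word."""
    histogram = [0] * len(vocabulary)
    for d in descriptors:
        histogram[nearest_word(d, vocabulary)] += 1
    return histogram

def shared_nonempty_bins(hist_a, hist_b):
    """Number of visual words that occur in both images."""
    return sum(1 for a, b in zip(hist_a, hist_b) if a > 0 and b > 0)

vocabulary = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
image_a = [[1.0, 0.0], [9.0, 1.0], [0.5, 0.5]]
image_b = [[0.0, 9.0], [1.0, 1.0]]

hist_a = bag_of_words(image_a, vocabulary)
hist_b = bag_of_words(image_b, vocabulary)
print(hist_a, hist_b, shared_nonempty_bins(hist_a, hist_b))  # [2, 1, 0] [1, 0, 1] 1
```

Both images map to histograms of the same length regardless of how many local features they contain, which is what makes the comparison cheap.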

30 Fisher Vector. Discriminative score function: the gradient of the log-likelihood of a d-dimensional feature vector with respect to the model parameters. Typically, we use a Gaussian mixture model (GMM); its parameters are the means (and variances) of the Gaussian clusters. For a GMM, the feature scores U(X) are soft-assigned distance vectors (and squared distance vectors) relative to the cluster centers. The sum of the feature scores of an image is the Fisher vector, which can be used to compare images. Binarization and Hamming-distance comparison result in only a minor performance loss ("Binarized Fisher vector")
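
The construction described on this slide can be illustrated with a small sketch. It assumes a GMM with diagonal covariances, uses only the gradient with respect to the means (not the variances), and binarizes by sign; the function names and toy parameters are hypothetical, and the normalization details of the CDVS global descriptor are not reproduced.

```python
import math

def posteriors(x, weights, means, sigmas):
    """Soft assignment of feature x to each diagonal-covariance Gaussian."""
    log_p = []
    for w, mu, sg in zip(weights, means, sigmas):
        lp = math.log(w)
        for xi, mi, si in zip(x, mu, sg):
            lp += (-0.5 * ((xi - mi) / si) ** 2
                   - math.log(si) - 0.5 * math.log(2 * math.pi))
        log_p.append(lp)
    top = max(log_p)                      # subtract the max for stability
    unnorm = [math.exp(lp - top) for lp in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def fisher_vector(features, weights, means, sigmas):
    """Sum over features of soft-assigned, variance-normalized offsets
    from each cluster mean (gradient w.r.t. the means only)."""
    k, d = len(means), len(means[0])
    fv = [0.0] * (k * d)
    for x in features:
        gamma = posteriors(x, weights, means, sigmas)
        for c in range(k):
            for j in range(d):
                fv[c * d + j] += gamma[c] * (x[j] - means[c][j]) / sigmas[c][j]
    return fv

def binarize(fv):
    """Keep only the sign of each Fisher vector component."""
    return [1 if v > 0 else 0 for v in fv]

def hamming(a, b):
    """Number of positions in which two binary signatures differ."""
    return sum(x != y for x, y in zip(a, b))

weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]

sig_a = binarize(fisher_vector([[1.0, 1.0], [11.0, 9.0]], weights, means, sigmas))
sig_b = binarize(fisher_vector([[0.5, 2.0], [12.0, 8.0]], weights, means, sigmas))
sig_c = binarize(fisher_vector([[-1.0, -1.0], [9.0, 11.0]], weights, means, sigmas))
print(sig_a, hamming(sig_a, sig_b), hamming(sig_a, sig_c))  # [1, 1, 1, 0] 0 4
```

The signature length is fixed at (number of clusters) × (descriptor dimension) bits, independent of how many local features the image has.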

31 CDVS Evolution. Average performance over all datasets and test conditions. TMuC: first reference software (based on SIFT). TM2: global descriptor ("REVV") based on Fisher vector framework introduced. TM4: Scalable Fisher Vector (SCFV). TM11: technology development complete. Algorithm memory requirements were reduced from ~400 MB to ~1 MB at the same time

32 CDVS Performance (TM11)

33 Architecture C: Features on Mobile Device. Client: camera, feature extraction, feature matching; features and information are exchanged over the wireless network. Example: SF City Hall, 400 Van Ness Ave., (415)

34 On-Device Timing Measurements. Samsung Galaxy S3 smartphone, 1.4 GHz processor, 1 GB RAM; database of 100K images. Time breakdown: feature extraction 32%, global signature database search 54%, geometric verification 14%. Histogram of queries: frequency vs. time (sec)

35 On-Device Demo. Image Video Matching Demo; database of 100K images; Samsung Galaxy S3 smartphone

36 Augmented Reality Glasses. Right-eye LCD, left-eye LCD, camera, Android controller

37 Augmented Reality Glasses

38 Augmented Reality Glasses

39 AR w/ Head-Mounted Camera [Baidu Eye, 2014]

40 Visual Search: Where Do We Go From Here? Query: Image, Database: Images (largely solved). Limitations of SIFT/CDVS framework: scale to very large databases; dense text; non-planar 3D objects. Database: Videos. Search "dark matter of the Internet"; temporal redundancy of database; asymmetric comparisons. Query: Video. Streaming augmented reality; exploit temporal redundancy of queries; database caching in mobile device; tracking of copies; leverage audio

42 Query-by-Image Video Retrieval. Applications: news videos (search event footage using photos); online education (search lecture videos using slides); brand monitoring (search web videos for product placement)

43 Fisher Vector Aggregation. Stanford I2V dataset: 3,800 hours of news videos, 229 query images [Araujo et al., ICIP 2015]

44 Asymmetric Comparisons (query images vs. database frames). The problem becomes more pronounced with temporal aggregation. Solution: omit the Fisher vector components of Gaussian clusters that the query does not visit [Araujo et al., ICIP 2015]. More Gaussian clusters might be needed to accommodate the larger number of features on the database side
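
One way to picture the "omit components the query does not visit" idea is a masked comparison of binarized signatures. The sketch below is an illustration under assumptions, not the scoring function of Araujo et al.: the per-cluster "visited" flags, the layout of bits per cluster, and the agreement-fraction score are all hypothetical.

```python
def masked_similarity(query_bits, query_visited, db_bits, bits_per_cluster):
    """Fraction of agreeing bits, counted only over the Gaussian clusters
    that the query visits; clusters seen only on the database side are ignored."""
    agree = 0
    compared = 0
    for c, visited in enumerate(query_visited):
        if not visited:
            continue
        start = c * bits_per_cluster
        stop = start + bits_per_cluster
        for q, b in zip(query_bits[start:stop], db_bits[start:stop]):
            agree += (q == b)
            compared += 1
    return agree / compared if compared else 0.0

# Three clusters with two bits each. The query has no features near cluster 1,
# while the aggregated database signature does.
query_bits = [1, 0, 0, 0, 1, 1]
query_visited = [True, False, True]
db_bits = [1, 0, 1, 1, 1, 0]
print(masked_similarity(query_bits, query_visited, db_bits, 2))  # 0.75
```

Without the mask, the two disagreeing bits of cluster 1 would lower the score to 0.5 even though the query carries no information about that cluster.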

46 Architecture B: Send Features. Client: camera, feature extraction, feature coding; features are sent over the wireless network. Server: feature matching; information is returned to the client. Examples: Numerical Recipes in C; SF City Hall, 400 Van Ness Ave., (415)

47 Interframe Compression of Features: Interframe Patch Coding. Figure: Reba keypoints in frame 1 and frame 2 along the time axis t [Makar et al., IEEE Trans. Image Processing, 2014]

48 Interframe Compression of Features: Interframe Descriptor Coding. Figure: Reba keypoints in frame 1 and frame 2 along the time axis t [Makar et al., IEEE Trans. Image Processing, 2014]

49 Interframe Compression of Features: Differential Location Coding. Figure: Reba keypoints in frame 1 and frame 2 along the time axis t [Makar et al., IEEE Trans. Image Processing, 2014]
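
The idea behind differential location coding can be sketched as plain delta coding of keypoint positions. The sketch assumes that each keypoint in frame 2 has a known counterpart in frame 1, listed in the same order; the entropy coding of the offsets and the details of the scheme in Makar et al. are not shown, and the function names are hypothetical.

```python
def encode_locations(previous, current):
    """Replace each keypoint location by its offset from the same
    keypoint's location in the previous frame."""
    return [(cx - px, cy - py)
            for (px, py), (cx, cy) in zip(previous, current)]

def decode_locations(previous, deltas):
    """Rebuild the current locations from the previous frame and the offsets."""
    return [(px + dx, py + dy)
            for (px, py), (dx, dy) in zip(previous, deltas)]

frame1 = [(120, 45), (300, 210), (64, 400)]
frame2 = [(122, 44), (299, 213), (64, 401)]

deltas = encode_locations(frame1, frame2)
print(deltas)  # [(2, -1), (-1, 3), (0, 1)]
print(decode_locations(frame1, deltas) == frame2)  # True
```

Because keypoints move little between consecutive frames, the offsets are small numbers clustered around zero and are cheaper to code than absolute coordinates.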

50 Interframe Compression of Features. Plot: matches post-RANSAC vs. bit-rate (kbps). Curve labels: Send Video (H.264 video); Send Patches, independent detection; Send Patches, temporally coherent; Send Descriptors, independent detection; Send Descriptors, temporally coherent; intra-coded patches; inter-coded patches; intra-coded descriptors; inter-coded descriptors. Annotations: 10x, 4x [Makar et al., IEEE Trans. Image Processing, 2014]

51 Temporally Coherent Keypoint Detection. Conventional keypoint detection vs. temporally coherent detection. Figure: Reba keypoints, frame 1 and frame 2 [Makar et al., IEEE Trans. Image Processing, 2014]

52 Streaming MAR at ~15 kbps

53 Hybrid Query Mode. Mobile: extract local features; aggregate global descriptor; match with global descriptors in the descriptor database (mobile); perform geometric verification. Wireless network: send global descriptors in uplink; send labels and local features for top-ranked database candidates in downlink. Cloud: match with global descriptors in the descriptor database (cloud)
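
The control flow suggested by this diagram can be sketched as a cache lookup followed by a cloud fallback. This is an interpretation of the slide, not a description of the actual system: the function names, the Hamming-distance threshold, and the order "mobile database first, cloud second" are assumptions, and geometric verification with local features is left out.

```python
def hamming(a, b):
    """Number of positions in which two binary signatures differ."""
    return sum(x != y for x, y in zip(a, b))

def best_match(signature, database, max_distance):
    """database maps label -> binary global signature. Returns the closest
    label within max_distance, or None."""
    candidates = [(hamming(signature, sig), label)
                  for label, sig in database.items()]
    if not candidates:
        return None
    distance, label = min(candidates)
    return label if distance <= max_distance else None

def hybrid_query(signature, mobile_db, cloud_db, max_distance=2):
    """Return (label, source, uplink_bits). The on-device database is tried
    first; only on a miss is the global signature sent in the uplink."""
    label = best_match(signature, mobile_db, max_distance)
    if label is not None:
        return label, "mobile", 0
    label = best_match(signature, cloud_db, max_distance)
    if label is not None:
        mobile_db[label] = cloud_db[label]  # cache the downlinked candidate
    return label, "cloud", len(signature)

cloud_db = {"book": [1, 0, 1, 1, 0, 0, 1, 0], "cd": [0, 1, 0, 0, 1, 1, 0, 1]}
mobile_db = {}
frame_signature = [1, 0, 1, 1, 0, 0, 1, 1]

print(hybrid_query(frame_signature, mobile_db, cloud_db))  # ('book', 'cloud', 8)
print(hybrid_query(frame_signature, mobile_db, cloud_db))  # ('book', 'mobile', 0)
```

The second query for the same object is answered from the on-device cache and uses no uplink bits, which is the effect the caching in hybrid mode is meant to achieve.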

54 Hybrid Query Mode. Plot: mean precision at rank 1 (percent) vs. uplink bitrate (kbps), 30 fps. Curves: interframe coding of global descriptors with caching; independent coding of global descriptors w/o caching. Solid curves: empirical; dashed curves: model. Annotation: 88x bitrate savings

55 Conclusion: An Exciting Area! Mobile visual search is ready for prime time. Widespread use of augmented reality with HMDs is probably still some years away. Compression for visual matching is a key problem. MPEG standardization: Compact Descriptors for Visual Search (CDVS). Video is next: MPEG-CDVA. Akin to video coding in 1980: still mostly uncharted territory. Intelligent signal processing; coding & communication; system architecture; human interface

56 Bernd Girod, From Pixels to Information: Recent Advances in Visual Search
