Bag-of-Words Computer Vision. Jeff Woodard

Size: px

Start display at page:

Download "Bag-of-Words Computer Vision. Jeff Woodard"

Lesley Nash
5 years ago
Views:

1 Bag-of-Words Computer Vision Methods for Forensic Applications February Fourier Talks The Norbert Wiener Center for Harmonic Analysis and Applications University of Maryland February 18, 2011 Jeff Woodard MITRE CORP, McLean, VA Approved for Public Release: Distribution Unlimited 1

2 Problem addressed Forensic & biometrics: writer identification Strengthening th i Forensic Science in the United States: A Path Forward (National Academies report, 2009): Computer studies suggest a scientific basis for handwriting comparison, at least in the absence of intentional obfuscation or forgery The scientific basis for handwriting comparisons needs to be strengthened 2

3 Summary of Past Automated Studies Strongly influenced by human document examiners Limited publically available reports and data bases Often manual supervision or preprocessing needed Features are linguistic and/or geometrically based 3

4 Initial Processing Photo & Digitization Scanners Digital cameras, etc Paper Document Digital Image 4

Bag-of-Words Approaches Popular in computer vision applications: Object recognition & detection Video tracking Other biometrics, e.g. face, iris, fingerprint

5 Bag-of-Words Approaches Popular in computer vision applications: Object recognition & detection Video tracking Other biometrics, e.g. face, iris, fingerprint Bag-of-words methods: Do not measure classical document features Instead, extract t local l features to model local l shape Are completely automated and unsupervised 5

6 Bag of Words Algorithmic Components Local Features: Scale Invariant Feature Transform(SIFT) Vector V Quantization i (VQ) Either: probabilistic Latent Semantic Analysis (plsa) Spatial Pyramid Matching 6

7 Images Visual Word Codebook Adapted from L. Fei-Fei, Fei, R. Fergus, A. Torralba, CVPR, 2007

8 Images: Class 1 First learning Images: Class N Then recognition Class Unknown Local Features Local Features Clustering Codebook image representations Image or Class Models Class 1 Class N Adapted from L. Fei-Fei (UIUC, 2007) Class Decision

9 Local Features 9

Local Features Popularized by D. Lowe, U.

10 Local Features Popularized by D. Lowe, U. British Col. Represent distinctive shapes of small regions without segmentation Local Descriptor Figure credit: adapted from Torralba. MIT Moderately invariant or robust to: Rotation, Scale,Viewpoint, Noise, Occlusions 10

11 Local Features: Invariance Local feature 1 detected S Transform Descriptor 1 { } Local feature 2 detected Transform Descriptor 2 { }

12 Models of Image Change Geometry Rotation Similarity (rotation + uniform scale) Affine Photometry Affine intensity change (I ai + b) ) Lowe s Scale Space Laplacian of Gaussian provides scale invariance Approximate with Difference of Gaussians (DoG) (s, o ) 0 s : # of 1.6, s 0 scales 2 ( s / S 0, per o ), DoG, S 1, o octave, ( o,s ) o min O : # of I, o ( o,s 1 ), o min octaves I o ( o,s ) O 1 Adapted from A. Torralba, MIT, 2010

13 Lowe s Scale Space Scale 1 Scale 2 Scale 3 Scale 4 Scale 5 Octave 1 Octave 2 Gaussian Representation Ot Octave 3 Find Extrema Difference of Gaussians Representation 13

14 Local Features Original Image Circle center: feature x-y position Circle radius: feature scale Circle arrow: dominant orientation(s) i DoG-SIFT features 14 14

15 Vector Quantization 15

16 Vector Quantization (VQ) i.e., K-means, Generalized Lloyd clustering Hierarchical or tree-structured for speed Many different local descriptors quantized to small codebook -- visual words Encode each visual word with unique index Images characterized by histogram of indexes 16

17 VQ VISUAL WORDS 100 random image regions from Arabic, DoG detector, SIFT descriptor, 512 codewords Region sizes vary: all displayed identically Regions for VQ codeword 1 Regions for VQ codeword 2 Regions for VQ codeword 3 17 Visual words represent fundamental properties associated with early to middle vision

18 Bag of Words Representation Image Local Features 5 Images Histogram VQ encoding 3 Counts VQ index

19 Classification by plsa Hofmann, T. (2001), Unsupervised Learning by Probabilistic Latent Semantic Analysis, Machine Learning Journal,, 42(1), 177:196. Sivic, J., Russell, B., Efros, A., Zisserman, A., & Freeman, W. (2005), Discovering object categories in image collections, MIT AI Lab Memo AIM

20 probabilistic Latent Semantic Analysis (plsa): Text Documents documents documents modeled as combinations of latent topics Text Document Latent topic Words Generative View Select a document d i with prob P(d i ) Pick latent class z k with prob P(z k d i ) d z w Generate keyword w j with prob P(w j z k ) Boxes replicate IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 3, MARCH 2002 A Survey of Convergence Results on Particle Filtering Methods for Practitioners Dan Crisan and Arnaud Doucet Abstract Optimal filtering problems are ubiquitous in signal processing and related fields. Except for a restricted class of models, the optimal filter does not admit a closed-form expression. Particle filtering methods are a set of flexible and powerful sequential Monte Carlo methods designed to solve the optimal filtering problem numerically. The posterior distribution of state is approximated by a large set of Dirac-delta masses samples/particles) that evolve randomly in time according to the dynamics of the model and the observations. The particles are interacting; thus, classical limit theorems relying on statistically independent samples do not apply. In this paper, our aim is to present a survey of recent convergence results on this class of methods to make them accessible to practitioners. Index Terms Bayesian estimation, optimal filtering, particle filtering, sequential Monte Carlo, state-space models. I. INTRODUCTION MANY models in signal processing can be cast in a statespace form. In most applications, prior knowledge of the system is also available. This knowledge allows us to adopt a Bayesian approach, that is, to combine a prior distribution for unknown quantities with a likelihood function relating these quantities to the observations. Within this setting, one performs inference on the unknown state according to the posterior distribution. Often the observations arrive sequentially in time and Particle Filter Count Keywords Word Index Bag of Words Word 1 Filter 5 2 Particle 2 j N Method 12 Result 1 Counts 20

object Feature index Generative View Select a

object z k with prob P(z k di) d z w Generate

Words Word Index Word Counts Writing Style

21 plsa for Images images modeled as combinations of latent objects Image Latent object Feature index Generative View Select a an image d i with prob P(d i ) Pick latent object z k with prob P(z k di) d z w Generate VQ index w j with prob P(w j z k ) Bag of Words Word Index Word Counts Writing Style Extract Features Quantize Count Visual Words L 1 21

22 probabilistic Latent Semantic Analysis (plsa) w= VQ index d = image z = object d j w P(z i k d j ) P(w i z k ) z k specify # of objects 1, Hierarchical Bayesian relation 2,..., L measure directly estimate estimate Observed VQ index distributions per image VQ index distributions per object Object distributions per image EM likelihood function Adapted from W. Freeman, MIT (2005) L ndw (, )log Pdw (, ), Pwd ( ) ndw (, ) d w n(d,w) = raw histogram 22

23 Querying PLSA: Result object overlap: probability that chosen objects in first and second images are similar: cosine similarity K sim( d, d ) P( z d ) P( z d ) i m k i m k 1 α (, ) (, ) nn L K ij mj Pzk di wj Pzk dm wj j 1 k 1 n ij nmj j j VQ index sense overlap: do both VQ indexes refer to the same object? VQ index overlap: do both images contain common indexes? retrieved class d * arg max { sim( d, d )} j i j 23 Idea adapted from slide and research by T. Hofmann, Brown Univ

24 plsa Object Discovery Ranked by max P(d z k ) = argmax P(Image Object k ) Object 1 Style = tight Object 2 Style = loose Ranked left-to-right, top to bottom Writer indexes shown in green 24

25 Classification by Spatial Pyramid Matching Lazebnik, S., Schmid, C., & Ponce, J. (2006) Beyond Bag of Features: Spatial Pyramid Matching for Recognizing g Natural Scene Categories, Proceedings, CVPR. 25

26 Local feature correspondences Optimal partial match: match all of smaller set to some of larger set Depicted are Feature Locations Compare VQ indexes m < n min xi ( xi) X Y x X : i Slide credit: B. Leibe & K. Grauman, U. Texas, Ausin

27 Pyramid match Method Optimal matching too expensive Approximate optimal partial match Optimal match: O(r 3 ) Greedy match: O(r 2 log r) Pyramid match: O(r) 27 r = number of images Adapted from slide by: B. Leibe & K. Grauman, U. Texas, Ausin

28 Pyramid match kernel: Histogram intersection X Y 28 j Take minimum of each bin Slide adapted from Lazebnik, et al., U. North Carolina

29 Spatial Pyramid Matching: Processing Flow Training Image Database Training phase SIFT Train VQ VQ Codebook Testing phase Evaluation Image Database Pyramid Histograms SIFT VQ VQ indexes Spatial Pyramid

30 Experimental Results DUTCH MITRE Dutch Group Approach plsa Validation Paradigm 10-fold cross Features Hessian- Laplace Pyramid SLOO Hessian- Affine Bulacu & Schomaker (2007) SLOO Bulacu & Schomaker (2007) Codebook Rank-1 Time Size %Error (hrs) 250 Subjects docs per subject 500 total documents Style: printed, uppercase Gray Scale ? Text Independent MITRE Approach Validation Paradigm Arabic Features Codebook size Rank-1 %Error Time (hrs) plsa SLOO Hessian-Affine Pyramid SLOO Harris-Hessian Laplace Pyramid Read Codebook SLOO Harris-Hessian- Laplace Subjects 3 docs per subject 153 total documents Style: cursive Gray Scale Text Dependent Strict Leave One Out (SLOO) 30

31 Summary First known application of generative local l computer vision models to forensics or biometrics Beats best published results on Dutch Learning is totally unsupervised Works across different languages and text Publications: Woodard, J. & Lancaster. M. Computer Vision Methods for Automated Writer Recognition (2011), to be presented at the American Academy of Forensic Sciences, 63 rd Annual Scientific Meeting, February 21-26, 2011, Chicago, IL. Woodard, J., Lancaster, M., Kundu, A., Ruiz, D., & Ryan, J. (2010) Writer Recognition of Arabic Text by Generative Local Features, Proceedings, 4th IEEE International Conference on Biometrics, Washington, DC, September,

32 Backups 32

33 References Gersho, A. & Gray, R. (1992) Vector Quantization and Signal Compression, Boston: Kluwer Academic Press. Hofmann, T. (2001) Unsupervised Learning by Probabilistic Latent Semantic Analysis, Machine Learning Journal, 42(1), 177:196. Lazebnik, S., Schmid, C., & Ponce, J. (2006) Beyond Bag of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Proceedings, CVPR. Sivic, J., Russell, B., Efros, A., Zisserman, A., & Freeman, W. (2005) Discovering object categories in image collections, MIT AI Lab Memo AIM Tuytelaars, T. & Mikolajczyk, K. (2007) Local Invariant Feature Detectors A Survey. Foundations and Trends, Computer Graphics and Vision, Vol. 3, No. 3, Woodard, J. & Lancaster. M. Computer Vision Methods for Automated Writer Recognition (2011) accepted for presentation at the American Academy of Forensic Sciences, 63 rd Annual Scientific Meeting, February 21-26, 2011, Chicago, IL. Woodard, J., Lancaster, M., Kundu, A., Ruiz, D., & Ryan, J. (2010) Writer Recognition of Arabic Text by Generative Local Features, Proceedings, 4th IEEE International Conference on Biometrics, Washington, DC, September,

34 Local features Distinctive Robust to changes Indirect Match Aggregate local statistics: Direct match Indirect match First Image First Object Image Local image patches corresponding To feature locations Direct Match Second Image Adapted from Dave Lowe, U. of British Columbia 34

35 Lowe detector: Scale-Space decimated Low-Pass, Gaussian filtered versions of image Bandpass Bandpass filtered versions of image 35 Borrowed from Dave Lowe, U. of British Columbia, 2004

36 Scale & Orientation Assignment Use Gaussian &DoG representations ti Dt Detected tdk keypoint it DoG image Borrowed from Yung-Yu Chuang, National Taiwan U., 2006

Scale & Orientation assignment Gradient Magnitude: m(x, y) = (L(x + 1, y)

1((L(x, y + 1) L(x, y 1))/(L(x + 1, y) L(x 1, y))) DoG Arrows Direction:

filtered image, L, at closest scale: This is the scale Gradients

37 Scale & Orientation assignment Gradient Magnitude: m(x, y) = (L(x + 1, y) L(x 1, y))2 + (L(x, y + 1) L(x, y 1))2 Gradient Orientation: θ(x, y) = tan 1((L(x, y + 1) L(x, y 1))/(L(x + 1, y) L(x 1, y))) DoG Arrows Direction: represents orientation Length: represents magnitude Take Gradient Gaussian filtered image, L, at closest scale: This is the scale Gradients superimposed over Gaussian filtered image Adapted from Yung-Yu Chuang, National Taiwan U., 2006

38 Orientation assignment 4 Rotate gradient magnitudes to dominant orientation after orientation is found 1 Form bins containing sum of weighted gradient magnitudes corresponding to orientations at that range of angles 2 Find peak 25 degrees Bin index 180 degrees 360 degrees 3Determine angle corresponding to bin This is the orientation Adapted from Yung-Yu Chuang, National Taiwan U., 2006

8 directions 16 x 8 = 128 dimensional feature

structure: Strength of Gaussian filtered image

degree slices 25 o Parameters determined

to 8 directions Filtered Image Gradients

39 SIFT Descriptor 4 x 4 Gradient window. Compute for 16 windows Histogram each window in 8 directions 16 x 8 = 128 dimensional feature vector Represents stable, local image structure: Strength of Gaussian filtered image gradient Quantized to 8 directions, or 45 degree slices 25 o Parameters determined experimentally 16 gradient magnitudes quantized to 8 directions Filtered Image Gradients rotated to dominant orientation 16 x 4 x 4 Adapted from slide by Jonas Hurrelmann 39

40 Some SIFT-Family detectors SIFT Detector Relative Strengths Relative Weaknesses Hessian-DOG (Lowe) Invariant to rotation & scale. Good for blobs. Corners, not fully affine invariant Occlusions & clutter may be an issue. Hessian-Laplace Very similar to Hessian-DOG but slightly Corners, not fully affine invariant higher location accuracy Occlusions & clutter may be an issue. Maximally Stable Affine invariant. Good for viewpoint change. Good with homogeneous regions with distinct boundaries. Good all-around. around Blur. Heterogeneous or complex-shaped regions. Hessian-Affine Blobs & ridges. Affine invariant. Good allaround across many sources of change. Good for occlusions & clutter. Harris-Affine Harris-Laplace Harris-Hessian-Laplace Blobs & ridges. Affine invariant. Good for occlusions & clutter. Invariant to rotation & scale. Good for corners. Corners. Good for occlusions & clutter. Not fully affine invariant, blobs Edge-Laplace Edges & corners. Not fully affine invariant, blobs. Based on the studies of Mikolajczyk & Schmid (2005), Mikolajczyk et. al, (2006) 40

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Adding spatial information Forming vocabularies from pairs of nearby features doublets