Frequent Inner-Class Approach: A Semi-supervised Learning Technique for One-shot Learning


Izumi Suzuki, Koich Yamada, Muneyuki Unehara
Nagaoka University of Technology, Kamitomioka, Nagaoka, Niigata, JAPAN
{suzuki, yamada, unehara}@kjs.nagaokaut.ac.jp

Abstract. The frequent inner-class (FIC) approach obtains knowledge from both supervised and unsupervised learning in order to recognize objects from a small number of labeled training data; it is one implementation of one-shot learning. Irrespective of the features used to represent the data, the FIC approach must (1) represent each data item not as a single point of feature space but as various points of feature space, and (2) acquire knowledge from training data to learn useful representations of the data. In this research, each data representation is regarded as a class or, more specifically, an inner class. Also, in this paper, any training data are assumed to constitute a set of objects, each of which belongs to a class. As the supervised learning proceeds, it modifies the usefulness value of each inner class and generates higher-level inner classes that consist of classes related to each other. This technique can be applied to many types of image features, including those obtained when neural networks reduce vector dimensions. Also, the classes recognized by the FIC approach are an extension of the formal representation of training data.

Keywords: Classification, Data representation, Co-occurrence, Feature space, Image recognition by humans

1 Introduction

Image recognition, as well as image classification, requires a large quantity of training data. Even more data are required when the objects in a class vary widely. Meanwhile, humans sometimes recognize the class of an object even when the number of labeled data is very small, or even one, especially when the person is used to seeing the object. For example, a person who does not know the tulip will recognize it as a flower after training with only one labeled data item, if the person is used to seeing the flower shape of the tulip. This is known as one-shot learning [1], which, in computer vision, aims to learn information about object classes from a small number of training data by transferring knowledge [2] of the intra-class variabilities of learnt classes through model parameters, shared parts, or contextual information. The intra-class variabilities include object variations caused by lighting changes, viewpoint changes, and occlusions.

This paper introduces the frequent inner-class (FIC) approach for one-shot learning, which is carried out by collecting a large number of data representations (inner classes) that occur frequently in numerous unlabeled data, and by identifying the inner classes that best match the unknown object.

The features useful for computer vision may not always be the same as those that a human would perceive from the object. However, this difference does not invalidate the following two requirements that human recognition places on the FIC approach.

First, each object is represented not as a point of feature space but as various points of feature space. This is not the same as having many types of features or detecting many local features in one object. Each point of feature space is referred to as a data representation and corresponds to a class. For example, from Figure 1(a), a human can perceive various data representations, or classes, such as the following:

- By using object segmentation, (1) calculator or (2) keychain;
- By outlines, (3) rectangle, (4) rounded rectangle, or (5) button (a shape with more detail);
- By texture, (6) metallic; and
- By combining data representations, (7) the class of a combination of calculator and keychain or (8) the class of a calculator and keyring combined in a specific way.

Second, simply collecting data representations from unlabeled images is not enough for the FIC approach. Knowledge must first be acquired from various training data to select useful data representations. In the above discussion of Figure 1(a), classes 3 and 6 are too common and classes 5 and 8 are too rare, which gives these classes little chance of being useful; meanwhile, the other classes have a chance of being useful data representations. Class 1, calculator, is selected if both images in Figure 1 are assigned to the same class. In this way, the FIC approach is a form of semi-supervised learning [3] that uses both unlabeled data and training data, such as labeled data. Also, the FIC approach is an application of feature learning [4], in the sense that it discovers useful features.

Fig. 1. Two images, (a) and (b). Object (a) has aspects of at least two classes: calculator and keychain. (All the images used in this paper are from the Caltech-256 data set [5].)

2 Frequent Inner-class (FIC) Approach

One data point in a feature space is referred to as a data representation. Given unlabeled data and training data, the FIC approach is a semi-supervised learning technique that discovers the more useful data representations among the various features of an object. The following are three examples of training data:

1. Objects of the same class. The class label is not necessary. This example includes various appearances of the same object caused by different views or illumination, or a group of similar but different objects, as in Figure 2(a-c), as well as dissimilar objects of the same class, as in Figure 2(d).
2. Suggested data representations. For example, a labeled image of a lenticular cloud suggests an important data representation for its class label.
3. Rare but important objects. For example, a person who was bitten by an animal can, owing to that experience, regard the animal as dangerous and important.

Only the first example is considered in this paper.

3 Related Techniques

The purpose of this technique is to optimize data representation, and it can be applied to image classification, such as classification performed by a support vector machine (SVM). In fact, the classes recognized by the FIC approach can be regarded as an extension of the formal data representation of training data, in which one object corresponds to one data representation.

The difference between the extraction of many data representations in this technique and the extraction of many local features, such as scale-invariant feature transform (SIFT) [6] features, is that the local features are transformed into a single data representation by a pooling process [7], such as making a histogram using the bag-of-features (BOF) method [8]. Convolutional neural networks (CNNs) [9] also explore the optimal data representation from among various data representations of an object by optimizing network parameters. However, the optimized parameters of a CNN are specialized for representing the objects of several classes.

On the other hand, applying importance values to classes in the FIC approach is equivalent to the use of weights in text data processing, of which the term frequency-inverse document frequency (TF-IDF) method is a well-known example. In fact, learning from unlabeled data determines one importance value, which corresponds to the term frequency, and learning from labeled data determines another importance value, which corresponds to the inverse document frequency.

Conventional semi-supervised learning and the FIC approach differ in their purpose. The purpose of semi-supervised learning is to draw class boundaries in the feature space, whereas the purpose of the FIC approach is to add weights to data representations in the feature space.

By combining data representations, the co-occurrence of features can be taken into consideration. Part-based models also consider the co-occurrence of features. For example, the constellation model [10] represents a face by describing the spatial arrangement of parts, such as the eyes, mouth, and nose. Nowozin et al. [11] proposed a technique to find the best combination of visual words for image processing. A more recent example is the sparse visual bigram [12], which represents the co-occurrence of any pair of neighboring visual words.

4 Feature Space

The feature space has no particular restriction except that (1) each object must be represented not as a point of feature space but as various points of feature space, and (2) a data representation can be made by combining or spatially arranging more than one data representation. If a data representation can be made from more than one data representation, then data representations can be generated almost endlessly from one object. Although it ignores spatial information, the BOF method satisfies this condition. Dimension reduction by neural networks, such as CNNs, which form data representations from pixelated images by repeated pooling and sampling processes, also satisfies this condition. This is the same as forming a data representation by simply scaling down the pixelated image. Designing the feature space more effectively for the FIC approach is a challenge for the future.

5 Method for FIC Approach

Unsupervised learning and supervised learning can be carried out in parallel using the following overall approach.

Knowledge

- The data representations are collected by unsupervised learning and then quantized by combining similar representations to make inner classes.
- The collection of inner classes with their usefulness values is the knowledge to be acquired.
- Inner classes are low level if they occur frequently in unlabeled data, or high level if they occur rarely in unlabeled data.

Unsupervised Learning (with unlabeled data)

- Extract all (or, depending on the case, sampled) data representations from each data item to create the knowledge.
- Higher-level inner classes are given a higher usefulness value.
- Extremely high-level inner classes are deleted, since they are less useful.
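The unsupervised learning step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each data item is assumed to be a list of discrete visual words, a data representation is assumed to be any combination of up to `max_size` words, and the logarithmic rarity score and the `min_count` deletion threshold are hypothetical choices, since the paper does not give a formula for the usefulness value. The quantization of similar representations is omitted.

```python
from collections import Counter
from itertools import combinations
from math import log


def extract_representations(visual_words, max_size=2):
    """Return every combination of up to max_size visual words as a frozenset."""
    words = sorted(set(visual_words))
    return {
        frozenset(combo)
        for size in range(1, max_size + 1)
        for combo in combinations(words, size)
    }


def unsupervised_learning(unlabeled_items, max_size=2, min_count=2):
    """Build the knowledge: a dict mapping inner class -> usefulness value."""
    counts = Counter()
    for item in unlabeled_items:
        # Each item contributes at most once to the count of an inner class.
        counts.update(extract_representations(item, max_size))

    n_items = len(unlabeled_items)
    knowledge = {}
    for inner_class, count in counts.items():
        if count < min_count:
            # Extremely high-level (too rare) inner classes are deleted.
            continue
        # Assumed score: rarer (higher-level) classes get a higher usefulness.
        knowledge[inner_class] = log(n_items / count) + 1.0
    return knowledge


items = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a", "b", "d"]]
knowledge = unsupervised_learning(items)
for inner_class, usefulness in sorted(knowledge.items(), key=lambda kv: kv[1]):
    print(sorted(inner_class), round(usefulness, 3))
```

In this toy data set, the word "a" occurs in every item and therefore receives the lowest usefulness, whereas the rarer pair {a, d} scores higher than the more frequent pair {a, b}; the word "c" occurs only once and is deleted.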

Supervised Learning (with training data, carried out with the recognition process)

- The usefulness value of the recognized inner classes is increased, while the usefulness values of the other classes are slightly decreased.
- If more than one class is recognized, generate a higher-level inner class equal to the set of recognized classes. In this generation, the related classes, such as co-occurring classes, are kept track of.
- If a label is attached to the training data, then attach the same label to the recognized inner classes.
- Delete extremely useless inner classes.

Recognition

- When the data constitute one object:
  - Extract data representations from the unknown object, and then list all the matched inner classes.
  - Regard some of the most useful (and therefore, perhaps, higher-level) inner classes among the matched inner classes as the recognized classes. (The process that decides on one class is not discussed in this paper; perhaps humans make the decision using other information, such as the surroundings.)
- When the data contain more than one object:
  - If a class is common to every object, then this class is the recognized class and this set of objects is said to be recognizable.
  - If the data are not recognizable, make subsets of the objects so that each subset is recognizable and the union of the subsets equals the data. For example, make the subsets {a, b}, {b, c}, and {d} for the objects in Figure 2. Another condition might be required for making subsets; however, the authors are still investigating this.
  - The recognized classes to be output are (1) the recognized classes for each subset and (2) the subsets, i.e., the inner classes.

6 Extended Training Data for Classifier

During the recognition process, when the data contain more than one object, the case where two objects do not share a common subset can be interpreted as the two objects giving different impressions even though they belong to the same class. Examples of this are (1) the objects a and c and (2) the object d and the other objects in Figure 2. This case also arises if the feature space design is inadequate. However, the recognized classes are still useful as training data. This is because the formal representation of training data, where each object has only one data representation, is made by taking the obvious singleton sets as the subsets and by choosing one recognized class in each singleton set (see Figure 3).

Fig. 2. Objects (a), (b), (c), and (d) of the class mailbox. The impressions are different between (d) and the rest.

Fig. 3. Illustration of the recognized classes in each object of Fig. 2. Recognized classes are marked by (a), (b), (c), and (d). Hatched symbols are the recognized classes of {a, b} and {b, c}. The formal representation of training data, where one data representation corresponds to one object, is marked by a checkmark (✓).

If the feature space design is inadequate, then some subsets, {b, c} for example, may not be recognizable. In that case, the data representation falls back to the formal representations, perhaps the checkmarks.
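The multi-object recognition step of Section 5 can be illustrated with a small Python sketch. It assumes that the matched inner classes of each object are already available as a set of hypothetical class names (`p`, `q`, and so on), and it takes the maximal recognizable subsets as the subsets to output. That choice is an assumption made here for illustration; the paper states that the condition for making subsets is still under investigation. The brute-force search is only practical for a handful of objects.

```python
from itertools import combinations


def common_classes(matched, names):
    """Inner classes shared by every named object (empty set if none)."""
    return set.intersection(*(matched[name] for name in names))


def maximal_recognizable_subsets(matched):
    """Return (subset, recognized classes) pairs, largest subsets first.

    matched maps an object name to the set of inner classes it matches.
    A subset is recognizable if its objects share at least one class.
    """
    names = sorted(matched)
    found = []
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            common = common_classes(matched, subset)
            if not common:
                continue  # not recognizable
            if any(set(subset) <= set(bigger) for bigger, _ in found):
                continue  # already covered by a larger recognizable subset
            found.append((subset, common))
    return found


# Hypothetical matched inner classes for the four mailbox objects of Fig. 2.
matched = {
    "a": {"p", "q"},
    "b": {"q", "r"},
    "c": {"r", "s"},
    "d": {"t"},
}
result = maximal_recognizable_subsets(matched)
for subset, classes in result:
    print(set(subset), "->", classes)
```

With these made-up inputs, objects a and c share no class, so the whole set is not recognizable and the sketch returns the subsets {a, b}, {b, c}, and {d}, mirroring the example in the text. An object that matches no inner class at all would be left out of every subset, so the sketch assumes each object matches at least one class.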

7 Simulation

We are preparing a simulation to check the effectiveness of the FIC approach. The results will be reported as soon as possible. Actual images are not used in the model. Instead, data representations are synthesized, and then various data representations are generated by combining them. Spatial information is not considered in the model. The feature space is composed of (1) N elements, referred to as visual words, as elementary classes and (2) combinations of at most M elements from the set of N elements. Therefore, the cardinality of the feature space is $\sum_{m=1}^{M} C(N, m)$, where $C(N, m)$ is the number of m-combinations from the set of N elements. An image comprises R different visual words and their combinations. Whether two images belong to the same class is determined by the number of visual words common to the two images.

The following issues are to be examined by the simulation:

- How the deletion of less useful inner classes affects performance. Deletion is necessary to prevent overexpansion of the number of inner classes.
- The effectiveness of supervised learning, that is, whether the recognition rate increases for objects belonging to a class already learned using training data. Furthermore, the authors expect that the recognition rate also increases for objects of classes that have not been learned yet.
- The best order of learning, determined by comparing two modes: (1) unsupervised learning is carried out first and supervised learning follows separately, and (2) unsupervised and supervised learning are carried out in parallel.
- What kind of training data should be learned, and in what order. For example, does learning the same data repeatedly affect performance?

8 Conclusion

This paper introduced the frequent inner-class (FIC) approach for one-shot learning, that is, the recognition of an object from few labeled data by acquiring knowledge, in advance, from a large amount of unlabeled data. This technique is one implementation of one-shot learning. Because knowledge must be acquired from training data to select useful features, the FIC approach is a form of semi-supervised learning. This technique is expected to improve the performance of image classifiers, such as SVMs, since the classes recognized by the FIC approach can be regarded as an extension of the formal data representation, in which one object corresponds to one data representation. The authors intend to carry out the simulation as soon as possible and then perform experiments with actual images. Our priority is to design the feature space effectively, which is key to the FIC approach to one-shot learning.
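The synthetic model of Section 7 is small enough to check numerically. The following Python sketch computes the feature-space cardinality, enumerates the data representations of one synthetic image, and applies a same-class test. The function names and the `min_shared` threshold are assumptions introduced for illustration; the paper says only that class membership is determined by the number of shared visual words and does not give a threshold.

```python
from itertools import combinations
from math import comb


def feature_space_size(n_words, max_combo):
    """Cardinality of the feature space: sum of C(N, m) for m = 1..M."""
    return sum(comb(n_words, m) for m in range(1, max_combo + 1))


def image_representations(visual_words, max_combo):
    """All combinations of at most max_combo of an image's visual words."""
    words = sorted(set(visual_words))
    return [
        frozenset(combo)
        for m in range(1, max_combo + 1)
        for combo in combinations(words, m)
    ]


def same_class(image_a, image_b, min_shared):
    """Assumed rule: same class if at least min_shared words are shared."""
    return len(set(image_a) & set(image_b)) >= min_shared


print(feature_space_size(10, 3))                     # 10 + 45 + 120 = 175
print(len(image_representations([1, 2, 3, 4], 2)))   # 4 + 6 = 10
print(same_class([1, 2, 3], [2, 3, 4], min_shared=2))
```

Even for modest N and M the feature space grows quickly, which is one reason the deletion of less useful inner classes is listed among the issues to be examined.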

References

1. Fei-Fei, L., Fergus, R., Perona, P.: One-Shot learning of object categories. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 28(4) (2006)
2. Miller, E., Matsakis, N., Viola, P.: Learning from one example through shared densities on transforms. In: CVPR, v(1) (2000)
3. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press (2006)
4. Bengio, Y., Courville, A., Vincent, P.: Representation Learning: A Review and New Perspectives. IEEE Trans. PAMI, special issue Learning Deep Architectures, 35 (2013)
5. The Caltech 256. Web. May 2017.
6. Lowe, D. G.: Distinctive Image Features from Scale-invariant Keypoints. Int. Journal of Computer Vision, Vol. 60, No. 2 (2004)
7. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning Mid-Level Features For Recognition. In: CVPR (2010)
8. Csurka, G., Bray, C., Dance, C., Fan, L.: Visual Categorization with Bags of Keypoints. In: Workshop on Statistical Learning in Computer Vision, European Conference on Computer Vision (2004)
9. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, 86(11) (1998)
10. Weber, M., Welling, M., Perona, P.: Unsupervised Learning of Models for Recognition. In: Proc. of the 6th European Conference on Computer Vision (ECCV '00), Part I (2000)
11. Nowozin, S., Tsuda, K., Uno, T., Kudo, T., Bakir, G.: Weighted Substructure Mining for Image Analysis. In: CVPR '07 (2007)
12. Jiang, Y. G., Yang, J., Ngo, C.W., Hauptmann, A.G.: Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study. IEEE Trans. on Multimedia 12(1) (2010)
