Color Based Web Images Retrieval by Text Retrieval Technique


Course Project of CS534 Web Data Management
Instructor: Prof. Meng Weiyi
Final Report
Submitted by Xiaozhou Wei
May 07, 2001

Introduction:
The purpose of this project is to implement and evaluate a fast, color-based retrieval method for Web images that builds on text retrieval techniques. Images can be described by multiple features such as color, texture, objects and collateral text; this project focuses on the color feature. Given any user-provided image, the aim is to find images in a large image database whose color distributions are similar to that of the given image.

Motivation:
People often prefer images for conveying information. The volume of digital image data has grown rapidly with the prevalence of the Internet and the World Wide Web and with advances in multimedia technology. However, image information is useful to users only when it can be browsed, searched and retrieved efficiently, so efficient image retrieval methods are in demand. This project gave us an opportunity to explore Web image retrieval in some depth.

Existing image retrieval methods include text-based and content-based methods. Text-based methods are labor-intensive, and the resulting index of image features is subjective. Most traditional methods also need large amounts of memory and elaborate data structures to store complex image features, with correspondingly heavy computation costs.

Text retrieval techniques are mature and easy to implement, so we try to derive a way to represent colors as text terms and apply it to image retrieval. Color information represented by value terms is easy to define and store. The computation is comparatively simple, which is a great benefit because image retrieval is usually carried out on enormous image databases.

Overview:
Standard retrieval techniques exist for text, but most of them are not suitable for image data. Content-based image retrieval rests on several fundamental bases:
Visual Feature Extraction
Multi-Dimensional Indexing
Retrieval System Design
Researchers in both Database Management and Computer Vision are the main contributors to image retrieval, and they approach this active area from different points of view.

A critical issue in content-based image retrieval is feature extraction. It is difficult to find feature vectors that capture image information as comprehensively as human perception does. There are several ways to represent an image, such as text-based features (keywords, annotations, etc.) and visual features (color, texture, shape, faces, edges, etc.). Researchers are still seeking the best representation for an image, which remains a task related to computer vision and image understanding.

Two practical ways are considered:
1. On the original space: partition (segment) the images into smaller sub-regions and select a representative subset to represent the images.
2. On the feature space: extract low-level feature vectors such as color, texture and shape to represent the images.

The color correlogram approach [JHRZ97] proposes a new color feature for image indexing and retrieval. It includes the spatial correlation of colors, describing the global distribution of the local spatial correlation of colors. The authors' experiments show that the color correlogram can outperform both the traditional histogram method and the more recently proposed histogram refinement method for image retrieval. Color histograms are simple and intuitive. The color coherence vector (CCV) method uses a histogram refinement approach [GPRZ96], which imposes additional constraints on histogram-based matching. Histogram refinement splits the pixels in a given bucket into several classes based on some local property.

Consideration and Approach
Image data is much more difficult to retrieve than text. Images are subject to human perception, which is difficult for a computer to represent. Our idea is to extract abstract information from images and apply traditional text retrieval methods to it.

In our proposed approach we partition the color space of images into numerous buckets such that each bucket holds a distinct color (or a group of closely related colors). For example, since the R, G and B values each lie between 0 and 255, we may divide each channel into 16 groups: 0 to 15, 16 to 31, and so on. We then have 16^3 = 4096 buckets of color groups.

Based on this partition, each image can be represented as a vector of colors with weights. Each color is treated like a term in a text retrieval system. The term frequency weight of a term in a document is replaced by the percentage of the color in an image. The document frequency weight is defined in the same way as for text. Each query can similarly be represented as a vector of colors with weights in the same color space. Once the vectors are created, retrieval based on color becomes the same problem as text retrieval.

In addition to implementing this method, we may also need to compare its effectiveness with that of a traditional color-based method and study the relationship between bucket size and retrieval accuracy.

In our plan and approach, we use the following steps to build a basic Web image retrieval system:

1. Use a Web crawler to collect Web images and build the basic image database, then refresh it periodically. Our focus is mainly on how to build a test bed that enables us to apply our method effectively. Existing, well-categorized image databases are also considered.
2. Scan the whole database: divide the R, G and B values into groups, define each color group as a term, and calculate the term frequency weights as in text retrieval. The way terms are defined is discussed in detail below. Normalize the weights, because image sizes vary significantly.
3. For each term that appears, build an inverted file list [MWDM01] containing the ids of all images that contain the term. Only entries with non-zero weights are kept: if a color does not occur in an image, that image is not entered in the color's list.
4. Create a hash table over all color-group terms, using the terms as keys and storing the inverted file list under each key.
5. For any given query image, calculate its term frequency weights. Using the same algorithm as in text retrieval, calculate the similarity of each image in the database to the query image.
6. Finally, sort the images by descending similarity and display the top group to the user.

For a given image, each pixel is represented by a color mixed from R, G and B components, say (R_i, G_j, B_k). If we do not group colors, we have 0 ≤ i, j, k ≤ 255. Suppose we divide each channel into n groups, for example with values 0-15 as R_1, 16-31 as R_2, and so on. Then we have 0 < i, j, k ≤ n. After forming these groups we have n^3 grouped colors. We may define each color group as a term and apply our approach to it. This method is easy to implement, but its disadvantage is obvious: we cannot divide the color information into very small groups, because that results in an extremely large term vector.

Another approach under consideration does not combine the R, G and B components but uses them separately. We then have the following image-color_term matrix.
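(Layers B, G and R; the R layer is shown, and the G and B layers have the same form.)

        R_1    R_2    ...   R_i    ...   R_n
I_1     r_11   r_12   ...   r_1i   ...   r_1n
I_2     r_21   r_22   ...   r_2i   ...   r_2n
...
I_i     r_i1   r_i2   ...   r_ii   ...   r_in
...
I_m     r_m1   r_m2   ...   r_mi   ...   r_mn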
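The two ways of defining terms described above can be sketched in Java. This is a minimal illustration, not the project's code: the class name ColorTerms and its method names are hypothetical, equal-width groups are assumed, and group indices are 0-based (so R_1 in the text corresponds to group 0). Pixels are assumed to be packed 0xRRGGBB integers, the layout of the low 24 bits returned by BufferedImage.getRGB.

```java
/** Hypothetical helper (not from the report): maps pixels to color terms. */
final class ColorTerms {
    private ColorTerms() {}

    /** 0-based group index of a channel value in 0..255 when the range is split into n equal-width groups. */
    static int group(int value, int n) {
        return value * n / 256;
    }

    /** Combined scheme: one term id in 0..n^3-1 per (R, G, B) group triple. */
    static int combinedTerm(int rgb, int n) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (group(r, n) * n + group(g, n)) * n + group(b, n);
    }

    /** Separate scheme: three term ids per pixel out of 3n (R: 0..n-1, G: n..2n-1, B: 2n..3n-1). */
    static int[] separateTerms(int rgb, int n) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return new int[] { group(r, n), n + group(g, n), 2 * n + group(b, n) };
    }

    /** Term frequency weights for the combined scheme: fraction of pixels falling into each term. */
    static double[] combinedFrequencies(int[] pixels, int n) {
        double[] tf = new double[n * n * n];
        if (pixels.length == 0) {
            return tf; // an empty image has no colors, so every weight stays 0
        }
        for (int rgb : pixels) {
            tf[combinedTerm(rgb, n)] += 1.0;
        }
        for (int t = 0; t < tf.length; t++) {
            tf[t] /= pixels.length; // normalize by image size
        }
        return tf;
    }

    public static void main(String[] args) {
        int n = 16;
        int pixel = 0x102030; // R = 16, G = 32, B = 48
        System.out.println("combined term: " + combinedTerm(pixel, n));
        System.out.println("separate terms: " + java.util.Arrays.toString(separateTerms(pixel, n)));
    }
}
```

Dividing by the pixel count in combinedFrequencies is one way to realize the normalization mentioned in step 2, so that images of different sizes produce comparable weights.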

For each image we need to scan three times, once each for R, G and B. With each channel divided into n groups, the number of terms becomes 3n. The inverted file lists are divided accordingly:

I(R_j) = {(I_1, r_1j), (I_2, r_2j), ..., (I_i, r_ij), ..., (I_m, r_mj)}
I(G_j) = {(I_1, g_1j), (I_2, g_2j), ..., (I_i, g_ij), ..., (I_m, g_mj)}
I(B_j) = {(I_1, b_1j), (I_2, b_2j), ..., (I_i, b_ij), ..., (I_m, b_mj)}

Here I_i is the i-th image in the image database, and r_ij, g_ij and b_ij are the R, G and B weights of the j-th color term in the i-th image, respectively.

Processing R, G and B separately shortens the term vector and reduces the processing time, but it also loses accuracy. Because each color is a combination of R, G and B terms, a single-channel term may not reflect the actual weight of a particular mixed color. Still, for colors dominated by only one of the R, G and B values, this method may give satisfactory accuracy. We hope the implementation will help to support this analysis.

For evaluation we may look for an existing image library, or fetch a number of images from the Web and manually divide them into several categories. We will then retrieve images both with a conventional color-based image retrieval method and with our own method, and compare effectiveness and efficiency.

Methodology and Implementation
First we need to build a Web image database; one way is to use a Web robot that recursively retrieves images starting from a main URL. Second, we need to build an information vector for each image from its color information and use a criterion to measure the similarity between any two images.

To guide the research effort in the right direction, evaluating system performance is important. First of all, we need to establish a well-balanced, large-scale test bed. It has to be large in scale to test scalability, and balanced in image content to test the effectiveness of the image features and the system's overall performance [JVCIR98].

To build a test bed for our image retrieval method we can either use an existing image library or build one ourselves. To build a Web image database we need a Web crawler that recursively follows all links starting from a given Web page, parses the image-related URLs, and uses a get-image method to save the contents and URLs of those images locally. Only two image formats are widely used on the Web, namely GIF and JPG. Retrieving the color information of each pixel of an image is straightforward in C or Java. For example, in Java the Toolkit class provides a method getImage:

Image getImage(String filename)

It returns an image that gets its pixel data from the specified file, whose format can be GIF, JPEG or PNG. We then define a three-dimensional array image[][][], whose first two dimensions hold the x and y coordinates of each pixel and whose third dimension holds the RGB value assigned to that pixel. Then we choose a number n, divide R, G and B into n groups each, compute the color term frequencies of each image, and store them locally.

The data structure used to store all the term frequency weight information is a two-dimensional vector. The weights are stored as inverted file lists; each element is an object that contains both the image index and the term frequency weight value.

Once we have a fixed image library, we can use the information obtained to calculate the inverse document frequency weight (idfw). Because this value changes as the database changes, we can set a threshold so that all the idfw values are updated only when the change exceeds the threshold. The refresh rate of the idfw should be selected carefully: if it is too high, refreshing is time consuming; if it is too low, the weights cannot reflect the variation of the database.

Based on this information we can compute the similarity between each image in our database and the query image with the following equation and return the nearest matches.

\[
\mathrm{sim}(q, I_i) = \cos(q, I_i) = \frac{q_1 Iw_1 + \cdots + q_n Iw_n}{|q|\,|I_i|}, \qquad |I_i| = \sqrt{\sum_{j=1}^{n} Iw_j^2}
\]

1/|I_i| is the normalization factor of image I_i, and 1/|q| is the normalization factor of the query image. q_j and Iw_j are the term frequency weights of the j-th term of the query image and of the current image, respectively.

Basic system diagram is shown below:

Web Image Database -> Scan and Processing, RGB color features -> Store tfw & idfw -> Compute Similarity -> Return top N images
Query Image -> Query image tfw and idfw -> Compute Similarity
[AMAR00]

We use our test image database to get the color term frequency weights and to build the inverted file lists for the R, G and B values. Generally a single image does not have widely scattered color values, so most term frequency weights are 0. This can be observed in testing and can be exploited to save space in the data structure. To calculate similarities for arbitrary query images, a larger image database will need to be built.

Image Database as Test bed
About two hundred images are contained in our test image database, and most have small dimensions. The reasoning is that this way I can scan the whole image database on every run and still run fast, testing on the fly. For a bigger image database we would need to save all the term frequency values during the first scan and reuse them later. This is implemented by serialization: we save all the image information into two data files and use them in later queries.

Methods
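Both methods described below rank images with the cosine measure defined above, evaluated over inverted file lists. The following is a minimal Java sketch of that scoring step, not the project's implementation. The class InvertedColorIndex, its methods, and the idf formula ln(1 + N/df) are assumptions made for illustration, since the report does not spell out its idf definition. Term weights are taken as tf times idf, and image norms are recomputed on every query for brevity; a real system would presumably store them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical inverted index over color terms (illustration only). */
final class InvertedColorIndex {
    /** One inverted-list entry: image id plus the term frequency weight of the term in that image. */
    static final class Posting {
        final String imageId;
        final double tf;
        Posting(String imageId, double tf) { this.imageId = imageId; this.tf = tf; }
    }

    private final Map<Integer, List<Posting>> lists = new HashMap<>(); // hash table: term -> inverted list
    private final List<String> imageIds = new ArrayList<>();

    /** Adds one image; only non-zero weights are entered in the lists. */
    void addImage(String imageId, double[] tf) {
        imageIds.add(imageId);
        for (int term = 0; term < tf.length; term++) {
            if (tf[term] > 0.0) {
                lists.computeIfAbsent(term, k -> new ArrayList<>()).add(new Posting(imageId, tf[term]));
            }
        }
    }

    /** Assumed idf: ln(1 + N/df); terms that occur in no image get weight 0. */
    double idf(int term) {
        List<Posting> list = lists.get(term);
        return list == null ? 0.0 : Math.log(1.0 + (double) imageIds.size() / list.size());
    }

    /** Cosine similarity of every indexed image to the query, keyed by image id. */
    Map<String, Double> similarities(double[] queryTf) {
        double queryNormSq = 0.0;
        for (int term = 0; term < queryTf.length; term++) {
            double qw = queryTf[term] * idf(term);
            queryNormSq += qw * qw;
        }
        Map<String, Double> dot = new HashMap<>();
        Map<String, Double> normSq = new HashMap<>();
        for (Map.Entry<Integer, List<Posting>> entry : lists.entrySet()) {
            int term = entry.getKey();
            double weight = idf(term);
            double qw = term < queryTf.length ? queryTf[term] * weight : 0.0;
            for (Posting p : entry.getValue()) {
                double iw = p.tf * weight;
                normSq.merge(p.imageId, iw * iw, Double::sum);
                dot.merge(p.imageId, qw * iw, Double::sum);
            }
        }
        Map<String, Double> sims = new HashMap<>();
        for (String id : imageIds) {
            double denom = Math.sqrt(queryNormSq * normSq.getOrDefault(id, 0.0));
            sims.put(id, denom == 0.0 ? 0.0 : dot.getOrDefault(id, 0.0) / denom);
        }
        return sims;
    }

    /** Image ids sorted by descending similarity, truncated to the top N. */
    List<String> topMatches(double[] queryTf, int topN) {
        Map<String, Double> sims = similarities(queryTf);
        List<String> ranked = new ArrayList<>(imageIds);
        ranked.sort((a, b) -> Double.compare(sims.get(b), sims.get(a)));
        return ranked.subList(0, Math.min(topN, ranked.size()));
    }
}
```

A caller would add one term frequency vector per database image with addImage and then pass the query image's vector to topMatches.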

[Method 1]
The first method implemented is the one I proposed in report two, which treats the R, G and B values separately and constructs term frequencies and inverted file lists for each of them. Each image is scanned once but recorded three times, once each for the R, G and B values. With each channel divided into n groups, we have 3n terms. The inverted file lists are divided accordingly:

I(R_j) = {(I_1, r_1j), (I_2, r_2j), ..., (I_i, r_ij), ..., (I_m, r_mj)}
I(G_j) = {(I_1, g_1j), (I_2, g_2j), ..., (I_i, g_ij), ..., (I_m, g_mj)}
I(B_j) = {(I_1, b_1j), (I_2, b_2j), ..., (I_i, b_ij), ..., (I_m, b_mj)}

Here I_i is the i-th image in the image database, and r_ij, g_ij and b_ij are the R, G and B weights of the j-th color term in the i-th image, respectively.

We expect this method to lose accuracy to a degree that varies with the situation. Because each color is a combination of R, G and B terms, a single-channel term may not reflect the actual weight of a particular mixed color. For example, suppose image A consists of three equal parts with colors (R_0, G_0, B_0), (R_1, G_1, B_1) and (R_2, G_2, B_2), and image B of three equal parts with colors (R_0, G_1, B_2), (R_1, G_2, B_0) and (R_2, G_0, B_1). The two images look wholly different to a human, but their similarity under this method is 1. Still, for colors dominated by only one of the R, G and B values, this method may give satisfactory accuracy, and my implementation bore this out.

Also, in this way an image can be represented by fewer terms than we had estimated. If we divide each channel into 8 segments we get only 24 terms. So in the implementation of method one we did not reduce the dimensions of the query terms.

The main class is the imagequery class; it processes every image that is read in by the imageio class.
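The image A and image B example above can be checked numerically. The sketch below is an illustration under stated assumptions rather than the project's code: the class SeparateVsCombined and its methods are hypothetical, each pixel is given directly as a triple of 0-based group indices, and raw term frequencies are used without idf weighting. It compares the separate 3n-term representation with the combined n^3-term representation.

```java
/** Hypothetical demo (not from the report): separate 3n terms versus combined n^3 terms. */
final class SeparateVsCombined {
    private SeparateVsCombined() {}

    /** 3n-term vector: fraction of pixels in each R group, each G group and each B group. */
    static double[] separateVector(int[][] pixels, int n) {
        double[] v = new double[3 * n];
        for (int[] p : pixels) {
            v[p[0]] += 1.0 / pixels.length;
            v[n + p[1]] += 1.0 / pixels.length;
            v[2 * n + p[2]] += 1.0 / pixels.length;
        }
        return v;
    }

    /** n^3-term vector: fraction of pixels in each (R, G, B) group triple. */
    static double[] combinedVector(int[][] pixels, int n) {
        double[] v = new double[n * n * n];
        for (int[] p : pixels) {
            v[(p[0] * n + p[1]) * n + p[2]] += 1.0 / pixels.length;
        }
        return v;
    }

    /** Cosine of the angle between two equally long vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        int n = 3;
        // One representative pixel per equal-sized part of each image.
        int[][] imageA = { {0, 0, 0}, {1, 1, 1}, {2, 2, 2} };
        int[][] imageB = { {0, 1, 2}, {1, 2, 0}, {2, 0, 1} };
        System.out.println("3n terms:  " + cosine(separateVector(imageA, n), separateVector(imageB, n)));
        System.out.println("n^3 terms: " + cosine(combinedVector(imageA, n), combinedVector(imageB, n)));
    }
}
```

With separate terms both images put one third of their pixels in every R, G and B group, so the two vectors coincide. With combined terms the images share no term at all, so the dot product vanishes.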
The steps are:
1. Get the list of image names from the image database directory.
2. Segment the R, G and B ranges into a chosen number of groups, scan all the images, and for each image calculate the term frequency weights by scanning every pixel.
3. For each image and each term segment, add an element to the vector headed by that term. Each element contains the file name and the weight of that term segment.
4. Compute the inverse document frequency weights as well.
5. Calculate the term weights of the query image from its own term frequency weights and the inverse document frequency weights.
6. Go through each inverted file list and calculate the similarity of each image to the query image.
7. Return the images in descending order of similarity.

[Method 2]
Method two is similar to method one in most respects, except that we get n^3 terms instead of 3n terms. It is fairly easy to adapt the first class to get the second one. We observed the same phenomenon, that coarser segmentation gives better output. Meanwhile, if we choose a larger n, we need to reduce the query terms by keeping only those whose weight is above a threshold, and to store the enormous number of inverted file lists in a hash table. The other parts of method two are the same as in method one.

The supporting classes are shared. qimgdisp is used to display the query results. The Qsort class is used to sort arrays. Imagem is a class containing only a void main method, which declares and runs instances of the other classes. The SerializeIfl class is used to save all the term frequency weight information to local disk and to de-serialize it again.

Results
Part of the calculation results of a query:
\imgdb\wntrmtshasta.jpg
\imgdb\wntrmtmckin.jpg
\imgdb\wspogolf.jpg
\imgdb\wanmevehowl.jpg
\imgdb\wntrctrlake.jpg
\imgdb\wtrvchantilly.jpg
\imgdb\wspogainer.jpg
\imgdb\Sample.jpg
\imgdb\wtrvcapbuild.jpg
\imgdb\wpeoeyes.jpg
\imgdb\wtrvbora.jpg
\imgdb\wntrmatswiss.jpg
\imgdb\wtrvcoliseum.jpg
\imgdb\google.gif
\imgdb\wsposkisla.jpg
\imgdb\wntrwtrfall.jpg
\imgdb\wanmdogbrk.jpg
\imgdb\wspoicesk.jpg
\imgdb\wtrvbasilica.jpg
\imgdb\wspodiver.jpg

Screen results: [screenshot]

Observation, Further Consideration and Evaluation
Compared with text retrieval, some observations are:
Once the way of grouping RGB colors is decided, the number of terms in our database is fixed, no matter how many new images are added later.
Generally a query image provides many more terms than a typical text query.
How we group the colors and define the terms is crucial for the system's effectiveness and efficiency.

We had estimated that finer segmentation would give higher retrieval accuracy. In practice this estimate seems not to hold: coarser segmentation seems to give better output when evaluated by a human. A possible explanation is as follows. Suppose we have B_i and B_j, both of them blue colors. For two images composed mainly of B_i and B_j respectively, we perceive the similarity of the two images as very high. But if, under a fine segmentation, the two colors are assigned to Term i and Term j separately, their contribution to the similarity is 0: because the similarity is calculated as a dot product, we get tfw[i] multiplied by 0 plus tfw[j] multiplied by 0, which does not reflect reality well. This is possibly the reason for some odd output in my tests. By tuning the number of segments we found that 4 or 8 best reflects reality. Finer segmentation usually gives poor output.

If we call the method that scans an image by R, G and B separately 3N, and the method that defines n^3 terms N3, we found that they behave quite differently on different colors.

The output quality mentioned here was evaluated only by human perception. We also need an evaluation against a mature image retrieval technique. One classic method uses HVC to build histograms of the images and computes the similarity based on area parameters of the images. H, V and C are given by:

\[
H = \cos^{-1}\left\{ \frac{0.5\,[(R-G)+(R-B)]}{\sqrt{(R-G)^2+(R-B)(G-B)}} \right\}
\]
\[
V = R + G + B
\]
\[
C = 1 - \frac{\min(R, G, B)}{V}
\]

Each image is transformed from RGB values to HVC and then divided into 4x4 = 16 equal areas. For each area a histogram based on HVC is calculated, and the histograms are used to construct a vector. With this vector and the cosine function, the similarity between two images can be calculated.

So far, because of limited time, all evaluations have been done by human perception and drawn from a small database only. Further evaluation based on metrics is expected.
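The segmentation effect discussed above, where two similar blues stop matching once they fall into different terms, can be reproduced with a small Java sketch. It is an illustration only: the class SegmentationEffect, its methods and the two sample blue values (B = 200 and B = 220) are hypothetical choices, combined n^3 terms with equal-width groups are assumed, and raw term frequencies are used without idf weighting.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo (not from the report) of how the number of groups n changes the similarity. */
final class SegmentationEffect {
    private SegmentationEffect() {}

    /** Combined term id of a packed 0xRRGGBB pixel with n equal-width groups per channel. */
    static int term(int rgb, int n) {
        int r = ((rgb >> 16) & 0xFF) * n / 256;
        int g = ((rgb >> 8) & 0xFF) * n / 256;
        int b = (rgb & 0xFF) * n / 256;
        return (r * n + g) * n + b;
    }

    /** Sparse term frequency vector: term id mapped to the fraction of pixels with that term. */
    static Map<Integer, Double> termFrequencies(int[] pixels, int n) {
        Map<Integer, Double> tf = new HashMap<>();
        for (int rgb : pixels) {
            tf.merge(term(rgb, n), 1.0 / pixels.length, Double::sum);
        }
        return tf;
    }

    /** Cosine similarity of two sparse vectors; 0 if either is empty. */
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0); // terms missing from b contribute 0
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    static double similarity(int[] imageA, int[] imageB, int n) {
        return cosine(termFrequencies(imageA, n), termFrequencies(imageB, n));
    }

    public static void main(String[] args) {
        int[] blueA = { 0x0000C8, 0x0000C8, 0x0000C8, 0x0000C8 }; // uniform blue, B = 200
        int[] blueB = { 0x0000DC, 0x0000DC, 0x0000DC, 0x0000DC }; // uniform blue, B = 220
        for (int n : new int[] { 4, 8, 16 }) {
            System.out.println("n = " + n + ": similarity = " + similarity(blueA, blueB, n));
        }
    }
}
```

With n = 4 or n = 8 both blues land in the same group, so the two images share their only term. With n = 16 they land in adjacent groups, and the dot product has no common term to multiply.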

References
[MWDM01] Web Data Management Course Lecture Notes, Prof. Meng Weiyi.
[JVCIR98] Yong Rui and Thomas S. Huang. Image Retrieval: Past, Present, and Future.
[YRTS99] Yong Rui, Thomas S. Huang and Shih-Fu Chang. Image Retrieval: Current Techniques, Promising Directions and Open Issues.
[JHRZ97] Jing Huang and Ramin Zabih. Image Indexing Using Color Correlograms.
[GPRZ96] G. Pass and R. Zabih. Histogram Refinement for Content-Based Image Retrieval. IEEE Workshop on Applications of Computer Vision, 1996.
[AMAR00] Arnold W.M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta and Ramesh Jain. Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 12, December.
[ACEDS] Augusto Celentano and Eugenio Di Sciascio. Feature Integration and Relevance Feedback Analysis in Image Similarity Evaluation.

Appendix:
Usage: java -classpath [classpath] TRBIRMain

The image library is located at [classpath\imgdb]; all the images are in GIF or JPG format.

Classes included in the system and their purposes:

TRBIRMain.java: The main class, which invokes the other supporting classes.
imagequery.java: Scans the image library to get all the information such as term frequency weights, document frequencies, etc. Implements the 3N method mentioned in the report.
imagequery3.java: A slight modification of the imagequery class that implements the N3 method.
InvertedFileListElement: A built-in class that implements the Serializable interface so that it can be serialized and de-serialized later.
SerializeIfl.java, SerializeDoc.java: These two classes serialize and de-serialize image information to and from local disk.
qimgdisp.java: Extends the JFrame class to give GUI output and calls imagequery to carry out the actual query.
QuickSort.java: Used to sort the similarity array.
imageio.java: Image IO class, downloaded from the Internet and modified for our purpose.
utils.java, BMPFile.java, iobserver.java: The other three classes are supporting classes for image IO.

The first time you run this image retrieval system, you need to set the boolean variable rescan in the imagequery class to true. After the scan has finished, set it to false. The images need to be scanned again only when the image database has changed substantially. Two .dat files are generated under the current class path after the first scan. ifl.dat contains all the information of the inverted file lists; it is a Vector array whose objects are defined by the InvertedFileListElement class. doc.dat stores the document frequency information, which is used in normalization.
