Link Graph Analysis for Adult Images Classification


Evgeny Kharitonov, Institute of Physics and Technology, Yandex LLC, 119021, 16 Lev Tolstoy st., kharitonov@yandex-team.ru
Anton Slesarev, Institute of Physics and Technology, Yandex LLC, 119021, 16 Lev Tolstoy st., slesarev@yandex-team.ru
Ilya Muchnik, Rutgers University, Yandex LLC, 119021, 16 Lev Tolstoy st., muchnikilya@yahoo.com
Fedor Romanenko, Yandex LLC, 119021, 16 Lev Tolstoy st., fedor@yandex-team.ru
Dmitry Belyaev, Yandex LLC, 119021, 16 Lev Tolstoy st., dvbelyaev@yandex-team.ru
Dmitry Kotlyarov, Yandex LLC, 119021, 16 Lev Tolstoy st., dmko@yandex-team.ru

ABSTRACT
In order to protect an image search engine's users from undesirable results, an adult image classifier should be built. The information about links from websites to images is employed to create such a classifier. These links are represented as a bipartite website-image graph. Each vertex is equipped with scores of adultness and decentness. The scores for image vertexes are initialized with zero; those for website vertexes are initialized according to a text-based website classifier. An iterative algorithm that propagates scores within a website-image graph is described. The scores obtained are used to classify images by choosing an appropriate threshold. The experiments on Internet-scale data have shown that the algorithm under consideration increases classification recall by 17% in comparison with a simple algorithm which classifies an image as adult if it is connected with at least one adult site (at the same precision level).

Keywords
Link graph analysis, adult images classification.

1. INTRODUCTION
There are two kinds of approaches which can be used to detect adult images, i.e. text-based and image-based. The text-based approach detects adult webpages using their text content and propagates this information to the linked images and pages. The image-based approach, on the other hand, uses the features contained in the image itself, such as face presence, skin-color features, connected components, etc.

The problem of adult webpage detection is a special case of the automatic text classification problem. There are many papers in this field which differ in feature extraction and selection methods, machine learning algorithms used and some other details. Sebastiani made a survey [1] of the main approaches to text classification and categorization.

Text-based approaches have significant limitations. Texts on many webpages do not correspond to their image contents. For example, a webmaster can mislead search engines, trying to show adult images in response to popular decent queries to attract a user's attention. Some pages lack enough text for good classification. Usually a special dictionary for text classification is needed. Because of limitations on the size of a training set and the variety of words used on adult webpages, it can be difficult to make such a dictionary. Another aspect is that many words and phrases may be obscene or decent depending on the context, and context analysis is a hard task.

Text-based classifiers could be applied to bigger structures such as websites. In work [2] website adultness is defined by the number of adult webpages on this website. The use of websites instead of webpages decreases classification noise.

The problem of adult image classification using image content is widely investigated in the literature. For example, Rowley et al. [3] used image features and proposed an algorithm that reaches 90% precision and 50% recall. There are papers which use a combination of text and image features, for example [4].

Further in the paper we will first introduce some related work and then present our method. After that we will provide some experimental results on a real dataset. Finally, we will point out some possible directions of our future work.

2. Related Work
When the adult image classification problem is considered on the basis of an adult website classifier, the notion of a bipartite image-website graph appears.
An edge in this graph indicates whether the website and the images are linked. The problem we address is how to classify adult images using the information about websites' adultness. The simplest classifier supposes an image to be adult if it appears on at least one adult website.

The idea to analyze internet data by exploring a process that moves over a huge graph appeared about 10 years ago. The most popular algorithms of this type are PageRank [5] and HITS [6]. Since the paper [5] was published, internet data analysis using an iterative process on some graph (often such processes describe Markov random walks) has become widespread for data classification. Castillo et al. [7] used this idea to detect spam vertices in the graph of webpages. Li et al. [8] used click graphs to improve query intent classifiers. A click graph is a bipartite graph representing click-through data. The edges therein connect queries and URLs and are weighted by the associated click counts. The authors manually labeled a small set of seed queries in a click graph and iteratively propagated the label information to the other queries until a global equilibrium state was reached. The authors of [9] used a modified HITS algorithm to detect paid links. Szummer and Craswell [10] classified webpages by analyzing a bipartite query-URL graph. The authors labeled a small set of webpages and proposed a procedure used to classify all webpages using random walks on this graph. Deng et al. [11] proposed a general Co-HITS algorithm and applied it to a query suggestion problem.
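The simplest classifier mentioned above is short enough to state in code. The following is a minimal Python sketch under stated assumptions, not the production implementation: the function name `baseline_adult_images` and the representation of the graph as `(site, image)` pairs are hypothetical and chosen only for illustration.

```python
def baseline_adult_images(edges, adult_sites):
    """Return the images linked from at least one site labeled as adult.

    edges: iterable of (site, image) pairs of the bipartite graph (assumed format).
    adult_sites: collection of sites that a site classifier labeled as adult.
    """
    adult_sites = set(adult_sites)
    return {image for site, image in edges if site in adult_sites}


# Toy data, invented for illustration only.
edges = [("adult.example", "img1"), ("news.example", "img1"),
         ("news.example", "img2")]
print(baseline_adult_images(edges, {"adult.example"}))  # {'img1'}
```

A single adult link is enough to flag an image under this rule, and an image linked only from misclassified sites is never recovered. The propagation approach is meant to soften both effects.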

In the present work we describe an iterative procedure of website adultness propagation across a bipartite image-website graph. This procedure is used as part of the adult image filtering system in the image search service of the search engine Yandex.

3. Data
The goal of the research is to classify images as adult or decent, given the text-based site classification. The algorithm uses an undirected bipartite graph $G = (V; E) = (S; I; E)$ which represents the links between websites and images in the Internet.

The vertexes of the partite $S$ represent the websites named by their URLs. All sites' URLs are truncated to their second-level domain except for several hostings (there are about 0 hostings in the exception list). For instance, we do not distinguish 1.regularhost.com, 2.regularhost.com and regularhost.com and treat them as a single vertex, but user1.livejournal.com and user2.livejournal.com are mapped to different vertexes.

Partite $I$ consists of indexed images grouped into clusters by their visual similarity using Dmitry Mikhalev's algorithm, so every vertex $i \in I$ represents a cluster of similar images. Clusterization is used in order to group resized and compressed copies of an image. Such clusterization reduces the graph size and decreases the sparseness of the graph's incidence matrix.

$V = S \cup I$ is the set of vertexes of $G$. A vertex from the first partite is connected with a vertex from the second one if there is a link (html tags img or a href) from a page on a website to at least one image in the corresponding image cluster. It should be noted that no distinction is made between cases with many and with few links between a site and an image cluster.

We also suppose that all sites have already been classified as adult or not. This initial site classification might have errors. In this paper we use the results of site classification produced by the text-based site classifier described in [2].

4. Data Model and Algorithm
We denote the image-site graph as $G(V; E) = G(S^{+} \cup S^{-}; I; E)$, where $S^{+}$ stands for the set of adult websites and $S^{-}$ stands for the rest of the websites (regarded as decent). The image set (the set of image clusters) is referred to as $I$. Our goal is to classify the images $I$ into two classes.
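The graph construction of Section 3 (host truncation, image clustering, unweighted edges) can be sketched as follows. This is an illustrative Python sketch under stated assumptions. The names `HOSTING_EXCEPTIONS`, `site_key` and `build_edges` are hypothetical, the exception list contains a single made-up entry, and the truncation simply keeps the last two host labels, which ignores suffixes such as co.uk. The `cluster_of` mapping stands in for the image clustering step, whose algorithm is not described in the paper.

```python
from urllib.parse import urlsplit

# Hypothetical exception list: hostings where every subdomain is its own site.
HOSTING_EXCEPTIONS = {"livejournal.com"}


def site_key(url):
    """Map a page URL to its site vertex (assumes the URL includes a scheme)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    second_level = ".".join(labels[-2:])
    if second_level in HOSTING_EXCEPTIONS:
        # Keep one more label so that different users stay different vertexes.
        return ".".join(labels[-3:])
    return second_level


def build_edges(links, cluster_of):
    """Build the edge set of the bipartite site-image graph.

    links: iterable of (page_url, image_id) pairs found in img / a href tags.
    cluster_of: dict mapping image_id to the id of its image cluster.
    Using a set drops multiplicity: many links and one link give the same edge.
    """
    return {(site_key(page_url), cluster_of[image_id])
            for page_url, image_id in links}


links = [("http://1.regularhost.com/a.html", "i1"),
         ("http://2.regularhost.com/b.html", "i2"),
         ("http://user1.livejournal.com/post", "i3")]
clusters = {"i1": "c1", "i2": "c1", "i3": "c2"}
print(sorted(build_edges(links, clusters)))
```

Storing edges as a set mirrors the remark that no distinction is made between many and few links between a site and an image cluster.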
With the graph $G$ we associate a binary adjacency matrix $W(G)$: $w_{ij} = 1$ if there is an edge between vertexes $i$ and $j$, and $w_{ij} = 0$ otherwise.

Let us consider a matrix $Y(V)$ which characterizes the initial label distribution on the set of vertexes. Its size is $|V| \times 2$, with rows enumerating graph vertexes and columns corresponding to class labels: $y_{i1} = 1, y_{i2} = 0$ if vertex $i$ is labeled as adult; $y_{i1} = 0, y_{i2} = 1$ if $i$ is labeled as decent; $y_{i1} = y_{i2} = 0$ if vertex $i$ has no label (this holds for all images $i \in I$).

We are looking for a $|V| \times 2$ non-negative matrix $F(V)$. The vector $F_i = (F_{i1}, F_{i2})$ is interpreted as the class score of vertex $i$; $F_{i1}$ and $F_{i2}$ are adultness and decentness, respectively. $Y$ is taken as the initial value of $F$: $F^{(0)} = Y$.

The problem of computing $F$ is an optimization problem. For every edge $(i, j)$ in the graph we are going to find a vector pair $(F_i; F_j)$ such that $F_i$ is close to $F_j$. The intuition behind this criterion is that for every pair of adjacent vertexes we want their scores of label intensity to be as close as possible. We choose the following function as a criterion:

$$\Phi(F) = \frac{1}{2} \sum_{i,j \in V} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i \in V} \left\| F_i - Y_i \right\|^2$$

Here $D$ denotes a diagonal matrix with elements $D_{ii} = \sum_{j \in V} W_{ij}$, and $\mu$ is the algorithm's parameter. We adopt this function from paper [8], where the authors consider the problem of label propagation on a set of search queries.

The first term expresses our wish for $F_i$ and $F_j$ to be close. It contains three weight multipliers $D_{ii}$, $D_{jj}$ and $W_{ij}$. $W_{ij}$ shows that we sum the discrepancy only over the graph edges. $D_{ii}$ and $D_{jj}$ are used to decrease class scores for vertexes with a lot of incident edges. The second term is used to limit the deviation of the resultant weights from the initial weights. The coefficient $\mu$ regulates the bounds of this deviation.

The solution of the optimization problem can be derived analytically:

$$F^{*} = \arg\min_{F} \Phi = (1 - \alpha)(E - \alpha A)^{-1} Y, \qquad A = D^{-1/2} W D^{-1/2}, \qquad \alpha = \frac{1}{1 + \mu}, \qquad E = \mathrm{diag}(1, \ldots, 1).$$

But there is a convenient iterative procedure to find $F^{*}$ that exploits the sparse structure of $W$ and allows us to avoid an expensive calculation of the large inverse matrix:

$$F^{(0)} = Y, \qquad F^{(n+1)} = \alpha A F^{(n)} + (1 - \alpha) Y.$$

It can be shown ([8]) that this procedure converges to the analytical solution.

For each vertex $i \in V$ we get a vector $(F_{i1}, F_{i2})$ as a result of the calculation $F^{*} = \arg\min \Phi$. In the next step we sort all the vertexes in decreasing order of the ratio $\frac{F_{i1} + \varepsilon}{F_{i2} + \varepsilon}$. After that we label the $k|V|$ top vertexes with the maximum value of this ratio as adult. Here $\varepsilon$ is some small constant and $k$ is used to adjust the tradeoff between recall and precision.

Notes to Section 3: Unfortunately, the details of the image clustering algorithm have not been published. The average size of a cluster is .7.

5. Experiments
In our experiments we use the image-site graph and the results of automatic website classification built in May 2009. No additional information is used. We remove the sites without images, because our goal is to classify images. We also remove small images (images with area less than 00 50; most of them are very uninformative: icons, avatars, buttons, etc.). The resultant graph consists of $1.6 \cdot 10^{6}$ sites, $444 \cdot 10^{6}$ image groups and $735 \cdot 10^{6}$ edges. The average number of images connected with a site is 460.
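The iterative procedure and the ranking step of Section 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production code: the function names `propagate_scores` and `top_adult_images` are hypothetical, the graph is held in plain dictionaries rather than a sparse matrix, the default `eps` is an assumed value, and the ranking keeps a fraction `k` of the image vertexes only (a simplification of labeling the top vertexes of the whole graph).

```python
import math


def propagate_scores(edges, adult_sites, alpha=0.5, n_iter=5):
    """Iterate F(n+1) = alpha * A * F(n) + (1 - alpha) * Y on a site-image graph.

    edges: iterable of (site, image) pairs; every site not in adult_sites is
    treated as decent. Returns {vertex: (adultness, decentness)}.
    """
    neighbors = {}
    for site, image in edges:
        s, i = ("site", site), ("image", image)
        neighbors.setdefault(s, set()).add(i)
        neighbors.setdefault(i, set()).add(s)
    degree = {v: len(nb) for v, nb in neighbors.items()}  # D_ii

    # Initial labels Y: adult sites (1, 0), other sites (0, 1), images (0, 0).
    labels = {}
    for kind, name in neighbors:
        if kind == "image":
            labels[(kind, name)] = (0.0, 0.0)
        elif name in adult_sites:
            labels[(kind, name)] = (1.0, 0.0)
        else:
            labels[(kind, name)] = (0.0, 1.0)

    scores = dict(labels)  # F(0) = Y
    for _ in range(n_iter):
        updated = {}
        for v, nb in neighbors.items():
            adult = decent = 0.0
            for u in nb:
                # Entry of A = D^(-1/2) W D^(-1/2) for the edge (v, u).
                a_vu = 1.0 / math.sqrt(degree[v] * degree[u])
                adult += a_vu * scores[u][0]
                decent += a_vu * scores[u][1]
            updated[v] = (alpha * adult + (1 - alpha) * labels[v][0],
                          alpha * decent + (1 - alpha) * labels[v][1])
        scores = updated
    return scores


def top_adult_images(scores, k, eps=1e-3):
    """Sort images by (adultness + eps) / (decentness + eps); keep the top fraction k."""
    images = [v for v in scores if v[0] == "image"]
    images.sort(key=lambda v: (scores[v][0] + eps) / (scores[v][1] + eps),
                reverse=True)
    return [name for _, name in images[:int(k * len(images))]]


# Toy graph, invented for illustration only.
edges = [("adult.example", "img1"), ("adult.example", "img2"),
         ("news.example", "img2"), ("news.example", "img3")]
scores = propagate_scores(edges, {"adult.example"})
print(top_adult_images(scores, k=0.34))  # ['img1']
```

Each iteration touches every edge a constant number of times, so the cost grows linearly with the number of edges and iterations. This is why the sparse iterative form is preferred over inverting $E - \alpha A$.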

To estimate the quality of the image classifier we use an additional dataset. All images in the dataset were labeled manually as adult or not. The dataset consists of 9475 random images collected from the internet. 7 of them are adult.

We have performed 4 experiments depending on 3 parameters. These parameters are: the number of iterations ($n$), the parameter which regularizes the discrepancy of $F$ from $Y$ ($\alpha$), and the size of the adult images class ($k$). $\varepsilon$ is a constant in all experiments. Its value is 0.00.

The following scheme is used to choose the parameters' values. First of all we choose $\alpha = 0.5$ from the problem's specifics (after that $\alpha$ was varied in 0.2-0.8). We also take $k = 0.04$ as reasonable according to the problem specifics. We have performed experiments with these parameters with $n$ varied from 1 up to 30 (Fig. 1, Fig. 2). After determining the best value of $n$ we vary $\alpha \in \{0.2, 0.4, 0.6, 0.8\}$ and $k \in \{0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.10, 0.12\}$ (32 combinations total) with fixed $n$ (Fig. 3).

To estimate the classification quality we use two metrics, i.e. precision and recall. Let us consider a task of classifying instances $S$ into classes $\{+, -\}$. Experts label $S_e^{+} \subset S$ as instances of class $+$ and $S_e^{-} \subset S$ as instances of class $-$. The classifier splits the set $S$ into two sets of instances $S_c^{+}$ and $S_c^{-}$ classified as $+$ and $-$, correspondingly. Then

$$\mathrm{precision} = \frac{|S_e^{+} \cap S_c^{+}|}{|S_c^{+}|}, \qquad \mathrm{recall} = \frac{|S_e^{+} \cap S_c^{+}|}{|S_e^{+}|}.$$

6. Discussion
At first it should be noted that the metrics converged (Fig. 1, 2) after a few iterations were performed. Five iterations seem to be enough for all future experiments. Moreover, this fact has practical consequences, since the number of iterations affects computational complexity.

As can be seen from Fig. 3, the variation of $\alpha$ with fixed $k$ produced a very weak effect on both recall and precision, while the variation of parameter $k$ allows us to change recall and precision within a wide range.

These experiments allow us to compare the algorithm under consideration with the naive rule that classifies all the images with at least one link from an adult site as adult themselves and as decent otherwise. The algorithm outperforms this simple rule (referred to as baseline in Fig. 3) by 17% in recall without changes in precision. The algorithm's recall is very important due to the task specifics.

Figure 1. Precision vs. Iterations
Figure 2. Recall vs. Iterations
Figure 3. Precision vs. Recall

7. Conclusions and Future Work
We have presented a new adult image classification algorithm that makes use of the image-website link structure, given prior site classification results. The following intuition is behind it: if a lot of suspicious sites have a picture on their pages, then the picture is a suspicious one; if a site contains a lot of suspicious images, then it is suspicious itself.

In addition we cluster images. This makes the image-site graph less sparse, reduces the number of vertexes and provides more information about a single image in the internet. We believe clustering to be important in our task because many adult sites copy adult images from each other.

The novelty of the approach is the use of the image-site graph in order to solve the classification problem. Our experiments have shown that the system scales well and performs reliably as part of a production system.

The described procedure can also be applied to graph classification tasks of a different nature. For instance, the procedure can be used in order to detect spam sites using the internet link structure or to detect commercial images using the website-image links.

For future work our short-term focus is to use textual caption-based image classifiers. It would provide us with labels for the image partite of the graph, which promises new resources for improving the algorithm. We are also going to investigate how the use of individual pages instead of sites affects the algorithm's performance.

8. REFERENCES
[1] Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys 34(1): 1-47.

[2] Maslov, M., Pyalling, A. and Trifonov, S. 2008. Web sites automatic categorization. In RCDL '08: The Tenth Anniversary of All-Russian Research Conference, 30 35.
[3] Rowley, H., Jing, Y. and Baluja, S. 2006. Large scale image-based adult-content filtering. In VISAPP (1), 290-296. INSTICC - Institute for Systems and Technologies of Information, Control and Communication.
[4] Chen, R.-C. and Ho, C.-T. 2006. A pornographic web page detecting method based on SVM model using text and image features. International Journal of Internet Protocol Technology 1(4): 264-270.
[5] Page, L., Brin, S., Motwani, R. and Winograd, T. 1999. The PageRank citation ranking: Bringing order to the web.
[6] Kleinberg, J. M. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM 46(5): 604-632.
[7] Castillo, C., Donato, D., Gionis, A., Murdock, V. and Silvestri, F. 2007. Know your neighbors: web spam detection using the web topology. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 423-430. New York, NY, USA: ACM.
[8] Li, X., Wang, Y.-Y. and Acero, A. 2008. Learning query intent from regularized click graphs. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 339-346. New York, NY, USA: ACM.
[9] Nikolaev, K., Zudina, E. and Gorshkov, A. 2009. Combining anchor text categorization and graph analysis for paid link detection. In WWW '09: Proceedings of the 18th international conference on World Wide Web, 1105-1106. New York, NY, USA: ACM.
[10] Szummer, M. and Craswell, N. 2008. Behavioral classification on the click graph. In WWW '08: Proceedings of the 17th international conference on World Wide Web, 1241-1242. New York, NY, USA: ACM.
[11] Deng, H., Lyu, M. R. and King, I. 2009. A generalized Co-HITS algorithm and its application to bipartite graphs. In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 239-248. New York, NY, USA: ACM.