Machine Learning for Automatic Classification of Web Service Interface Descriptions

Similar documents
Controlled Information Maximization for SOM Knowledge Induced Learning

Journal of World s Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012

An Unsupervised Segmentation Framework For Texture Image Queries

Information Retrieval. CS630 Representing and Accessing Digital Information. IR Basics. User Task. Basic IR Processes

Optical Flow for Large Motion Using Gradient Technique

Point-Biserial Correlation Analysis of Fuzzy Attributes

RANDOM IRREGULAR BLOCK-HIERARCHICAL NETWORKS: ALGORITHMS FOR COMPUTATION OF MAIN PROPERTIES

Detection and Recognition of Alert Traffic Signs

IP Network Design by Modified Branch Exchange Method

Segmentation of Casting Defects in X-Ray Images Based on Fractal Dimension

a Not yet implemented in current version SPARK: Research Kit Pointer Analysis Parameters Soot Pointer analysis. Objectives

Frequency Domain Approach for Face Recognition Using Optical Vanderlugt Filters

A Two-stage and Parameter-free Binarization Method for Degraded Document Images

SYSTEM LEVEL REUSE METRICS FOR OBJECT ORIENTED SOFTWARE : AN ALTERNATIVE APPROACH

AUTOMATED LOCATION OF ICE REGIONS IN RADARSAT SAR IMAGERY

A VECTOR PERTURBATION APPROACH TO THE GENERALIZED AIRCRAFT SPARE PARTS GROUPING PROBLEM

An Extension to the Local Binary Patterns for Image Retrieval

Effects of Model Complexity on Generalization Performance of Convolutional Neural Networks

XFVHDL: A Tool for the Synthesis of Fuzzy Logic Controllers

Spiral Recognition Methodology and Its Application for Recognition of Chinese Bank Checks

Illumination methods for optical wear detection

Positioning of a robot based on binocular vision for hand / foot fusion Long Han

ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS

HISTOGRAMS are an important statistic reflecting the

Haptic Glove. Chan-Su Lee. Abstract. This is a final report for the DIMACS grant of student-initiated project. I implemented Boundary

A Recommender System for Online Personalization in the WUM Applications

Comparisons of Transient Analytical Methods for Determining Hydraulic Conductivity Using Disc Permeameters

Communication vs Distributed Computation: an alternative trade-off curve

A ROI Focusing Mechanism for Digital Cameras

Shortest Paths for a Two-Robot Rendez-Vous

Experimental and numerical simulation of the flow over a spillway

Modelling, simulation, and performance analysis of a CAN FD system with SAE benchmark based message set

A Novel Automatic White Balance Method For Digital Still Cameras

Multi-azimuth Prestack Time Migration for General Anisotropic, Weakly Heterogeneous Media - Field Data Examples

INDEXATION OF WEB PAGES BASED ON THEIR VISUAL RENDERING

Lecture 27: Voronoi Diagrams

A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai MIAO, Jian XUE

And Ph.D. Candidate of Computer Science, University of Putra Malaysia 2 Faculty of Computer Science and Information Technology,

A Shape-preserving Affine Takagi-Sugeno Model Based on a Piecewise Constant Nonuniform Fuzzification Transform

Optimal Adaptive Learning for Image Retrieval

ANN Models for Coplanar Strip Line Analysis and Synthesis

IP Multicast Simulation in OPNET

Cardiac C-Arm CT. SNR Enhancement by Combining Multiple Retrospectively Motion Corrected FDK-Like Reconstructions

A Comparative Impact Study of Attribute Selection Techniques on Naïve Bayes Spam Filters

A modal estimation based multitype sensor placement method

Improvement of First-order Takagi-Sugeno Models Using Local Uniform B-splines 1

Clustering Interval-valued Data Using an Overlapped Interval Divergence

Fifth Wheel Modelling and Testing

Color Correction Using 3D Multiview Geometry

Prof. Feng Liu. Fall /17/2016

Transmission Lines Modeling Based on Vector Fitting Algorithm and RLC Active/Passive Filter Design

Parallel processing model for XML parsing

Ranking Visualizations of Correlation Using Weber s Law

FINITE ELEMENT MODEL UPDATING OF AN EXPERIMENTAL VEHICLE MODEL USING MEASURED MODAL CHARACTERISTICS

The EigenRumor Algorithm for Ranking Blogs

BUPT at TREC 2006: Spam Track

The International Conference in Knowledge Management (CIKM'94), Gaithersburg, MD, November 1994.

Input Layer f = 2 f = 0 f = f = 3 1,16 1,1 1,2 1,3 2, ,2 3,3 3,16. f = 1. f = Output Layer

Annales UMCS Informatica AI 2 (2004) UMCS

Slotted Random Access Protocol with Dynamic Transmission Probability Control in CDMA System

View Synthesis using Depth Map for 3D Video

ADDING REALISM TO SOURCE CHARACTERIZATION USING A GENETIC ALGORITHM

POMDP: Introduction to Partially Observable Markov Decision Processes Hossein Kamalzadeh, Michael Hahsler

A Mathematical Implementation of a Global Human Walking Model with Real-Time Kinematic Personification by Boulic, Thalmann and Thalmann.

A Memory Efficient Array Architecture for Real-Time Motion Estimation

Extract Object Boundaries in Noisy Images using Level Set. Final Report

Any modern computer system will incorporate (at least) two levels of storage:

MULTI-TEMPORAL AND MULTI-SENSOR IMAGE MATCHING BASED ON LOCAL FREQUENCY INFORMATION

Module 6 STILL IMAGE COMPRESSION STANDARDS

Desired Attitude Angles Design Based on Optimization for Side Window Detection of Kinetic Interceptor *

MapReduce Optimizations and Algorithms 2015 Professor Sasu Tarkoma

A Generalized Profile Function Model Based on Artificial Intelligence

Effective Missing Data Prediction for Collaborative Filtering

FACE VECTORS OF FLAG COMPLEXES

Linear Ensembles of Word Embedding Models

Towards Adaptive Information Merging Using Selected XML Fragments

Lecture # 04. Image Enhancement in Spatial Domain

Generalized Grey Target Decision Method Based on Decision Makers Indifference Attribute Value Preferences

Adaptation of Motion Capture Data of Human Arms to a Humanoid Robot Using Optimization

On Error Estimation in Runge-Kutta Methods

Methods for history matching under geological constraints Jef Caers Stanford University, Petroleum Engineering, Stanford CA , USA

Simulation and Performance Evaluation of Network on Chip Architectures and Algorithms using CINSIM

Modeling spatially-correlated data of sensor networks with irregular topologies

ART GALLERIES WITH INTERIOR WALLS. March 1998

arxiv: v2 [physics.soc-ph] 30 Nov 2016

A Minutiae-based Fingerprint Matching Algorithm Using Phase Correlation

Directional Stiffness of Electronic Component Lead

Evaluation of Second-order Visual Features for Land-Use Classification

A New Finite Word-length Optimization Method Design for LDPC Decoder

3D Hand Trajectory Segmentation by Curvatures and Hand Orientation for Classification through a Probabilistic Approach

REAL-TIME VISUALIZATION OF WOVEN TEXTILES

Scaling Location-based Services with Dynamically Composed Location Index

5 4 THE BERNOULLI EQUATION

A Study of Semi-Discrete Matrix Decomposition for LSI in Automated Text Categorization

A Consistent, User Friendly Interface for Running a Variety of Underwater Acoustic Propagation Codes

Hierarchically Clustered P2P Streaming System

Conversion Functions for Symmetric Key Ciphers

Assessment of query reweighing, by rocchio method in farsi information retrieval

Topological Characteristic of Wireless Network

New Algorithms for Daylight Harvesting in a Private Office

Transcription:

Machine Leaning fo Automatic Classification of Web Sevice Inteface Desciptions Amel Bennaceu 1, Valéie Issany 1, Richad Johansson 4, Alessando Moschitti 3, Romina Spalazzese 2, and Daniel Sykes 1 1 Inia, Pais-Rocquencout, Fance fistname.lastname@inia.f 2 Univesità degli Studi dell Aquila, Italy omina.spalazzese@di.univaq.it 3 DISI, Univesity of Tento, Italy moschitti@disi.unitn.it 4 Cente fo Language Technology, Univesity of Gothenbug, Sweden ichad.johansson@gu.se Abstact. We ague that automatic classification of web sevice inteface desciption into a pedefined set of categoies can consideably speed up the task of finding compatible web sevices. This can geatly impove to establish automatic connection between netwoked systems. To cay out the classification, we show that techniques deived fom automatic document classification can be successfully applied. We descibe how the standad document classification techniques need to be adapted fo ou task, and finally pesent the esults of a numbe of expeiments in classification of web sevice inteface desciptions. 1 Intoduction Systems inteopeability is becoming moe and moe impotant given the lage incease of both types and use of connecting devices, e.g., smat phones, laptop, tablets, and so on. This fact, along with the lage vaiety of sevices that can be offeed to the use, highly inceases the communication complexity. In this pespective, automated solutions fo establishing inteopeability between the netwoked systems appea to be the only viable appoach to achieve the equied level of flexibility and scalability. Unfotunately, taditional solutions to detemine compatibility between systems ae athe expensive in tems of computational cost, especially when, these ae applied to systems in unelated domains. Indeed, a compatibility assessment equies in-depth analyses consideing the inteface and convesational potocol of the two taget connecting systems. One way to speed up the assessment above is to apply machine leaning methods to automatically classify the high-level functionality of a system based on its inteface desciption. This allows fo esticting the scope of compatibility checks and consequently poviding an oveall pefomance gain when conducting matchmaking between systems. In this pape, we descibe how the inteface desciption classifies ae implemented by applying machine leaning: inducing the classification function fom a set of examples. We descibe how the standad document classification techniques need to be changed in ode to be adapted to wok with inteface desciptions: most impotantly, the featue extaction function needs to pocess the semi-stuctued data that is available in files witten in the Web Sevices Desciption Language (WSDL) 5. 5 http://www.w3.og/tr/wsdl

We caied out a numbe of expeiments that evaluate the effect of seveal design paametes in the implementation of the categoization system, such as the design of the featue extaction function and the choice of machine leaning method. In the eminde of this pape, Section 2 descibes the taget of the automated classifies, i.e., the web sevices desciption files, Section 3 descibes the machine leaning methods we apply, i.e., automated text categoization, specialized fo the taget task, Section 4 pesents ou expeiments on the classification of desciption files, and, finally, Section 5 deives the conclusions. 2 Web Sevice Inteface Desciptions and the WSDL Fomat Web sevices nomally expose a desciption of thei pogammatic inteface (API) using the standad WSDL (Web Sevice Desciption Language 6 ) fomat. This desciption details all the opeations which can be pefomed by the web sevice, and data types (in XML Schema 7 ) that ae inputs o outputs of the opeations. WSDL also pemits the inclusion of human-eadable documentation and has some suppot fo specifying the semantics of opeations (and data) though the ontological annotations suppoted in SA-WSDL 8. Thee is howeve no stuctued suppot fo specifying what the sevice does at a high level, i.e. the abstact categoy to which the sevice belongs. In the Connect poject we define a language fo specifying these categoies by efeence to ontology concepts [2] in ode to facilitate the identification of sevices with simila, compatible functionality. Sevices in the same categoy ae expected to have simila functionality, which may be povided to the envionment, o equied fom it, and which allows the sevices to be composed and inteact. Howeve, legacy sevices do not make use of the explicit Connect desciption language. Without infomation about the high-level functionality of sevices, it is necessay to employ time-consuming syntactic and behavioual analyses that detemine compatibility as a vey fine-gained level of detail. Since Connect aims to ovecome inteopeability issues at untime, computationally expensive pocedues must be avoided. Sevice categoies povide a cheap means to detemine appoximate compatibility, befoe applying moe detailed checks whee necessay. Consequently we need a means to detemine the high-level functionality of a sevice given only the WSDL desciption. Figue 2 shows a patial WSDL desciption fo a weathe sevice, taken fom the web. It lists a numbe of opeations with names such as GetWeatheByZipCode and GetWeatheByPlaceName. Each opeation efes to messages which in tun detemine the input and output data of the opeation (defined elsewhee in the file). Each opeation also includes a shot piece of documentation, although this latte pat is not always pesent. It is howeve obvious that this sevice has functionality elated to the weathe, and when pesented with a taxonomy of categoies a human would likely be able to assign this sevice to one of them: this is the pocess we seek to automate. 3 Applying Document Classification Techniques fo Classification of Web Inteface Desciptions In ode to build automatic classifies of web sevice intefaces, we will build on the consideable amount of eseach that has been caied out on the topic of automat- 6 http://www.w3.og/tr/wsdl 7 http://www.w3.og/xml/schema 8 http://www.w3.og/2002/ws/sawsdl/

<wsdl:pottype name="weathefoecastsoap"> <wsdl:opeation name="getweathebyzipcode"> <wsdl:documentation xmlns:wsdl="http://schemas.xmlsoap.og/wsdl/"> Get one week weathe foecast fo a valid Zip Code (USA) <wsdl:input message="tns:getweathebyzipcodesoapin" /> <wsdl:output message="tns:getweathebyzipcodesoapout" /> </wsdl:opeation> <wsdl:opeation name="getweathebyplacename"> <wsdl:documentation xmlns:wsdl="http://schemas.xmlsoap.og/wsdl/"> Get one week weathe foecast fo a place name (USA) <wsdl:input message="tns:getweathebyplacenamesoapin" /> <wsdl:output message="tns:getweathebyplacenamesoapout" /> </wsdl:opeation> </wsdl:pottype> <wsdl:pottype name="weathefoecasthttpget"> <wsdl:opeation name="getweathebyzipcode"> <wsdl:documentation xmlns:wsdl="http://schemas.xmlsoap.og/wsdl/"> Get one week weathe foecast fo a valid Zip Code (USA) <wsdl:input message="tns:getweathebyzipcodehttpgetin" /> <wsdl:output message="tns:getweathebyzipcodehttpgetout" /> </wsdl:opeation> <wsdl:opeation name="getweathebyplacename"> <wsdl:documentation xmlns:wsdl="http://schemas.xmlsoap.og/wsdl/"> Get one week weathe foecast fo a place name (USA) <wsdl:input message="tns:getweathebyplacenamehttpgetin" /> <wsdl:output message="tns:getweathebyplacenamehttpgetout" /> </wsdl:opeation> </wsdl:pottype> <wsdl:pottype name="weathefoecasthttppost"> <wsdl:opeation name="getweathebyzipcode"> <wsdl:documentation xmlns:wsdl="http://schemas.xmlsoap.og/wsdl/"> Get one week weathe foecast fo a valid Zip Code (USA) <wsdl:input message="tns:getweathebyzipcodehttppostin" /> <wsdl:output message="tns:getweathebyzipcodehttppostout" /> </wsdl:opeation> <wsdl:opeation name="getweathebyplacename"> <wsdl:documentation xmlns:wsdl="http://schemas.xmlsoap.og/wsdl/"> Get one week weathe foecast fo a place name (USA) <wsdl:input message="tns:getweathebyplacenamehttppostin" /> <wsdl:output message="tns:getweathebyplacenamehttppostout" /> </wsdl:opeation> </wsdl:pottype> Fig. 1. WSDL fagment fo a weathe sevice.

ically assigning a categoy tag to a given text document. This is a task with many pactical applications in the eal wold [1]. We give an intoduction to the topic of automatic classification of text documents, and descibe how machine leaning techniques ae typically applied to solve this task. We then show how these techniques can be adapted at vey little effot fo the task of inteface desciption classification. The most significant change fom standad document classification methods is that the featue extacto the function that detemines the attibutes used to distinguish the categoies needs to be tailoed fo the specifics of web sevice inteface desciptions. 3.1 An Intoduction to Automatic Classification of Documents The complexity of the categoy systems may vay depending on the application. The simplest would be a binay classification such as spam filteing. Slightly moe complex categoy system can be seen in tasks such as sentiment classification of eviews [13], whee the task of the classifie would be to pedict the numbe of stas assigned by the eviewe. The lagest categoy systems ae typically hieachically oganized, such as the categoies used in well-known Reutes dataset [9], and we may well imagine even moe advanced categoies such as the stuctued classifications used in libay science [14]. 3.2 Machine Leaning fo Automatic Document Classification The task of categoizing documents is usually tackled by applying classifies that have been automatically induced by estimation on a collection of document. We efe to the pocess of automatically inducing a classification function fom data as machine leaning, and the collection of documents on which the estimation is caied out is called the taining set [10]. We may divide the set of machine leaning methods into two boad categoies: supevised leaning, whee each document in the taining set is associated with a document categoy assigned by a human and the task of the machine leane is to induce a function that poduces simila labelings, and unsupevised leaning, whee the documents ae not labeled a pioi and the machine leaning method must take esponsibility fo finding a meaningful division into categoies. This pape focuses on the fome method, which has geneally been much moe successful in most studies. Thee ae a vey lage vaiety of methods to cay out supevised machine leaning of classifies. Fo classification of documents the most popula leaning methods ae based on the idea of associating classes of documents with egions in a vecto space. Taining a classifie becomes equivalent to descibing the decision suface, the bounday between the egions in the space. In most cases this is descibed using a linea function, so the decision suface becomes a hypeplane. Methods fo inducing the linea sepaato include the well-known pecepton algoithm [15]. The most notable ecent advance in machine leaning by inducing linea sepaato is the suppot vecto machine (SVM) [3], which has been vey successfully applied to the task of document classification [8]. The suppot vecto classification appoach is based on finding the maximal sepaation between the classes the maximal magin. In case the classes ae not fully sepaable, soft magins ae intoduced, which pemit a small numbe of violations of the sepaation constaint. Figue 2 shows an illustation of a soft-magin suppot vecto machine. Note that this is much simple than in ealistic document classification, whee the documents may be epesented as points in a vecto space of millions of dimensions athe than just two. In addition to the simple task of assigning a categoy label to a document, simila machine leaning techniques can also be applied in vey complex categoization

ξi xi w x + b = 1 w w x + b = 1 1 1 w x + b = 0 Fig. 2. Example of the decision bounday (dashed line) and the magins in a soft-magin suppot vecto machine. Thee ae two violations of the magin constaints. tasks. An impotant type of such categoization tasks ae those equiing a classification of a pai of documents, fo instance detemining whethe a text is a good answe to a given question [11]. 3.3 Extaction of Featues In ode to be able to assign objects into categoies, an automatic classifie needs to detemine the salient popeties of those objects. This pocess is called featue extaction. A good featue extacto should extact exactly those featues that make it easy to distinguish the categoies. As it has been noted epeatedly, designing featue extactos is an at moe than a science, and equies a good undestanding of the classification task. Fo the task of document classification, the most common featue extaction method is called the bag-of-wods epesentation [16]. In this method, a document is conveted into a point in a vecto space; evey wod in the vocabulay is associated with a dimension of the vecto space, allowing the document to be mapped into the vecto space simply by computing the occuence fequencies of each wod. The bag-of-wods epesentation is consideed the standad epesentation undelying most document classification appoaches, and attempts to incopoate moe complex stuctual infomation have mostly been unsuccessful fo the task of categoization of single documents [12]. 3.4 Featue Extaction in Web Sevice Inteface Desciptions WSDL documents geneally contain a significant amount of text, and this text can be diectly pocessed using standad a bag-of-wods epesentation method. Howeve, the most infomative pat of the WSDL inteface desciptions nomally

consists of stuctued data: names of methods, objects, paametes. These elements ae epesented as a tee-stuctued XML stuctue. It should be noted that vey little textual documentation may be available if the identifies ae infomatively named and thei pupose obvious. This means that the vaious semi-stuctued identifies that ae pat of the WSDL inteface desciption XML documents should be added to the bag-of-wods featue epesentation. Most impotantly, the epesentation should include the names of the method and input paametes defined by the inteface. The inclusion of identifies will be impotant since: (1) the textual content of the identifies is often highly infomative of the functionality povided by the espective methods; and (2) the fee text documentation is not mandatoy and may not always be pesent. To extact useful wod tokens fom the identifies, we split them into pieces based on the pesence of undescoes o CamelCase. All tokens wee then nomalized to lowecase. Fo instance, conside the following piece of WSDL code. <wsdl:message name="getweathebyzipcodesoapin"> <wsdl:pat name="paametes" element="tns:getweathebyzipcode" /> </wsdl:message> <wsdl:message name="getweathebyzipcodesoapout"> <wsdl:pat name="paametes" element="tns:getweathebyzipcoderesponse" /> </wsdl:message> In this example, we split the CamelCased identifie GetWeatheByZipCode into the tokens get, weathe, by, zip, and code, so the complete bag-of-wods vecto fo the example will be [get:4, weathe:4, by:4, zip:4, code:4, soap:2, in:1, out:1, esponse:1]. 4 Expeiments We caied out a lage numbe of expeiments to find the best way to implement the categoize of web sevice inteface desciptions. As descibed in Sec. 3.2, we implemented the classifies by automatically inducing them fom labeled data. Fo this pupose, we used a collection of WSDL documents 9 [7]. We selected the 10 most fequent categoies, in total 397 documents. To tain the classifies, we used suppot vecto machines that we tained using the LibLinea machine leaning softwae [6]. The following subsections descibe the expeiments. All esults have been obtained using a 10-fold coss-validation pocedue: split the data into 10 pieces; fom 10 diffeent taining sets by excluding each piece; tain 10 classifies; evaluate on each piece and combine the esult. 4.1 Classification Results To evaluate the pefomance of the classifies, we computed a numbe of diffeent evaluation meaues. The fist one is the oveall classification accuacy, which is 9 http://www.andeas-hess.info/pojects/annotato/ws2003.html

Categoy Numbe of instances CountyInfo 64 Money 54 Convete 49 Finde 46 Communication 45 Web 39 Developes 37 News 30 Business 23 Mathematics 10 Total 397 Table 1. Statistics fo the data collection. defined as the popotion of coectly classified inteface desciptions. Ou best implementation coectly classified 236 out of 397 desciptions, giving us an accuacy of 59.4%. In addition to the oveall accuacy, we evaluated the classification pefomance on individual categoies. Hee, we used the pecision and ecall measues. Fo a categoy C, we define the pecision is defined as the numbe the classifie coectly pedicted C divided by the numbe of times it pedicted C at all. Convesely, we define the ecall as the numbe of coectly pedicted C divided by the numbe of C in the dataset. Finally, it is vey common to pesent the hamonic mean of pecision and ecall, which is efeed to as the F -measue. In most situations, thee is a tadeoff elationship between the pecision and ecall measues: if we often pedict C, then we would also find many C (highe ecall) but also ovegeneate (lowe pecision). By vaying a class sensitivity paamete when taining, we can tune the pecision/ecall tadeoff, and plot the elationship in a gaph. Figue 3 shows such plots fo the fou lagest classes of intefaces: CountyInfo, Money, Convete and Finde. In this kind of plot, oveall pediction quality fo a class is detemined by how close the plot is to the uppe ight cone. In ou case, we see that the oveall pediction quality fo the fou classes tends to be coelated with the size of the class: CountyInfo class is the class fo which the plot is closest to the uppe ight cone. In addition to the classwise pecision and ecall evaluations, we computed the maco pecision and ecall, which ae pecision and ecall values aveaged ove all classes. In this maco evaluation, ou classifie achieved a pecision of 58.0, a ecall of 52.8, and an F -measue of 55.3. 4.2 The Effect of the Design of the Featue Extacto The challenge in inteface desciption classification compaed to taditional document classification is the featue design poblem, and in paticula the poblem of making use of the WSDL stuctue. As a baseline, we used taditional bag-of-wod featue extactos that ae nomally used in document classification. We applied the baseline featue extactos to the text available in the documents: the code documentation and the comments. As an altenative to the aw text, we extacted featues fom the stuctued text, i.e. the identifies used in the WSDL code. As discussed in Sec. 3.4, we used an identifie splitting heuistic based on the pesence of CamelCase. Howeve, we also tied out an identifie-based featue epesentation that did not split the identifies. Finally, we used a featue epesentation that combined the BOW and WSDL identifie featues. Hee, we evaluated two diffeent epesentations. In the fist

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.0 CountyInfo 1.0 Money 0.8 0.8 0.6 0.6 Recall Recall 0.4 0.4 0.2 0.2 0.0 Pecision 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Pecision 1.0 Convete 1.0 Finde 0.8 0.8 0.6 0.6 Recall Recall 0.4 0.4 0.2 0.2 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Pecision 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Pecision Fig. 3. Pecision / ecall plots fo the fou most fequent classes: CountyInfo, Money, Convete and Finde. Repesentation Accuacy Maco pecision Maco ecall Maco F-measue BOW (doc. only) 24.9 39.5 28.4 33.0 BOW (doc. and comments) 25.2 36.1 27.8 31.5 WSDL Identifies 41.6 33.7 32.8 33.2 WSDL Identifies, CC splitting 58.2 54.8 51.6 53.2 BOW + Indentifies (sepaate) 57.9 55.1 51.1 53.0 BOW + Indentifies (mixed) 59.4 58.0 52.8 55.3 Table 2. Evaluation of featue extaction methods. appoach we used two sepaate vecto spaces fo the two types of featues. In the second one, thee was only one vecto space, so fo instance if the wod withdaw would appea in the documentation o in an identifie, this would esult in the same featue being enabled. To see the effect of the featue extaction design choices, we caied out an evaluation of seveal diffeent featue extactos. The esult of this expeiment is shown in Table 2. We show the oveall classification pefomance using accuacy and maco pecision/ecall/f -measue. The expeiment shows vey clealy that puely text-based featue epesentations ae not sufficient fo achieving a good classification pefomance. It is cucial to extact featues fom the WSDL identifies, and they need to be split in ode to be useful. The best epesentation was the mixed combination of BOW and identifie featues; the combination by sepaate vecto spaces seems to lead to featue spasity, which is vey poblematic when using small taining sets.

Leaning method Accuacy Maco pecision Maco ecall Maco F-measue Binaized L2-SVM 59.4 58.0 52.8 55.3 Binaized L1-SVM 58.9 55.3 49.5 52.3 Multiclass SVM 59.4 55.0 52.9 53.9 Logistic egession 49.9 47.1 39.3 42.9 Pecepton 51.1 51.0 45.6 48.1 Passive aggessive 58.4 53.3 51.4 52.4 Table 3. Evaluation of machine leaning methods. 4.3 The Choice of Machine Leaning Algoithm In addition to the featue extaction mechanism, anothe cucial paamete when building a classifie is the selection of a machine leaning algoithm fo taining. We evaluated a wide ange of leaning algoithms. Ou pimay appoach was suppot vecto machines (SVMs) since this algoithm has been vey successful in text categoization [8]. We used seveal vaiations of the SVM leaning algoithm. The oiginal SVM fomulation [3] was esticted to two-class classification. This means that if we want to apply SVMs to classification poblems with moe than two classes, we must apply a tanfomation of the poblem a binaization. The most common binaization method is called one-vesus-all: ceate one SVM fo each class, and select the highest-scoing class at test time. We tied two binaized SVM vaiants: L 2 -loss and L 1 -loss SVMs; L 2 -loss is the standad fomulation and L 1 -loss is a newe vaiant that typically uses much smalle featue sets [17]. In addition to the binaized SVM vaiants, we used a ecent SVM fomulation that can solve multiclass poblems diectly [5]. Apat fom the SVMs, we evaluated classifies tained using the logistic egession, pecepton [15], and passive aggessive [4] leaning methods. All SVMs, as well as the logistic egession classifies, wee implemented using LibLinea. The pecepton and passive aggessive classifies wee based on ou own implementations. Table 3 shows the esult of the machine leaning method compaison. We notice that all SVM vaiants outpefom the othe leaning algoithms. The bestpefoming SVM is the most commonly used vaiant: the L 2 -loss SVM with standad binaization. 4.4 Geneating Multiple Hypotheses The pupose of a categoize of web sevices is to educe the time it takes to decide whethe we can automatically connect two diffeent web sevices. While we would cetainly like to have a classifie that pefecly pedicts the coect class, this is of couse not ealistic; howeve, if a classifie that is allowed to make multiple pedictions, the pobability of the coect class being found is much highe. Such a classifie would also save consideable connection time. We measued the pefomance of a classifie that is allowed to output k diffeent categoy labels. In this evaluation, a pediction is counted as coect if the set of guesses contains the coect answe. The elationship between the k value and the pefomance is shown in Table 4. We see that even with a small k, we can get vey high classification pefomance.

k Accuacy Maco pecision Maco ecall Maco F-measue 1 59.4 58.0 52.8 55.3 2 71.5 71.0 65.6 68.2 3 79.3 79.1 75.0 77.0 4 85.1 85.1 81.3 83.2 5 87.7 88.0 83.9 85.9 Table 4. Classification pefomance when pedicting k possible hypotheses. 5 Conclusion Thee is a consideable need fo automatic composition of web sevices. The Connect poject addesses, in paticula, inteopeability issues aising fom the need to compose heteogeneous systems at untime. The fist step in composing such systems is detemining whethe, and to what degee, the systems ae compatible. At the highest level of abstaction, systems in the same domain categoy, e.g. weathe, have the potential to inteact. Automatic classification of web sevice inteface desciptions is a technique that can speed up the sevice matching pocedue consideably by allowing us to avoid expensive behavioal analyses that encumbe the untime composition of sevices. We descibed how methods deived fom automatic document classification based on machine leaning can be used to build categoizes of web sevice inteface desciptions. We caied out a numbe of expeiments in automatic categoization of inteface desciptions, to detemine the best way to implement an automatic system to cay out this kind of categoization. We evaluated the effect of the choice of machine leaning method, and we saw that suppot vecto machines gave the best pefomance. Most impotantly, we evaluated how the pefomance is influenced by the design of the featue extaction component of the classifie. We saw vey clealy that standad document classification methods ae not diectly applicable: such an appoach leads to vey low pefomance. Instead, we need to use a featue epesentation that is tailoed to the task of inteface desciption classification by using the specific stuctue of the WSDL code, in paticula its identifies. We saw that a classifie that pedicts a numbe of possible altenatives (not just one) achieves vey high pefomance levels. We believe that an appoach using multiple hypothesis can also be useful fo sevice matching. Acknowledgements This eseach has been suppoted by the EU FP7 pojects: Connect Emegent Connectos fo Etenal Softwae Intensive Netwoking Systems (poject numbe FP7 231167), EtenalS Tustwothy Etenal Systems via Evolving Softwae, Data and Knowledge (poject numbe FP7 247758) and by the EC Poject, LiMo- SINe Linguistically Motivated Semantic aggegation engines (poject numbe FP7 288024). Refeences 1. R. Basili and A. Moschitti. Automatic Text Categoization: fom Infomation Retieval to Suppot Vecto Leaning. Aacne editice, Rome, Italy, 2005.

2. A. Bennaceu, G. S. Blai, F. Chauvel, N. Geogantas, P. Gace, V. Nundloll, M. Paolucci, R. Saadi, and D. Sykes. Intemediate connect achitectue. Technical Repot D1.2, Connect ICT FET IP Poject, Febuay 2011. 3. B. E. Bose, I. Guyon, and V. Vapnik. A taining algoithm fo optimal magin classifies. In Computational Leaning Theoy, pages 144 152, Pittsbugh, United States, 1992. 4. K. Camme, O. Dekel, J. Keshet, S. Shalev-Schwatz, and Y. Singe. Online passiveaggessive algoithms. Jounal of Machine Leaning Reseach, 2006(7):551 585, 2006. 5. K. Camme and Y. Singe. On the algoithmic implementation of multiclass kenelbased vecto machines. Jounal of Machine Leaning Reseach, 2001(2):265 585, 2001. 6. R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinea: A libay fo lage linea classification. Jounal of Machine Leaning Reseach, 9:1871 1874, 2008. 7. A. Heß and N. Kushmeick. Leaning to attach semantic metadata to web sevices. In Intenational Semantic Web Confeence, pages 258 273, 2003. 8. T. Joachims. Leaning to Classify Text using Suppot Vecto Machines. Kluwe/Spinge, Boston, 2002. 9. D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmak collection fo text categoization eseach. Jounal of Machine Leaning Reseach, 5:361 397, 2004. 10. T. M. Mitchell. Machine Leaning. McGaw-Hill, 1997. 11. A. Moschitti. Kenel methods, syntax and semantics fo elational text categoization. In Poceedings of ACM 17th Confeence on Infomation and Knowledge Management (CIKM), Napa Valley, United States, 2008. 12. A. Moschitti and R. Basili. Complex linguistic featues fo text classification: A compehensive study. In Poceedings of the 26th Euopean Confeence on Infomation Retieval Reseach (ECIR 2004), pages 181 196, Sundeland, United Kingdom, 2004. 13. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine leaning techniques. In Poceedings of EMNLP, 2002. 14. S. R. Ranganathan. Colon Classification. Ess Ess Publications, Delhi, 2006. 15. F. Rosenblatt. The Pecepton: A pobabilistic model fo infomation stoage and oganization in the bain. Psychological Review, 65(5):386 408, 1958. 16. G. Salton, A. Wong, and C. S. Yang. A vecto space model fo automatic indexing. Technical Repot TR74-218, Depatment of Compute Science, Conell Univesity, Ithaca, New Yok, 1974. 17. B. Schölkopf and A. J. Smola. Leaning with Kenels: Suppot Vecto Machines, Regulaization, Optimization, and Beyond. MIT Pess, 2002.