Semantic Web Mining Diana Cerbu
Contents
- Semantic Web
- Data mining
- Web mining
  - Content web mining
  - Structure web mining
  - Usage web mining
- Semantic Web Mining
Semantic Web
"The Semantic Web is a vision: the idea of having data on the web defined and linked in a way that it can be used by machines not just for display purposes, but for using it in various applications." [Tim Berners-Lee]
Semantic Web Layer Cake
Semantic Web Apps
- search engines: Hakia, TrueKnowledge, Powerset, Spock
- Firefox extensions: Gnosis
- TripIt
Gnosis
Data mining
[Figure: Fig 3. Overview of the steps constituting the KDD process]
"Data mining is the semi-automatic extraction of patterns, changes, associations, anomalies, and other statistically significant structures from large data sets." - R. Grossman
Data mining tasks
- Decision Trees
- Naïve Bayes
- Neural Networks
- Association Rules
- Clustering
Web mining
- the process of discovering patterns and relations in Web data
- applies data mining techniques to the web
- 3 areas can be distinguished:
  - Web content mining
  - Web structure mining
  - Web usage mining
Why web mining?
- the internet has been constantly increasing in usage and popularity
- web pages: over 800 million (2000)
- HTML pages: ~6 TB of data
- every day: ~1 million pages are added
- every month: hundreds of GB worth of changes to existing pages
- 2006-2007: over 60 million domains were registered (as many as in 1995-2005)
Web content mining
- is mostly a form of text mining (the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources)
- takes advantage of the semi-structured form of HTML and XML pages (as opposed to the structured form of databases) to extract knowledge
- can be used to detect co-occurrences of terms in texts
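Detecting co-occurrences of terms can be sketched in a few lines. This is a minimal illustration, not a method from the sources: the page texts are invented, and a "term" is assumed to be any lowercase run of ASCII letters.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(texts):
    """Count, for each pair of distinct terms, the number of texts containing both."""
    counts = Counter()
    for text in texts:
        # Each term counts once per text; sorting gives every pair a fixed order.
        terms = sorted(set(re.findall(r"[a-z]+", text.lower())))
        counts.update(combinations(terms, 2))
    return counts


# Hypothetical page texts, invented for illustration.
pages = [
    "hotel with golf course",
    "golf course near the lake",
    "hotel near the lake",
]
pair_counts = cooccurrences(pages)
print(pair_counts[("course", "golf")])  # 2
print(pair_counts[("golf", "hotel")])   # 1
```

Pairs are stored in alphabetical order, so a lookup has to use ("course", "golf") rather than ("golf", "course").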
Web structure mining
- describes the organization of the content within the website
- includes the organization inside a web page, internal/external links and the site hierarchy
- Google's PageRank algorithm ranks a website on the basis of how many other sites link to it (and how highly those sites are ranked themselves)
- used to identify information hubs
- used to derive models in order to predict the popularity of a website
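The link-based ranking idea can be sketched as a simplified PageRank computed by power iteration. This is a textbook-style approximation, not Google's production algorithm; the damping factor 0.85, the iteration count, and the four-page link graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share, independent of the links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```

Page c comes out on top because three pages link to it; page a ranks second although only one page links to it, because that one page is the highly ranked c.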
Web usage mining
- describes the use of websites, as reflected in a web server's access log, as well as in logs for specific applications
- semantics created by usage
- identification of people with the same interests: "People who liked/bought this book also looked at..."
- online catalog: users interested in product A are also interested in product B
Usage web mining
- the frequency of a file in a web log reveals knowledge, such as: pages not of interest / pages of much interest
- result: reorganized site structure (not automated)
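Counting how often each file appears in an access log can be sketched as follows. The log lines are invented and assumed to follow the Common Log Format, where the request sits between the first pair of double quotes; real logs may need a more careful parser.

```python
from collections import Counter


def page_hits(log_lines):
    """Count GET requests per path in Common Log Format lines."""
    hits = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]  # e.g. 'GET /index.html HTTP/1.1'
            method, path, _protocol = request.split()
        except (IndexError, ValueError):
            continue  # skip lines that do not look like a request
        if method == "GET":
            hits[path] += 1
    return hits


# Hypothetical log lines, invented for illustration.
log = [
    '10.0.0.1 - - [01/Jan/2008:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2008:10:00:05 +0000] "GET /golf.html HTTP/1.1" 200 2048',
    '10.0.0.1 - - [01/Jan/2008:10:01:00 +0000] "GET /golf.html HTTP/1.1" 200 2048',
    '10.0.0.3 - - [01/Jan/2008:10:02:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.3 - - [01/Jan/2008:10:02:30 +0000] "GET /contact.html HTTP/1.1" 200 512',
    '10.0.0.4 - - [01/Jan/2008:10:03:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    'malformed line without a request',
]
hits = page_hits(log)
print(hits.most_common(1))  # [('/index.html', 3)]
```

Pages with very low counts are candidates for "not of interest"; deciding how to reorganize the site on that basis remains a manual step.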
Semantic Web mining
- take a set of Web pages from a site and improve them for both human and machine users
- generate metadata that reflect a semantic model underlying the site
- identify patterns both in the pages' text and in their usage
- improve information architecture and page design
Steps
- employ mining methods on Web resources
- generate mining structure
- employ mining methods on the resulting semantically structured Web resources
- generate further structure
- at the end, feed the results back into the design of the Web pages themselves (visible to human users) and into the metadata and the underlying ontology (visible to machine users)
Ontology
- provides the opportunity of representing arbitrary worlds
- includes a set of concepts, a hierarchy on them, and (n-ary) relations between concepts
- two types of ontologies:
  - the 1st uses a small number of relations between concepts, e.g. Yahoo!
  - the 2nd is rich in relations but has a rather limited description of each concept, usually just a short text, e.g. WordNet
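The three ingredients named above (concepts, a hierarchy, n-ary relations) can be held in a very small data structure. This is a sketch only: the class and its method names are hypothetical and do not follow any ontology language or library, and each concept is assumed to have at most one direct super-concept.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set = field(default_factory=set)
    parent_of: dict = field(default_factory=dict)  # concept -> direct super-concept
    relations: set = field(default_factory=set)    # tuples (name, concept, concept, ...)

    def add_concept(self, name, parent=None):
        """Register a concept and, optionally, its direct super-concept."""
        self.concepts.add(name)
        if parent is not None:
            self.concepts.add(parent)
            self.parent_of[name] = parent

    def add_relation(self, name, *concepts):
        """Register an n-ary relation between already known or new concepts."""
        self.concepts.update(concepts)
        self.relations.add((name, *concepts))

    def is_a(self, concept, ancestor):
        """Walk up the hierarchy to test whether concept falls under ancestor."""
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parent_of.get(concept)
        return False


onto = Ontology()
onto.add_concept("hotel", parent="accommodation")
onto.add_relation("belongs_to", "golf course", "hotel")
print(onto.is_a("hotel", "accommodation"))  # True
```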
Ontology learning
The ontology is filled
Knowledge base is mined
Association Rules
- combination of knowledge about instances, like the Wellnesshotel and its Sea View golf course, and knowledge derived from the Web pages' texts
- hotels with golf courses often have five stars
- (confidence, support) = (89%, 0.4%)
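Confidence and support of a rule such as "golf course => five stars" can be computed as below. The five hotel records are invented, so the resulting numbers differ from the (89%, 0.4%) on the slide; support is taken as the share of all records that contain both sides of the rule, confidence as the share of records with the left side that also contain the right side.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (confidence, support) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return confidence, support


# Hypothetical hotel descriptions, one set of properties per hotel.
hotels = [
    {"golf course", "five stars"},
    {"golf course", "five stars"},
    {"golf course", "four stars"},
    {"five stars"},
    {"pool", "three stars"},
]
confidence, support = rule_metrics(hotels, {"golf course"}, {"five stars"})
print(f"{confidence:.0%}, {support:.0%}")  # 67%, 40%
```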
Clustering
- use web document clustering techniques to improve search engine results (i.e. the search results better reflect the term/s sought)
- identify a cluster of users who visit and closely examine the pages of the Wellnesshotel, the Palacehotel, and the Starhotel
- "you might also want to look at ..."
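Grouping users by the pages they visit can be sketched with a simple similarity threshold. This is one possible technique (single-link grouping on Jaccard similarity), not necessarily the one used in the sources; the users, the threshold 0.5, and the two hostel pages are invented.

```python
def jaccard(a, b):
    """Share of visited pages that two users have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_users(visits, threshold=0.5):
    """Single-link grouping: a user joins every cluster holding a similar user."""
    clusters = []
    for user in sorted(visits):
        linked = [
            c for c in clusters
            if any(jaccard(visits[user], visits[member]) >= threshold for member in c)
        ]
        # Merge the user with all clusters it is linked to.
        merged = {user}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters


# Hypothetical visit data: user -> set of hotel pages examined.
visits = {
    "u1": {"Wellnesshotel", "Palacehotel", "Starhotel"},
    "u2": {"Wellnesshotel", "Palacehotel"},
    "u3": {"Wellnesshotel", "Starhotel"},
    "u4": {"Cityhostel"},
    "u5": {"Cityhostel", "Airporthostel"},
}
clusters = cluster_users(visits)
print(sorted(sorted(c) for c in clusters))
# [['u1', 'u2', 'u3'], ['u4', 'u5']]
```

A recommendation of the "you might also want to look at" kind could then offer u2 the Starhotel page, since other members of the same cluster examined it.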
Redesigning
- goal: introduce a new category "golf hotels"
- all hotels for which there is a golf course that belongs to the hotel become instances of the new category
- site and page design are modified by adding a new value for the search criterion "hotel facilities", corresponding to the newly added category
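Filling the new category can be sketched as a simple derivation over the knowledge base. The data layout (facility name, kind, owning hotel) and the "Lake Spa" entry are assumptions for illustration; only the Wellnesshotel with its Sea View golf course comes from the slides.

```python
def golf_hotels(hotels, facilities):
    """Return the hotels that own at least one facility of kind 'golf course'."""
    return sorted(
        {owner for _name, kind, owner in facilities
         if kind == "golf course" and owner in hotels}
    )


hotels = {"Wellnesshotel", "Palacehotel", "Starhotel"}
# (facility name, kind, hotel it belongs to); "Lake Spa" is hypothetical.
facilities = [
    ("Sea View", "golf course", "Wellnesshotel"),
    ("Lake Spa", "spa", "Palacehotel"),
]

new_category = golf_hotels(hotels, facilities)
print(new_category)  # ['Wellnesshotel']

# The search form gains a matching value for the "hotel facilities" criterion.
hotel_facility_values = ["spa"]
if new_category:
    hotel_facility_values.append("golf course")
```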
Benefits
- input: a page of a site describes the Palacehotel in Zürich
- hotel is a subclass of accommodation
- Zürich is located in Switzerland
- search for accommodation in Switzerland
- result: Palacehotel
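The inference behind this example can be sketched with three small lookup tables. The facts are the ones on the slide; the table and function names are hypothetical, and "located in" is assumed to be transitive.

```python
subclass_of = {"hotel": "accommodation"}
instance_of = {"Palacehotel": "hotel"}
located_in = {"Palacehotel": "Zürich", "Zürich": "Switzerland"}


def reaches(start, target, edges):
    """Follow edges upward from start and report whether target is reached."""
    current = start
    while current is not None:
        if current == target:
            return True
        current = edges.get(current)
    return False


def search(kind, region):
    """Find instances whose class falls under kind and whose location lies in region."""
    return sorted(
        name for name, cls in instance_of.items()
        if reaches(cls, kind, subclass_of) and reaches(name, region, located_in)
    )


print(search("accommodation", "Switzerland"))  # ['Palacehotel']
```

A plain keyword search for "accommodation" and "Switzerland" would miss the page, because neither word has to appear on it; the match comes from the subclass and location knowledge.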
Q&A
Links
- http://www.hakia.com/
- http://www.powerset.com/
- http://www.trueknowledge.com/
- http://www.spock.com/
- http://www.tripit.com/
- https://addons.mozilla.org/en-US/firefox/addon/3999
- http://wordnet.princeton.edu/
Bibliography
- Web mining: From web to Semantic Web, Bettina Berendt, Andreas Hotho
- Towards Semantic Web Mining, Bettina Berendt, Andreas Hotho