BreXLiMe: A Semantically Enriched Brexit Dataset Using Cross-Lingual and Cross-Media Knowledge Extraction


Lei Zhang (1), Michael Färber (2), Steffen Thoma (3), Maribel Acosta (3), and Achim Rettinger (3)

(1) FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Germany, lei.zhang@fiz-karlsruhe.de
(2) University of Freiburg, Germany, michael.faerber@cs.uni-freiburg.de
(3) Karlsruhe Institute of Technology (KIT), Germany, {steffen.thoma, acosta, rettinger}@kit.edu

Abstract. The withdrawal of the UK from the EU, known as Brexit, has already been the subject of a number of sociological and economic studies using data from social media, especially Twitter. However, comprehensive studies based on datasets that cover different media and languages are still missing. In this paper, we introduce such a dataset, collected from various media sources (including online news sites, social media and live TV) in three languages (English, German, and Spanish) and semantically enriched with annotations of both entities and categories from DBpedia. The dataset is provided online in the RDF serialization formats Turtle and N-Triples and can serve as a data basis for applications and studies on Brexit in various disciplines.

Keywords: semantic annotation, semantic integration, semantic search, cross-lingual, cross-modality, Brexit.

1 Introduction

The outcome of the Brexit referendum held on June 23, 2016 in the UK was unclear until the last moment and, hence, thrilling. In retrospect, besides the pure result (exiting the EU), the case of Brexit is particularly interesting and relevant from a sociological, political and psychological perspective.
For instance, psychologists and sociologists are interested in how topics and opinions spread in public discussions (as transmitted via public media such as news articles, social media, and TV shows) and how those topics and opinions relate to specific items such as public figures (e.g., David Cameron) and subjects of public concern (e.g., Euroscepticism). Also, with the ongoing discussion about fake media, there is interest in investigating how coverage differs between traditional media (news and TV) and social media. Recently, many approaches have been proposed for prediction and analysis concerning Brexit. Most related work is based on data from Twitter [1-4] and

other social media platforms [5-7]; Celli et al. [7] deal with multilingual social media data, but all other work addresses only the English language. To facilitate a more comprehensive investigation of Brexit, there is a pressing need for datasets containing multilingual data from various multimedia sources. Moreover, existing analyses of the Brexit use case draw on the most frequently mentioned words, topics, themes, sentiments and URLs; however, none of them uses explicit semantics from knowledge bases for semantic processing. In this regard, we present the BreXLiMe dataset, which has been collected from multilingual data streams regarding Brexit in different media channels and further semantically enriched by the Cross-Lingual and Cross-Media knowledge extraction technologies developed in our xlime project [8]. To summarize, the main contributions of this work are threefold:

1. The BreXLiMe dataset provides a semantically enriched collection of media coverage on the Brexit referendum from different media sources (including online news sites, social media and live TV streams) in three languages (English, German and Spanish). To the best of our knowledge, this is the first dataset regarding Brexit that contains multilingual media data from several channels. Besides the pure meta-information (such as publication date and source), the media content has been semantically enriched with DBpedia [9] entities and categories. The dataset is available online in the RDF formats Turtle and N-Triples and is licensed under CC BY-NC-SA 4.0. Due to licensing restrictions, we cannot provide the full content of the media items; it remains accessible via the provided links.

2. The BreXLiMe development architecture shows a methodology for creating such a dataset that is capable of cross-lingual and cross-media knowledge extraction from media streams (see Sec. 2) as well as a data model based on RDF that is suited to describing and integrating knowledge in different media and languages (see Sec. 3). This architecture provides a general way of semantic data processing and can easily be adapted to other use cases on request.

3. The BreXLiMe dataset has been used to support BreXearch, our semantic search system focused on the case of Brexit, which enables cross-lingual and cross-media Brexit data retrieval and analytics (see Sec. 4). Beyond that, we believe that this dataset can also benefit many other applications and studies. In particular, it allows researchers to investigate differences and commonalities among these media channels and languages.

2 Development Methodology

The development architecture of BreXLiMe is illustrated in Fig. 1. In this section, we briefly introduce the media sources used and the main development modules: data filtering, entity linking and entity-based text categorization.

Fig. 1. BreXLiMe development architecture.

2.1 Media Sources

Within the context of our xlime project, we first extracted textual data from different media sources, where three main data providers (JSI Newsfeed, VICO and Zattoo) have delivered the multimedia data as streams:

JSI Newsfeed provides a real-time aggregated multilingual stream of news articles from around 75,000 news websites across the world, such as The New York Times, Bloomberg News and Spiegel Online.

VICO harvests large amounts of social media data in multiple languages, not only from large social networks like Twitter, Facebook, Google+, and YouTube, but also from a broad spectrum of forums, blogs and review sites.

Zattoo provides live TV streams consisting of video frames and audio for around 150 multilingual channels, such as CNN International, BBC World, N24 and Tagesschau24.

The textual data is then extracted from these media sources. For the video and audio streams of the TV content, this involves preprocessing tools that convert image and speech to text, i.e., optical character recognition (OCR) and automatic speech recognition (ASR).

2.2 Data Filtering

Based on the multilingual data streams extracted from the above media sources, the first development module, data filtering, aims to collect a custom dataset of media items that are related to the Brexit referendum held on June 23, 2016.

Table 1. Statistics about the BreXLiMe dataset.

          News                         Social Media                 TV
          #Items  #Entities  #Cate.    #Items  #Entities  #Cate.    #Items  #Entities  #Cate.
English   166.4K  7.04M      832K      10.48M  30.4M      52.4K             K          3290
German    50.2K   1.21M      251K      426.7K  1.03M      2.13M             K          1120
Spanish   27.9K   856K       139.5K    1.06M   2.34M      5.3M
Total     244.5K  8.25M      1.22M     11.97M  33.77M     59.85M            K          4410

As part of the xlime project, the dataset was gathered in June 2016 by adding a set of filters on the extracted multilingual data streams. Based on the xlime use cases, we first limited the newsfeed and social media streams to three languages (English, German and Spanish) and selected a subset of Zattoo's available TV streams, which cover both English and German. In addition, we implemented a subscription service that allows us to filter the full streams with queries. The filter terms are related to the Brexit referendum, such as Brexit, UK and EU, which resulted in around 240 thousand news articles, 12 million microposts and 900 TV programs in a month. Statistics on the media items in the dataset are shown in Table 1. The mechanism for adding new filters to these media streams is straightforward and can easily be adapted to other use cases on request.

2.3 Cross-lingual Entity Linking

As the BreXLiMe dataset serves for analyzing how topics and opinions about Brexit spread in public discussions and how they are related to entities of public interest, the cross-lingual entity linking module aims to detect not only named entities (e.g., David Cameron) but also nominal entities (e.g., Prime Minister of the United Kingdom) in the multilingual text extracted from the media items, using DBpedia as the knowledge base. To match words and phrases in different languages against DBpedia entities, we have built cross-lingual linked data lexica, called xlid-lexica, by exploiting multilingual Wikipedia to extract the cross-lingual groundings of entities.
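The kind of lookup such a lexicon supports can be sketched as follows. This is a minimal illustration in Python, not the xlid-lexica implementation: the dictionary layout, the `normalize` and `candidates` helpers, and all entries and counts are assumptions made up for the example. It maps a surface form in a given language to candidate DBpedia entities ranked by relative frequency.

```python
DBR = "http://dbpedia.org/resource/"

# Hypothetical toy lexicon: (language, normalized surface form) -> {entity URI: count}.
# Entries and counts are invented for illustration only.
TOY_LEXICON = {
    ("en", "david cameron"): {DBR + "David_Cameron": 120},
    ("en", "prime minister"): {
        DBR + "Prime_Minister_of_the_United_Kingdom": 40,
        DBR + "Prime_minister": 60,
    },
    ("de", "premierminister"): {
        DBR + "Prime_Minister_of_the_United_Kingdom": 30,
        DBR + "Prime_minister": 10,
    },
    ("es", "primer ministro"): {
        DBR + "Prime_minister": 15,
        DBR + "Prime_Minister_of_the_United_Kingdom": 5,
    },
}


def normalize(surface_form):
    """Lowercase the surface form and collapse runs of whitespace."""
    return " ".join(surface_form.lower().split())


def candidates(surface_form, lang, lexicon=TOY_LEXICON):
    """Return (entity URI, relative frequency) pairs, most frequent first.

    An unknown surface form yields an empty list.
    """
    counts = lexicon.get((lang, normalize(surface_form)), {})
    total = sum(counts.values())
    # Sort by descending count, then by URI to make ties deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(uri, count / total) for uri, count in ranked]


if __name__ == "__main__":
    for lang, mention in [
        ("en", "Prime Minister"),
        ("de", "Premierminister"),
        ("es", "primer ministro"),
    ]:
        print(lang, mention, candidates(mention, lang))
```

Because the key includes the language, the same DBpedia entity can be reached from surface forms in English, German or Spanish. The relative frequency is only one plausible signal for ranking candidates and is used here purely for illustration.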
To address the challenges of correctness, completeness and emergence in mention detection, we employ our recent work [10] to recognize the boundaries of mentions in text that are likely to denote named entities or nominal entities. For each detected mention, its candidate entities are then extracted using xlid-lexica. Finally, a graph-based disambiguation method determines the final entity for each mention based on features of both mention-entity compatibility and entity-entity coherence [11]. Statistics on the detected entities in the dataset are shown in Table 1.

2.4 Entity-based Text Categorization

Besides entities, it is also interesting to study how public discussions on different media channels correlate with subjects of concern, e.g., Euroscepticism. This can be addressed by making use of the full potential of the structured knowledge base in the background. In most semantic knowledge bases, such as DBpedia, entities are organized in a category hierarchy. For example, the entity Brexit has

its parent category Category:Euroscepticism in the United Kingdom, which in turn is a subcategory of Category:Euroscepticism. By utilizing this category hierarchy, the next development module, entity-based text categorization, aims to derive the categories related to media items based on their mentioned entities. For this purpose, we first asked sociologists to decide on the relevant categories in DBpedia for the Brexit use case, resulting in a set of 73 candidate categories, e.g., Category:Immigration to Europe, Category:European Union law and Category:Economy of the European Union. Given a media item, each detected entity and all its reachable candidate categories in DBpedia are then added to a directed graph, where the scores of all categories are computed by a random walk algorithm based on both entity-category associations and category-category dependencies (see [12] for more details). Finally, the top-5 categories with the highest scores are output for each media item. Statistics on the derived categories in the dataset are shown in Table 1.

Fig. 2. The schema used in BreXLiMe. Apart from sioc:has_creator (used for social media items) and dcterms:publisher (used for both news and social media items), the listed classes and relations are used for data from all three media sources.

3 Data Modeling

The semantic integration of cross-lingual and cross-media data streams poses a new challenge: identifying a common model that suits the diversity of media data from different sources and the output of the development methodology discussed

in Sec. 2. To address this challenge, we introduce a general data model for the media domain, which enables semantic integration of media data across multiple modalities, languages and sources, and thus allows for seamless semantic access to media data streams in combination with additional background knowledge.

The data model used by BreXLiMe is defined as an RDF vocabulary and tailored specifically to the different modalities: text, audio and video. It extends other vocabularies, such as Dublin Core, SIOC and KDO. Its main schema is depicted in Fig. 2. Similar to the Web Annotation Model, it allows text and video or audio streams to be related to entities and categories in the knowledge base. In this work, we refrained from using the Web Annotation Model in order to reduce the number of unnecessary blank nodes and thus the number of joins at query time. For each entity annotation, the predicates that define the start and end positions of the entity mention are used in a flexible manner: they may denote character positions in the case of text, or milliseconds or frame numbers in the case of audio or video. Each category annotation captures one topic of the media content. In any case, each entity mentioned in a media item and each topic covered by it should relate to a resource in the knowledge base, namely an entity or a category in DBpedia.

Fig. 3. Example of an annotated social media item in RDF (Turtle format).

Based on the schema shown in Fig. 2, we model the annotated media data as RDF triples, which are available online as RDF dumps for further processing and analysis. An example of an annotated social media item modeled in RDF is shown in Fig. 3. A SPARQL endpoint is provided for querying the annotated media data. This enables restrictions and aggregates on multiple modalities, languages and

media sources as well as a combination with additional background knowledge in the knowledge base, which will be discussed in Sec. 4.

Fig. 4. Examples of SPARQL queries for (a) Brexit data retrieval and (b) analytics.

4 Applications

The availability of the BreXLiMe dataset can facilitate many applications for the use case of Brexit. To demonstrate this, we present BreXearch, a semantic search system for cross-lingual and cross-media data retrieval and analytics for Brexit. In the following, we show two major features of BreXearch.

Brexit Data Retrieval. Modern search engines are limited in their semantic processing capabilities: the retrieved Web content has to be in the same language as the search keywords and cannot be integrated across different media channels. BreXearch aims to break the barriers between languages and modalities to provide seamless semantic access to media data regarding Brexit. Through the semantic integration of BreXLiMe across multiple modalities, languages and media sources, BreXearch supports cross-lingual and cross-media Brexit data retrieval by means of both entities and categories. For example, to find the latest 100 media items in German about the subject Immigration to Europe from all three media channels, the SPARQL query in Fig. 4 (a) can be used.

Brexit Data Analytics. Advanced data analytics in the media domain has become a necessity that modern search engines currently cannot support. Using the knowledge extracted by BreXLiMe from different media and languages in combination with additional background knowledge in DBpedia, BreXearch allows us to ask complex questions regarding Brexit, such as Which

politicians from the Conservative Party of the UK were most present in social media in the last two weeks before the Brexit referendum in different languages? Such a question can be answered by the SPARQL query shown in Fig. 4 (b). More importantly, thanks to the variety of data in BreXLiMe, BreXearch gives us the ability to study differences and commonalities among these media channels and languages.

5 Conclusions

In this paper, we present BreXLiMe, a semantically enriched dataset regarding Brexit that supports different media and languages. Besides the dataset itself, the BreXLiMe development methodology provides a general solution to cross-lingual and cross-media knowledge extraction from various multilingual media sources. In addition, the data model used by BreXLiMe for describing and integrating knowledge extracted from different media and languages is based on RDF and Linked Open Data standards and can thus serve as a blueprint for publishing other datasets in the media domain. Furthermore, our semantic search system BreXearch shows that BreXLiMe can serve as a data basis for applications and studies on the case of Brexit. As future work, we would like to provide datasets for research based on other use cases, such as the 2016 US presidential election, by applying the presented development methodology and data modeling.

References

1. Howard, P.N., Kollanyi, B.: Bots, #strongerin, and #brexit: Computational propaganda during the UK-EU referendum. CoRR abs/ (2016)
2. Llewellyn, C., Cram, L.: Brexit? analyzing opinion on the UK-EU referendum within twitter. In: Proceedings of the Tenth International Conference on Web and Social Media, Cologne, Germany, May 17-20, 2016. (2016)
3. Khatua, A., Khatua, A.: Leave or remain? deciphering brexit deliberations on twitter. In: IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, December 12-15, 2016, Barcelona, Spain. (2016)
4. Hürlimann, M., Davis, B., Cortis, K., Freitas, A., Handschuh, S., Fernández, S.: A twitter sentiment gold standard for the brexit referendum. In: Proceedings of the 12th International Conference on Semantic Systems, SEMANTICS 2016, Leipzig, Germany, September 12-15, 2016. (2016)
5. Vicario, M.D., Zollo, F., Caldarelli, G., Scala, A., Quattrociocchi, W.: The anatomy of brexit debate on facebook. CoRR abs/ (2016)
6. Lansdall-Welfare, T., Dzogang, F., Cristianini, N.: Change-point analysis of the public mood in UK twitter during the brexit referendum. In: IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, December 12-15, 2016, Barcelona, Spain. (2016)
7. Celli, F., Stepanov, E.A., Poesio, M., Riccardi, G.: Predicting Brexit: Classifying Agreement is Better than Sentiment and Pollsters. In: The Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media. (2016)
8. Zhang, L., Thalhammer, A., Rettinger, A., Farber, M., Mogadala, A., Denaux, R.: The xlime system: Cross-lingual and cross-modal semantic annotation, search and recommendation over live-tv, news and social media streams. Web Semantics: Science, Services and Agents on the World Wide Web (2017)
9. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - A crystallization point for the Web of Data. J. Web Sem. 7(3) (2009)
10. Zhang, L., Dong, Y., Rettinger, A.: Towards entity correctness, completeness and emergence for entity recognition. In: WWW (Companion Volume). (2015)
11. Zhang, L., Rettinger, A.: X-LiSA: Cross-lingual Semantic Annotation. PVLDB 7(13) (2014)
12. Zhang, L., Xu, Y., Rettinger, A.: A joint method for entity linking and text categorization by exploiting knowledge bases. Technical report, Institut AIFB, KIT, el tc.pdf (2017)
