Converting and Representing Social Media Corpora into TEI: Schema and Best Practices from CLARIN-D

Similar documents
Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN

A Discourse-structured Blog Corpus for German: Challenges of Compilation and Annotation

<thread> <posting>... </posting> </thread> A TEI Schema for the Annotation of CMC Genres

WebAnno: a flexible, web-based annotation tool for CLARIN

Figure 0.1: RapidMiner user interface: import/export operators: reading and writing of data

Néonaute: mining web archives for linguistic analysis

How can CLARIN archive and curate my resources?

Sustainability of Text-Technological Resources

(Some) Standards in the Humanities. Sebastian Drude CLARIN ERIC RDA 4 th Plenary, Amsterdam September 2014

Erhard Hinrichs, Thomas Zastrow University of Tübingen

CLARIN for Linguists Portal & Searching for Resources. Jan Odijk LOT Summerschool Nijmegen,

Integrating TEI and EAD

Contents. List of Figures. List of Tables. Acknowledgements

DIGITAL MARKETING For your Company

WebLicht: Web-based LRT Services in a Distributed escience Infrastructure

SharePoint 2013 End User Level II

CORLI. a linguistic consortium for corpus, language and interaction

Formats and standards for metadata, coding and tagging. Paul Meurer

On the way to Language Resources sharing: principles, challenges, solutions

Implementation of the Data Seal of Approval

Final Project Discussion. Adam Meyers Montclair State University

Modeling Linguistic Research Data for a Repository for Historical Corpora. Carolin Odebrecht Humboldt-Universität zu Berlin LAUDATIO-repository.

Annotating Spatio-Temporal Information in Documents

EUDAT. A European Collaborative Data Infrastructure. Daan Broeder The Language Archive MPI for Psycholinguistics CLARIN, DASISH, EUDAT

A Multilingual Social Media Linguistic Corpus

Knowledge Hub Walkthrough

clarin:el an infrastructure for documenting, sharing and processing language data

SharePoint 2013 End User Level II

Common Language Resources and Technology Infrastructure REVISED WEBSITE

Utilising ANNIS for search and analysis of historical data

verapdf: definitive, open source PDF/A validation for digital preservationists

Conversion and Annotation Web Services for Spoken Language Data in CLARIN

SocialMiner Configuration

Bringing Europeana and CLARIN together: Dissemination and exploitation of cultural heritage data in a research infrastructure

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Digital Messaging Center Feature List

Data Exchange and Conversion Utilities and Tools (DExT)

COLLABORATIVE EUROPEAN DIGITAL ARCHIVE INFRASTRUCTURE

Information Retrieval

Cf. Gasch (2008), chapter 1.2 Generic and project-specific XML Schema design", p. 23 f. 5

Unit 3 Corpus markup

UIMA-based Annotation Type System for a Text Mining Architecture

CLARIN s central infrastructure. Dieter Van Uytvanck CLARIN-PLUS Tools & Services Workshop 2 June 2016 Vienna

EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates

7. METHODOLOGY FGDC metadata

verapdf Industry supported PDF/A validation

Technical and Financial Proposal SEO Proposal

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword.

The Internet and World Wide Web. Chapter4

Background and Context for CLASP. Nancy Ide, Vassar College

This document is a preview generated by EVS

Implementing a Variety of Linguistic Annotations

1. CONCEPTUAL MODEL 1.1 DOMAIN MODEL 1.2 UML DIAGRAM

Lou Burnard Consulting

Reflexive Community Information Systems

Tools in Fronter, for student

Text Encoding Fundamentals: Element list

Gaggle 101 User Guide

WhereScape RED Installation

Metadata Workshop 3 March 2006 Part 1

Introduction to Text Mining. Aris Xanthos - University of Lausanne

Janne Bondi johannessen, Anders Nøklestad, Joel Priestley and Kristin Hagen. WP5: Glossa Integration

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

To start work, go to Settings to create folders, file archives, forum and web pages.

The Virtual Language Observatory!

Novell Vibe 3.4. Novell. July Quick Start. Starting Novell Vibe. Getting to Know the Novell Vibe Interface and Its Features

CompuScholar, Inc. Alignment to Utah's Web Development I Standards

The Road to 200,000 Downloads: The Cesium Story. Sarah

SEXTANT 1. Purpose of the Application

VisAssist Web Navigator

Working towards a Metadata Federation of CLARIN and DARIAH-DE

Traffic Triggers Domain Here.com

Content Management Features & Benefits

User s Guide Your Personal Profile and Settings Creating Professional Learning Communities

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge

Introduction

Melvyl Webinar UC and OCLC Roadmap Discussion

What s new in SharePoint Search 2010 for end users. IW109 Mirjam van Olst

Annotation Science From Theory to Practice and Use Introduction A bit of history

C. M. Sperberg-McQueen Black Mesa Technologies LLC. Oliver Schonefeld Institut für Deutsche Sprache (IDS)

Best practices in the design, creation and dissemination of speech corpora at The Language Archive

Comp 336/436 - Markup Languages. Fall Semester Week 2. Dr Nick Hayward

Data for linguistics ALEXIS DIMITRIADIS. Contents First Last Prev Next Back Close Quit

Towards a roadmap for standardization in language technology

User Documentation. Studywiz Learning Environment. Student's Guide

Automatic Transcription of Speech From Applied Research to the Market

Types of Lesson Content

A Repository of Metadata Crosswalks. Jean Godby, Devon Smith, Eric Childress, Jeffrey A. Young OCLC Online Computer Library Center Office of Research

PDF statistics the universe of electronic documents

Introducing IBM Lotus Sametime 7.5 software.

An overview of the OAIS and Representation Information

Institutional Repository using DSpace. Yatrik Patel Scientist D (CS)

ICD Wiki Framework for Enabling Semantic Web Service Definition and Orchestration

CONTENT CALENDAR USER GUIDE SOCIAL MEDIA TABLE OF CONTENTS. Introduction pg. 3

Informatics 1: Data & Analysis

SHARING GEOGRAPHIC INFORMATION ON THE INTERNET ICIMOD S METADATA/DATA SERVER SYSTEM USING ARCIMS

ENCODING TEXTS FOR VISUALIZATION AND ANALYSES USING THE TEI STANDARD

Online Communication. Chat Rooms Instant Messaging Blogging Social Media

INDEPTH Network Introduction to NADA

THE GREAT CONSOLIDATION: ENTERTAINMENT WEEKLY MIGRATION CASE STUDY JON PECK, MATT GRILL, PRESTON SO

Transcription:

Converting and Representing Social Media Corpora into TEI: Schema and Best Practices from CLARIN-D Michael Beißwenger, Eric Ehrhardt, Axel Herold, Harald Lüngen, Angelika Storrer

Background of this talk: CMC 2 TEI Computer-mediated communication (CMC) / social media : interaction mediated through computer networks (the internet), e.g. chats, forums, communication on social network sites, blog comments, tweets, Wikipedia talk pages, SMS and whatsapp conversations etc. New genres which the TEI has not yet envisioned (citation from TEI-P5 chapter about customization) TEI-SIG computer-mediated communication (CMC-SIG): How can TEI-P5 be adapted for the representation of CMC genres and corpora? Options: 1) customization 2) extension of standard CMC: not a niche phenomenon but with a high (still growing) impact on people s everyday lives, interactions, language and society

Background of this talk: CMC 2 TEI TEI-SIG computer-mediated communication (CMC-SIG): Next planned step (spring 2017): feature request for a basic structure for CMC based on the models discussed and schemas developed so far Results of the SIG work so far: Three schemas for modeling CMC corpora in TEI Several existing corpora (French & German) for different CMC genres (Wikipedia discussions, SMS, Twitter, chat etc.) that have entirely been represented using these schemas Most recent schema: CLARIN-D CMC-TEI : focus especially (but not only) on written CMC genres developed and used for remodeling an existing chat corpus for German

Schemas and resources of the CMC-SIG http://wiki.tei-c.org/index.php/sig:computer-mediated_communication

Structure of the talk 1) Project background of the most recent schema ( CLARIN-D CMC-TEI ) 2) General challenge of modeling CMC in TEI 3) Outline of the schema on the example of selected features 4) Summary and outlook

ChatCorpus2CLARIN: Project background Curation project of the CLARIN-D F-AG 1 German Philology Period: May 2015 February 2016 Task: develop a workflow and resources for the integration of an existing chat corpus into the CLARIN-D research infrastructure for language resources and tools in the Humanities and the Social Sciences (http://clarin-d.de). One work package: remodeling of the corpus in a TEI-compliant format. Project team: Michael Beißwenger (U Dortmund / DUE), Angelika Storrer, Eric Ehrhardt (U Mannheim), Harald Lüngen (IDS Mannheim), Axel Herold (BBAW, Berlin) + other colleagues at the CLARIN-D hubs at IDS and BBAW. http://www.clarin-d.de/en/curationproject-1-3-german-philology

The original resource (chat corpus 1.0) Dortmund Chat Corpus http://www.chatkorpus.tu-dortmund.de scope: Language use and linguistic variation in German chats corpus size: 478 logfiles with 140240 user posts / 1 million words availability: online for download since 2005 (together with a simple query tool, addressees: linguists) + as a collection of HTML pages (for online browsing, addressees: German teachers) XML-annotated on basis of a homegrown XML format ( ChatXML ) which describes: (1) the basic structure and properties of logfiles and postings ( messages ) (2) selected items on the micro-level of user posts (emoticons, acronyms, addressing terms, nickname mentions) (3) selected metadata about the chat platforms and users.

Goals of the project (1) Sustainability: Integrate the only CMC corpus for German which is freely availabe (and which is used by many researchers) into the CLARIN-D corpus infrastructure Additional lingustic annotations (part-of-speech) that will allow or more sophisticated corpus queries Interoperability: Represent the corpus compliant to an established standard in the Digital Humanities and thus make it interoperable with other resources in the CLARIN-D corpus infrastructure

Goals of the project (2) Create a showcase which demonstrates what researchers can gain when CMC corpora are made available for the community, CMC corpora as part of big, annotated corpus collections can be analyzed in combination with other language resources (text and speech corpora at IDS and BBAW) Why TEI? Therefore! Intended model character of the solutions developed in the project: solutions should be useful not only for the modeling and integration of the chat corpus but also for the modeling and integration of other CMC corpora into CLARIN-D (future projects)

Modeling CMC in TEI: The challenge Fundamental challenge: Written CMC shares characteristics both with text and spoken conversation... o Just like spoken conversation (and different from text), CMC is dialogic interaction in which each communivative move creates/changes the context for follow-up moves. o Just like text documents and different from spoken conversation, written CMC is organized through the exchange of stretches of written text which have completely been composed before they are transmitted and read. A basic model for the representation of user contributions to written CMC should reflect these properties.

CLARIN-D CMC-TEI @TEI CMC SIG Schema ODD and RNG files: http://wiki.tei-c.org/index.php?title=sig:cmc/clarindschema

CLARIN-D CMC-TEI Customisation: Introduction of specific models for CMC-specific concepts: e.g model.divpart.cmc, <post>, @auto Definition of best practices for the use of standard TEI models (without modification; e.g. <div>, <w>,@type, <participantlist>, <timeline>

The basic element for CMC: <post> Post: a stretch of text characterized by the following features: (1) produced to function as a contribution to an ongoing CMC interaction; (2) the process of verbalization is prior to (and finished before) the act of making it available for the addressees/interaction partners; (3) it is the largest unit of user generated content that is handed over to the technical system at once; (4) it is the atomic building block of CMC macrostructures (logfiles, threads, twitter timelines, ) Example from a social chat: A04: milkaq: wir kennten dich dann auch im Beiboot aussetzen A11: OK. Ich kann ja schwimmen. A02: KäptnMcMike holt ein Haustierchen aus der Kajüte

Example: posts in chat

model.divpart.cmc (HTML from ODD)

<post> declaration (HTML from ODD)

Extending the <post> model with attributes post metadata giving information that the researcher has reconstructed through interpretation and analysis: @replyto indicates to which previous post the current post replies or refers to. optional attribute which can be used for the annotation of thread structures

Extending the <post> model with attributes Posts which have been created not by a human participant of the interaction but by the system should be marked as automatically generated. The same holds for elements in the content of posts which have been automatically created, e.g. user signatures in posts on Wikipedia talk pages. @auto new attribute in att.global (type: data.xtruthvalue). marks whether the content of the respective element was automatically generated.

Best practices: Usage of available attributes at <post> @type, @who, @synch, @rend, <post auto="true" rend="color:blue" synch="#f1101004.t046" type="event" who="#f1101004.a01_system" xml:id="f1101004.m137"> <time> 22:01 </time> <name corresp="#f1101004.a04" type="nickname"> [_MALE-TEACHER-A04_]</name> entered the room <name type="roomname">[_roomname_]</name> at <time> 22:01:55 </time> </post>

Best practices: POS tags Inline annotation using <w> and @type Tag set STTS-IBK (Beißwenger et al. 2015) <post auto="false" rend="color:black" synch="#f1101004.t047" type="standard" who="#f1101004.a03" xml:id="f1101004.m138"> <time> 22:02 </time> <anchor type="sentence_start"/> <w lemma="gut" type="adjd" xml:id="f1101004.m138.t1">gut</w> <w lemma="," type="$," xml:id="f1101004.m138.t2">,</w> <w lemma="die" type="pds" xml:id="f1101004.m138.t3">das</w> <w lemma="sollen" type="vmfin" xml:id="f1101004.m138.t4">sollte</w> [ ] </post>

Best practices: Chat metadata in <teiheader> We use <profiledesc>:<particdesc> for descriptions of the chat users this is in line with standardisation for speech corprora (ISO 24624, as of 2016) <profiledesc> <particdesc> <listperson> <person role="system" xml:id="f1101006.a01_system"> <persname type="nickname">system</persname> <sex evidence="estimated">system</sex> </person> <person role="expert" xml:id="f1101006.a02"> <persname type="nickname">[_male-expert- A02_]</persName> <sex evidence="estimated">male</sex>

Best practices: Chat metadata in <teiheader> Unlike in the standard for speech corpora, we put the <timeline> under <textdesc>:<interaction> In the timeline, absolute timepoints are recorded <textdesc> [ ] <interaction> <timeline> <when absolute="11:04:00" xml:id="f1101006.t001"/> <when absolute="11:11:00" xml:id="f1101006.t002"/> <when absolute="11:18:00" xml:id="f1101006.t003"/>

Best practices: Chat metadata in <teiheader> We used <recordingstmt> in <sourcedesc> for metadata about media (chat platform) <recordingstmt> <recording> <equipment> <p>plattformname=<name type="oth">[_chatplatform_]</name></p> <p>plattformurl=<ref type="url">[_wwwurl_]</ref></p> </equipment> [ ]

Result of the project: CLARIN-D-conformant resource Dortmund Chat corpus 2.0 # chat log files 470 # posts 131,033 # tokens 1,005,166 file Size (TEI-XML) 100MB

Availability of the integrated resource Integration in CLARIN-D repository at IDS: done Integration in CLARIN-D repository at BBAW: this week Downloadable only when full anonymisation is finished

Availability of the integrated resource To be integrated in the German Reference Corpus DeReKo at IDS and searchable through COSMAS II To be integrated in the DWDS corpus query interface at BBAW Will also be searchable via CLARIN web services and the CLARIN Federated Content Search

Summary and outlook (1) On the example of a schema developed for remodeling the Dortmund chat corpus we presented a rationale and selected models & best practices for the representation of CMC genres in TEI (based on models discussed and tested in the TEI-SIG computer-mediated communication ) Goal of this talk: Draw the attention of the TEI community the current state of the schemas developed by members of the SIG in order to ask for feedback, comments etc. on our ideas. Next milestone for the SIG: Submit a feature request on Github based on the schemas developed so far and the comments received from the community (feature requests planned for spring/summer 2017)

Summary and outlook (2): open issues How to encode <post>-specific meta data? In the project Chat2CLARIN, we chose to encode them using more and more attributes (regular and customised) <post auto="true" rend="color:blue" synch="#f1101004.t046" type="event" who="#f1101004.a01_system" xml:id="f1101004.m137"> [ ]

Summary and outlook (2): open issues How to encode <post>-specific meta data? However, you might alternatively want to encode them in a separate <postheader> in the <post>, containing e.g. feature structures <post auto="true" > <postheader> <fs> <f name= status"> <symbol value= delivered"/> </f> [ ]

Summary and outlook (2): open issues How to encode <post>-specific meta data? Or define <post> as a declaring element and point to post-specific metadata contained in declarable elements in the teiheader (cf. e.g. Stadler et al. 2016) <post auto="true decls= #correspdesc_01" > [ ]

Summary and outlook (3): Get in touch! Wiki page of the CMC-SIG with schemas (ODD & RNG), documentation of previous activities etc.: http://wiki.tei-c.org/index.php/sig:computer- Mediated_Communication Google Group (for feedback, comments etc.): https://groups.google.com/forum/#!forum/tei-cmc tei-cmc@googlegroups.com... or in the SIG meeting right after the coffee break!

SIG meeting

Converting and Representing Social Media Corpora into TEI: Schema and Best Practices from CLARIN-D Thank you! michael.beisswenger@uni-due.de luengen@ids-mannheim.de