What is Text Mining? Sophia Ananiadou National Centre for Text Mining University of Manchester

Similar documents
Text mining tools for semantically enriching the scientific literature

National Centre for Text Mining NaCTeM. e-science and data mining workshop

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E

Historical Text Mining:

Digital repositories as research infrastructure: a UK perspective

Customisable Curation Workflows in Argo

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

OKKAM-based instance level integration

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009

Turning Text into Insight: Text Mining in the Life Sciences WHITEPAPER

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered.

Enabling Open Science: Data Discoverability, Access and Use. Jo McEntyre Head of Literature Services

PROJECT PERIODIC REPORT

Unstructured Text in Big Data The Elephant in the Room

SAPIENT Automation project

Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

UK Institutional Repository Search Project

Information mining and information retrieval : methods and applications

TEXT MINING: THE NEXT DATA FRONTIER

Overview of Web Mining Techniques and its Application towards Web

TURNING TEXT INTO INSIGHT: TEXT MINING IN THE LIFE SCIENCES

An Approach To Web Content Mining

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

24 th July 2007 HL7 Conference London. Dr Lester Russell. Chief Medical Officer. NHS Account Fujitsu Services

The Final Updates. Philippe Rocca-Serra Alejandra Gonzalez-Beltran, Susanna-Assunta Sansone, Oxford e-research Centre, University of Oxford, UK

Data Management Glossary

Information Retrieval

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

MIND THE GOOGLE! Understanding the impact of the. Google Knowledge Graph. on your shopping center website.

Monthly SEO Report. Example Client 16 November 2012 Scott Lawson. Date. Prepared by

The Metadata Assignment and Search Tool Project. Anne R. Diekema Center for Natural Language Processing April 18, 2008, Olin Library 106G

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

COURSE LISTING. Courses Listed. Training for Database & Technology with Modeling in SAP HANA. Last updated on: 30 Nov 2018.

GoDaddy Blog: Learn How To Deploy Your Blog Globally - One Market at A Time

Opus: University of Bath Online Publication Store

Ontology Extraction from Heterogeneous Documents

Web Mining Evolution & Comparative Study with Data Mining

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge

Information Retrieval, Information Extraction, and Text Mining Applications for Biology. Slides by Suleyman Cetintas & Luo Si

Toward a Knowledge-Based Solution for Information Discovery in Complex and Dynamic Domains

Who is behind terrorist events?

Semantic Technologies for Nuclear Knowledge Modelling and Applications

This report is based on sampled data. Jun 1 Jul 6 Aug 10 Sep 14 Oct 19 Nov 23 Dec 28 Feb 1 Mar 8 Apr 12 May 17 Ju

Enhanced retrieval using semantic technologies:

The GENIA corpus Linguistic and Semantic Annotation of Biomedical Literature. Jin-Dong Kim Tsujii Laboratory, University of Tokyo

Things to consider when using Semantics in your Information Management strategy. Toby Conrad Smartlogic

VISO: A Shared, Formal Knowledge Base as a Foundation for Semi-automatic InfoVis Systems

EBP. Accessing the Biomedical Literature for the Best Evidence

From Open Data to Data- Intensive Science through CERIF

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Getting in Gear with the Service Catalog

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Text Mining. Representation of Text Documents

Smart Protection Network. Raimund Genes, CTO

Text Mining: A Burgeoning technology for knowledge extraction

Linked Data in Archives

Proven video conference management software for Cisco Meeting Server

Software Requirements Specification for the Names project prototype

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics

COURSE LISTING. Courses Listed. Training for Database & Technology with Modeling in SAP HANA. 20 November 2017 (12:10 GMT) Beginner.

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Understanding the workplace of the future. Artificial Intelligence series

KNOWLEDGE GRAPH: FROM METADATA TO INFORMATION VISUALIZATION AND BACK. Xia Lin College of Computing and Informatics Drexel University Philadelphia, PA

Acquiring Experience with Ontology and Vocabularies

University of Bath. Publication date: Document Version Publisher's PDF, also known as Version of record. Link to publication

Practical Machine Learning Agenda

A Framework for BioCuration (part II)

Scuola di dottorato in Scienze molecolari Information literacy in chemistry 2015 SCOPUS

Extending the Facets concept by applying NLP tools to catalog records of scientific literature

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

Semantic MediaWiki (SMW) for Scientific Literature Management

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases

Semantic Web Mining and its application in Human Resource Management

Use of Semantic Technologies at Eli Lilly and Company. J Phil Brooks Information Consultant, SE Data Team Discover IT Eli Lilly and Company

A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Outline. 1 Introduction. 2 Semantic Assistants: NLP Web Services. 3 NLP for the Masses: Desktop Plug-Ins. 4 Conclusions. Why?

Customising Location of Knowledge. Ann Apps and Ross MacIntyre MIMAS, The University of Manchester, UK

Robin Wilson Director. Digital Identifiers Metadata Services

Semi-Automatic Conceptual Data Modeling Using Entity and Relationship Instance Repositories

For the mobile people of. Oulu

An Entity Name Systems (ENS) for the [Semantic] Web

Chapter 2. Architecture of a Search Engine

Searching the Deep Web

Managing Learning Objects in Large Scale Courseware Authoring Studio 1

e-sens Nordic & Baltic Area Meeting Stockholm April 23rd 2013

Search Engines Information Retrieval in Practice

Web Services Take Root in Banks and With Asset Managers

Getting Security Operations Right with TTP0

Information Retrieval (IR) through Semantic Web (SW): An Overview

Engineering education knowledge management based on Topic Maps

Natural Language Processing with PoolParty

COURSE LISTING. Courses Listed. with Governance, Risk and Compliance (GRC) SAP BusinessObjects. 19 February 2018 (15:13 GMT) GRC100 -

ANNUAL REPORT Visit us at project.eu Supported by. Mission

Languages and tools for building and using ontologies. Simon Jupp, James Malone

Transcription:

National Centre for Text Mining www.nactem.ac.uk University of Manchester

Outline Aims of text mining Text Mining steps Text Mining uses Applications 2

Aims Extract and discover knowledge hidden in text automatically Aid domain experts by automatically: identifying concepts extracting facts/relations discovering implicit links generating hypotheses 3

Why do we need text mining? Too much information! Information overload Growth of searches Searches (millions) 90 80 70 60 50 40 30 20 10 0 Jan-97 Aug-97 Mar-98 Medline searches over time Oct-98 May-99 Dec-99 Jul-00 Feb-01 Sep-01 Month/year Apr-02 Nov-02 Jun-03 Jan-04 Aug-04 Mar-05 Oct-05 Information overlook Many heterogeneous resources Text (blogs, papers, surveys, news..) Databases Web Ontologies 4

The role of text in information management We are inundated with huge amounts of data Unstructured information (text) Different text types, genres, domains.. Semi-structured information (XML + text) Structured information (databases) We need to make sense of data We need to manage information and knowledge effectively 5

UK National Centre for Text Mining (NaCTeM) The 1 st national text mining centre in the world www.nactem.ac.uk Remit: Provision of text mining services to support UK research Funded by: the JISC, BBSRC, EPSRC Phase I (2005-2008): biology Phase II (2008-2011) : bio-medicine, social sciences humanities 6

Why is there a need in the UK for a national centre for text mining? Some researchers knew they wanted TM TM key component of e-science Involve more researchers (from all domains) in doing e-science and e- research TM seen as key technology for researchers And one applicable in every domain (broad interest/support from major funding bodies) 7

Embedding Text Mining within e- Science in the UK e-science [ ] enables new research and increases productivity through shared e-infrastructure, the development of computational and logical models and new ways to discover and use the growing range of distributed and interoperable resources. It supports multidisciplinary and collaborative working and a culture that adopts the emerging methods. M. Atkinson (2007) Beyond e-science 8

What the users want to do with their data (minimally) Easier access to data Sharing data with their peers Annotating data with metadata TEXT MINING RIDES TO THE RESCUE Managing data across locations Integrating data within workflows, Web Services Aids for semantic metadata creation; enriching data with related metadata e.g. experimental results 9

Information Retrieval Semantic metadata From Text to Knowledge: tackling the data deluge through text mining Unstructured Text (implicit knowledge) Information extraction Knowledge Discovery Advanced Information Retrieval Structured content (explicit knowledge)

Text Mining Steps (1/3) Information Retrieval (Google) Finding within a large document collection, a subset of documents, relevant to a user s query Query term blood or Boolean query blood pressure Too many documents are retrieved, prohibitively large for humans to read Many retrieved documents are irrelevant to our query Many relevant documents are missing 11

Text Mining Steps (2/3) Information Extraction, nuggets of text Identify information nuggets from text, fill existing templates, create structured information, populate text databases Slot Date Location Victim injured Victim attacked Perpetrator Information 7/10/96 (today) SanSalvador policeman guards urban guerrillas San Salvador, 7/10/96 It has been officially reported that a policeman was wounded today when urban guerrillas attacked the guards at a power substation located downtown San Salvador. 12

Information Extraction recognise specific relations and events (typically expressed by verbs attacked) domain restricted (newswire, biology ) more complex NLP techniques more than bag of words approach deep parsing Output: filled templates of entities and facts IE extracts only what we are looking for, i.e. what has been defined by patterns 13

Text Mining Steps (3/3) Data Mining: finds associations among pieces of information extracted from many different texts, implicit links Integration with databases, ontologies 14

Integrating Text with DBs and Ontologies Databases Semantic Interpretation of data Adding new knowledge Ontological resources Supporting semantics text

Uses (1/ Business applications Business intelligence (market analysis), competitors, identify risks, make predictions Customer views and opinions from blogs (opinion mining, trends analysis) Find nuggets of relevant information immediately, systematically Remove tedious process of finding information 16

Uses (2/3) Media Semantic searching needed for new models of information access, linking and extraction Personalisation of searching Document classification and clustering based on personalised queries Social networking + text mining Topic clusters of news Frame analysis of news 17

Uses (3/3) Text Mining enriches text with semantic annotations For authors tools for semantic annotation Intelligent information management For publishers enrichment of digital libraries For scientists intelligent searching, linking and integration of text, databases 18

Applications extracting terms 19

Terms derived from text used as Index terms

Problems term variation & ambiguity Acronyms ER estrogen receptor emergency room endoplasmic reticulum Spelling Tumour tumor Oestrogen - estrogen NF-kB, NF-KB, NF-kappa B, nuclear factor kappa B Gene terms may be also common English words BAD human gene encoding BCL-2 family of proteins bad news, bad prediction 21

Semantic searching based on named-entity annotation The peri-kappa B site mediates human immunodeficiency DNA virus virus type 2 enhancer activation in monocytes cell_type Entity types (defined by Ontologies) Genes/protein names Enzymes, substances, etc. Names (people, organisations ) 22

KLEIO architecture 23

Fewer documents with more precise query

Extracting associations between entities Click! 26

Alzheimer's disease and schizophrenia. Interestingly, nicotine and similar compounds have been shown to enhance memory function and increase the expression of nachrs and therefore, could have a therapeutic role in the aforementioned diseases.

Text mining for social sciences ASSERT project (NaCTeM-NCeSS-EPPI) Assisting the process of systematic reviewing in social sciences Engaging the user community: EPPI (Evidence for Policy and Practice Information and Co-ordinating Centre) Document classification Information extraction Summarisation 28

The process of Systematic Reviewing Searching: extensive searches to locate as much relevant research as possible according to a query. Screening: narrows the scope of search to only the relevant documents to a specific review. Highlights key evidence and results that may impact on the policy. Synthesizing: correlates evidence from a plethora of resources and summarises the results. The process is mainly manual 29

Overview of ASSERT Document Collections Document Clustering Document Classification Multi-Document Summarisation Search Screen Synthesize Query Expansion Term Extraction Document Sectioning Sentence Extraction 30

BBC Pilot project Analyse, structure and visualise BBC news online, according to a user s query using advanced text mining techniques Concept discovery and retrieval interface allows a user to enter a query across the document collection and automatically calculate a list of concepts specific to the query and ranked by perceived importance. 31

Finding news using text mining 32

Clustering the documents 33

Visualising the results 34

Benefits to Users Provision of a focused search with goal based results Allows expansion beyond known keywords for a more complete search Visualization of a result set creates an overview of the research in the domain Saves time and effort 11 th April 2008 What ASSERT is Text Project Mining? 35

What do our users/clients use our services for? Creation of controlled vocabularies, extraction of interactions, creation of models and networks, database curation (BOOTStrep) Bibliographic searching, automatic classification and semantic extraction in support of subject searching (ASSERT, INTUTE) Ontology building Media frame analysis (ASSIST) Semantic extraction to support indexing and linking across repositories (INTUTE) Extracting bio-processes, gene-disease mining (Pfizer) Maintaining and constructing pathways (REFINE) Classification of on-line news feeds, document classification Topic detection (BBC) 36

Text Mining for Knowledge Discovery Semantic enrichment relies on NLP based TM technologies Applications based on semantically annotated texts enable knowledge discovery Linking text with domain knowledge Integrate other knowledge sources, ontologies, terminologies Integration of distributed TM software (workflows) 37

38