Science 2.0 VU Processing Science 2.0 Data, Content Mining


1 Science 2.0 VU: Processing Science 2.0 Data, Content Mining
Elisabeth Lex, KTI, TU Graz, WS 2015/16
WISSEN · TECHNIK · LEIDENSCHAFT

2 Agenda
- Repetition from last time: Open Science
- Processing academic resources
- Mining in academic resources (content perspective)
- Example: ContentMine: extraction of scientific facts

3 Repetition: Open Science
- Open Science: ideas, concepts, benefits and pitfalls
  - E.g. enhancing collaboration and community building, increasing efficiency of research vs. no reward system yet
- Open Data: sharing your data influences how often you get cited (Piwowar et al., 2007 and Piwowar et al., 2013)
- Different models for Open Access: Green vs. Gold vs. Hybrid

4 Open Science: 5 schools of thought

5 Example: Open Government Data: Eurostat
- I'd like to compare the unemployment rate in Austria with other European ones
- Via Google Public Data Explorer

6 Open Access in Science: Open Access Journals
- Green ("self-archiving"): the author can self-archive at the time of submission of the publication, whether the publication is grey literature (usually internal and non-peer-reviewed), a peer-reviewed journal publication, a peer-reviewed conference proceedings paper or a monograph
- Gold ("author pays"): the author or the author's institution can pay a fee to the publisher at publication time; the publisher then makes the publication available "free" at the point of access
- A further, little-used road: hybrid forms, for example platinum open access (does not charge author fees)...
- Green and gold are compatible and can co-exist
Source: Jeffery, K. Open Access: An Introduction

7 Processing Academic Resources

8 Motivation
- Aggregate scientific results
- Exploratory search in digital collections
- Find experts in domains
- Make science discoverable
- Improve access to scientific publications
- Extract facts for research
- Discover relationships
- Check for errors => improve science

9 How?
- Aggregate and manage data: repositories, aggregators, datasets,...
- Mining in Academic Resources
  - Information Extraction
  - Topic Modeling
  - Clustering/Classification
  - Linking publications
- Make available data and source code :)

10 KDD Process

11 How?
- Aggregate and manage data: repositories, aggregators, datasets,...
- Mining in Academic Resources
  - Information Extraction
  - Topic Modeling
  - Clustering/Classification
  - Linking publications
- Make available data and source code :)

12 Datasets
- The European Library Open Dataset: digital collection and 200 million bibliographic records
- Datahub.io, e.g. DBLP Computer Science Bibliography: metadata of over 1.8 million publications by 1 million authors

13 Repositories and Aggregators
- ISI Web of Science
- Scopus
- PubMed
- The European Library
- Library of Congress
- arXiv
- Figshare
- Data Citation Index
- Mendeley
- Google Scholar
- CiteSeerX
- ...

14 APIs to Repositories...
- APIs to access scientific publications and research data
- rOpenSci: arXiv, PLOS ONE, Figshare
- Mendeley: Developer API; Python package: pip install mendeley

15 Example: rOpenSci

16 How?
- Aggregate and manage data: repositories, aggregators, datasets,...
- Mining in Academic Resources
  - Information Extraction
  - Topic Modeling
  - Clustering/Classification
  - Linking publications
- Make available data and source code :)

17 Information Extraction (IE)
- Goal: extract structured information out of unstructured content, e.g.
  - Method names, quantities, temporal expressions
  - Authors from scientific publications
  - Organizations in the acknowledgements section of papers
  - References
  - ...

18 IE Process
- Input: raw text of a document
- Output: list of (entity, relation, entity) tuples
- Part-of-speech tagging: applying word classes to words within a sentence
- Named entity types: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity)

19 IE Standard Approaches (1/2)
- Regular expressions / rule-based approaches
- E.g. dates, RT@user
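The rule-based idea on this slide can be sketched with Python's standard `re` module. The two patterns below are illustrative assumptions (numeric day/month/year dates and Twitter-style "RT @user" markers), not patterns taken from the lecture; real-world formats vary far more.

```python
import re

# Assumed, illustrative patterns: numeric dates such as 18/11/15 or 25.11.2015,
# and retweet markers such as "RT @user" or "RT@user".
DATE_PATTERN = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b")
RETWEET_PATTERN = re.compile(r"\bRT\s*@(\w+)")


def extract_dates(text):
    """Return all numeric date expressions found in the text."""
    return DATE_PATTERN.findall(text)


def extract_retweeted_users(text):
    """Return the user names that follow an RT marker."""
    return RETWEET_PATTERN.findall(text)


# Made-up sample text for illustration.
sample = "RT @openscience: slides from 18/11/15 are online, next lecture on 25.11.2015"
dates = extract_dates(sample)
users = extract_retweeted_users(sample)
print(dates)
print(users)
```

Rules like these are quick to write and easy to inspect, but every new surface form needs another rule, which is one motivation for the learning-based approaches on the next slide.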

20 IE as Machine Learning Task
- Supervised: train a model with annotated training data, use the trained model to classify unknown text
  - Choose a class label for a given input
  - Identify features of language data to classify it
  - Construct language models out of them
  - Learn about text/language from these models
- Methods:
  - Classifiers: Naive Bayes, Maxent models
  - Sequence models: Hidden Markov Models, CRFs
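As a rough illustration of the supervised setting, here is a minimal multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The training texts and the two labels ("reference", "body") are made up for this sketch; in practice one would likely rely on a library such as NLTK.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case whitespace tokenisation (a deliberate simplification)."""
    return text.lower().split()


def train_naive_bayes(labeled_texts):
    """Count words per class; labeled_texts is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in labeled_texts:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior, using add-one smoothing."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            smoothed = word_counts[label][token] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training data: text snippets labelled by the kind of block they come from.
training = [
    ("proceedings of the conference pages 2012", "reference"),
    ("journal of machine learning vol 3 pages 2003", "reference"),
    ("we propose a method to extract facts", "body"),
    ("our method improves the results", "body"),
]
model = train_naive_bayes(training)
prediction = classify("in proceedings pages 2010", model)
print(prediction)
```

The same train-then-classify pattern carries over to Maxent models; sequence models such as HMMs and CRFs additionally take the order of the labels into account.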

21 Libraries
- NLTK

22 Mining academic documents
- Extraction of structural elements: tables, figures,..
- Extraction of facts from structural elements and doc
  - Named Entity Recognition (e.g. gene names,..)
  - Relation extraction (e.g. system A impacts system B)
- Mostly: PDF format
  - Good for presentation, but problems with metadata quality, hard to analyse
  - While PDF analysis tools exist, there is still room for improvement!

23 Approach
- Divide and conquer
- Extract blocks from the PDF based on structure and layout information
- Classify the extracted blocks, e.g. into title, body, references, abstract,..
- Classify the content of extracted blocks, e.g. tables
- Extract relevant info from the content (Named Entities, nouns, dates, quantities,..)

24 Approach: Extracting blocks
- Features: layout-specific, such as position, font, font size,..
- Apply Machine Learning approaches
  - Unsupervised (clustering)
  - Supervised (classification)

25 Unsupervised Approach
- Clustering: given a set of objects, find the groupings of objects so that the similarity within a group is maximized and the similarity between groups is minimized
- Cluster = block
- Successive merge and split mechanism

26 Supervised Approach
- Classification: given a set of labeled examples, create a model and use it to predict the label of unknown examples
- Classify blocks: Maximum Entropy Models
  - Create training data by labeling blocks, i.e. assigning blocks to classes
  - Learn a model based on the training data and apply it to classify unknown blocks
  - Features: layout, formatting, word frequencies,..

27 Fact Extraction from Publications
- Extract entities from within the identified blocks
  - E.g. author block: divide further to extract all authors contained in the block
- Extract relations between entities
  - Open Information Extraction
  - Learns a model without needing training data
  - Can extract binary relations from sentences

28 Example: Measuring quality of Wikipedia
- Results table: Accuracy, F-Measure, Precision and Recall, each reported as Value [%] for the unbalanced and the balanced setting
- Source: Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein, and Michael Granitzer. Measuring the quality of web content using factual information. In Proceedings of WebQuality '12 at WWW '12

29 Extract Topics from Publications
- Topic Models: algorithms that uncover thematic structure in document collections
- Facilitate searching, browsing, summarizing
- Latent Dirichlet Allocation (LDA): hierarchical probabilistic model
18/11/15

30 LDA
- Probabilistic model that helps find latent topics for documents
- Probabilistic model: treat data as observations that stem from a generative probabilistic process which involves hidden variables
- Documents: the thematic structure is the hidden variable
- Each topic is described by words in the documents

31 LDA
- Terms of the model:
  - Probability of the i-th word for doc d
  - Probability of t_i within topic z_i
  - Probability of using a word from topic z_i in the doc
- Infer hidden structure using posterior inference: what are the topics that describe the documents?
- Classify unknown data using the topic model: how does unknown data fit into the estimated topic structure?
- Number of topics Z has to be chosen in advance; it defines the level of specification of topics
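The three probabilities named on this slide match the usual decomposition of a word's probability in a topic model. The formula below is a reconstruction under that assumption, with t_i the i-th word, z_i its topic assignment, j an assumed topic index, and Z the number of topics:

```latex
P(t_i \mid d) \;=\; \sum_{j=1}^{Z} P(t_i \mid z_i = j)\, P(z_i = j \mid d)
```

Read left to right: the probability of the i-th word for document d is obtained by summing, over all Z topics, the probability of the word within topic j times the probability of using a word from topic j in the document.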

32 Example: Model evolution of topics over time in Science journal
- Dataset: pages Science from from JSTOR archive

33 Validation of extracted information
- Crowdsourcing as a way to evaluate mining quality
- Share the extracted information via e.g. a Web-based platform
- Enable users to give feedback: accept, reject, suggest new concepts/facts

34 HowTo: Text Mining using rOpenSci
- Library that facilitates text mining on publications
  - Search for articles
  - Fetch articles
  - Get links for full text articles (xml, pdf)
  - Extract text from articles / convert formats
  - Collect bits of articles that you actually need
  - Download supplementary materials from papers
- Source: Chamberlain, Scott (2015). fulltext: Full Text of Scholarly Articles Across Many Data Sources. R package version

35 Example: Text Mining using rOpenSci

    # load the library
    library("fulltext")

    # ft_search() - get metadata on a search query
    (res1 <- ft_search(query = 'open science', from = 'arxiv'))
    (out <- ft_get(res1))
    res1$arxiv

    # ft_get() - get full or partial text of articles
    res <- ft_get('cs/ v1', from='arxiv')

    # extract the fulltext
    res2 <- ft_extract(res)
    res2$arxiv$data

    # extract interesting parts from the fulltext
    out %>% chunks("doi")

36 Example: Text Mining using rOpenSci
- fulltext can extract parts of a paper via chunks(): all, front, body, back, title, doi, categories, authors, keywords, abstract, executive_summary, refs, refs_dois, publisher, journal_meta, article_meta, acknowledgments, permissions, history
- Can do PDF extraction, e.g. via GhostScript: (res_gs <- ft_extract(pdf, "gs"))

37 How?
- Aggregate and manage data: repositories, aggregators, datasets,...
- Mining in Academic Resources
  - Information Extraction
  - Topic Modeling
  - Clustering/Classification
  - Linking publications
- Make available data and source code :)

38 Clustering of Academic Resources
- Detect groupings of papers based on content similarity, e.g. along topics
- Transform content (e.g. the abstract of a paper) into a machine-readable representation
- Bag-of-words approach: document treated as a bag of words/terms, represented as a vector
- Document-term matrix: term frequencies across all documents

39 Vector Space Model
- Documents are vectors in term-document space
- Elements of a vector are weights w_ij corresponding to doc i and term j
- Weights: frequencies of terms in docs, or TF-IDF
- Proximity of documents (similarity) is calculated by the cosine of the angle between document vectors
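The vector space model can be sketched in a few lines of standard-library Python. The weighting below uses raw term frequency times log(N / document frequency), which is one common TF-IDF variant chosen for this example; the lecture does not say which variant it uses, and the three "abstracts" are made up.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Turn tokenised documents into sparse TF-IDF vectors (dict: term -> weight)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up, pre-tokenised abstracts for illustration.
abstracts = [
    "topic models uncover thematic structure".split(),
    "topic models for document collections".split(),
    "gene names in biomedical text".split(),
]
vectors = tf_idf_vectors(abstracts)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

The first two abstracts share the terms "topic" and "models" and therefore get a positive similarity, while the third shares no terms with the first and scores zero.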

40 Example: Facilitate exploratory search
- By topic of interest (cluster = topic of interest)
- Setting: social bookmarking dataset, URLs described by tags
- Research questions:
  - What clusters (aka groups of interests) exist?
  - Are they somehow related?
  - How do they evolve over time?

41 Clustering Algorithms
- See the KDD lectures! Here, briefly: K-means algorithm
  1. Select k points as initial centroids
  2. Repeat
  3. Form k clusters by assigning all points to the closest centroid
  4. Recompute the centroid of each cluster
  5. Until centroids don't change
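The five steps above can be sketched directly in Python (3.8 or later, for `math.dist`). The random seeding, the iteration cap, and the handling of empty clusters are assumptions added to make the sketch runnable; the slide does not specify them, and the sample points are made up.

```python
import math
import random


def k_means(points, k, seed=0, max_iterations=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. select k points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):                   # 2. repeat
        clusters = [[] for _ in range(k)]
        for point in points:                          # 3. assign each point to its closest centroid
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):  # 4. recompute the centroid of each cluster
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)             # assumption: an empty cluster keeps its centroid
        if new_centroids == centroids:                # 5. until centroids don't change
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated, made-up groups of 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

For document clustering, the points would be the TF-IDF vectors from the vector space model, and cosine distance is often used in place of Euclidean distance.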

42 Example

44 Classification of Scientific Publications
- Categorize into an established subject-based taxonomy, e.g.
  - Library of Congress
  - UNESCO thesaurus
  - DOAJ subject classification
  - Library of Congress Subject Headings

45 How?
- Aggregate and manage data: repositories, aggregators, datasets,...
- Mining in Academic Resources
  - Information Extraction
  - Topic Modeling
  - Clustering/Classification
  - Linking publications
- Make available data and source code :)

46 Linking Scientific Publications
- Citations (explicitly defined)
- Similarity
  - Statistical similarity: cosine
  - Semantic similarity: more complex, e.g. via topics
- Usage
- Argument support
- Contradiction
- ...

47 Linking via Citations

48 How?
- Aggregate and manage data: repositories, aggregators, datasets,...
- Mining in Academic Resources
  - Information Extraction
  - Clustering/Classification
  - Linking publications
  - Search
- Make available data and source code :)

49 Sharing code
- GitHub
- Bitbucket
- IPython Notebooks
- ...

50 Example: ContentMine
- Idea: facts cannot be copyrighted
- Billions of facts in copyright-protected research articles → make them publicly accessible!

51 Possible questions for ContentMine
- Find references to papers by a given author.
  - This is metadata and therefore factual.
  - It is usually trivial to extract references and authors; more difficult, of course, to disambiguate.
- Find who sponsors research.
  - Extract acknowledgements and perform Named Entity Recognition to detect companies.
  - Link the companies to the papers where they are listed in the acknowledgements.
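The sponsor question could be approached roughly as follows. This is a sketch under strong assumptions, not ContentMine's actual pipeline: the acknowledgements section is located with a simple heading regex, and Named Entity Recognition is replaced by a lookup in a hypothetical list of organisation names. All names and paper texts are made up.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer standing in for a trained Named Entity Recognition model.
KNOWN_ORGANISATIONS = ["Acme Research Fund", "Example Pharma Ltd"]

# Assumed layout: an "Acknowledgements"/"Acknowledgments" heading, optionally
# followed later by a "References" heading.
ACK_SECTION = re.compile(
    r"Acknowledge?ments?\s*(.*?)(?:\n\s*References\b|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def extract_acknowledgements(full_text):
    """Return the text between the acknowledgements heading and the references."""
    match = ACK_SECTION.search(full_text)
    return match.group(1).strip() if match else ""


def find_sponsors(full_text):
    """Return known organisations mentioned in the acknowledgements section."""
    section = extract_acknowledgements(full_text).lower()
    return [org for org in KNOWN_ORGANISATIONS if org.lower() in section]


def link_sponsors_to_papers(papers):
    """Map each sponsor to the identifiers of the papers that acknowledge it."""
    index = defaultdict(list)
    for paper_id, full_text in papers.items():
        for org in find_sponsors(full_text):
            index[org].append(paper_id)
    return dict(index)


# Made-up paper texts for illustration only.
papers = {
    "paper-1": "Introduction\nWe study X.\nAcknowledgements\nFunded by Acme Research Fund.\nReferences\n[1] ...",
    "paper-2": "Results\nAcme Research Fund is discussed here.\nAcknowledgments\nWe thank Example Pharma Ltd.\nReferences\n[1] ...",
    "paper-3": "Body text without any funding statement.",
}
sponsor_index = link_sponsors_to_papers(papers)
print(sponsor_index)
```

Note that "paper-2" mentions Acme Research Fund only in its body text, so it is not linked to that sponsor; restricting the search to the acknowledgements section is what makes the link meaningful.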

52 Machine Extraction of scientific facts
1. Crawl scientific literature
2. Scrape each scientific article
3. Extract facts
4. Index
5. Republish (Wikidata)

53 Example: retrieve metadata for specific article

54 Content Mining Problems
- Secondary publishers create walled gardens, e.g. the ResearchGate portal
- Publishers' contracts ban content mining
- Publishers may cut off universities who mine
- Publishers lobby governments to require licences for content mining
- UK → the right to read is the right to mine

55 Summary
- Aggregators/repos for scientific publications
- Mining content/data in publications
  - Information / fact extraction
  - Topic modeling
  - Clustering, e.g. exploratory analysis of large datasets: find groups of interest expressed by user-generated tags and their relations
- ContentMine as example

56 Questions? See you next week!
