Science 2.0 VU: Big Science, e-science and E-Infrastructures + Bibliometric Network Analysis

Elisabeth Lex, KTI, TU Graz, WS 2015/16, www.tugraz.at

Agenda
- Repetition from last time: altmetrics / altmetrics in practice
- Big Data and Science
- E-Science
- E-Infrastructures
- Bibliometric Network Analysis
- Your Assignment!

Altmetrics (repetition)
- Altmetrics is the creation and study of new metrics based on the Social Web for analyzing and informing scholarship (Altmetrics Manifesto, http://altmetrics.org/about)
- Aggregated from many sources (e.g. Twitter, Mendeley, GitHub, SlideShare, ...)
- Article Level Metrics (ALM): multidimensional suite of transparent and established metrics at article level

Examples for Altmetrics sources (repetition)
- Usage: views, downloads, ...
- Captures: bookmarks, readers, ...
- Mentions: blog posts, news stories, Wikipedia articles, comments, reviews
- Social media: tweets, Google+, Facebook likes, shares, ratings
- Citations: Web of Science, Scopus, Google Scholar, ...

Examples: Altmetric.com
Source: http://www.altmetric.com/details.php?domain=www.altmetric.com&citation_id=843656

Lessons learned (repetition)
- Alternative ways to assess the impact of various scientific outputs
- No common understanding of altmetrics yet
  - What do they really express?
  - Are they useful, and for which part of the research process?
- Not necessarily better metrics
  - E.g. gamification
- Can help to get an overview of a research field
  - Visualizations based on altmetrics

Modern Science: What has changed?
- 150 years later: searching for new particles like the Higgs boson with the Large Hadron Collider
- Built in collaboration with over 10,000 scientists and engineers from over 100 countries, hundreds of universities and laboratories
- In a tunnel 27 km in circumference, up to 175 m deep, near Geneva

Motivation
- Internet and science disciplines (e.g. physical sciences, biological sciences, medicine, and engineering) generate large and complex datasets (Big Data)
  - These require more advanced database and architectural support
- A new kind of research methodology has emerged: the fourth paradigm of scientific exploration (Hey, 2007), based on statistical exploration of large amounts of data
http://www.ksi.mff.cuni.cz/astropara/

Data intensive scientific discovery
http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf

Example: Big Data in Science - European Exascale Projects
- Exascale computing: computers capable of at least one exaflops (10^18 floating point operations per second)
  → Not yet achieved; currently 10^15 (petascale)
http://exascale-projects.eu

Publications as Big Data
- Cross-journal recommendation based on click streams [Bollen et al., 2009]

e-science
- Large scale science (since 1999)
- Data-driven discovery
- Focus on computationally intensive science and how to tackle it using highly distributed environments in a collaborative manner
- Powerful computers: supercomputers, High Performance Computing (HPC), grid, distributed computing
- Powerful research infrastructures: e-infrastructures, grids, clouds
http://www.anandtech.com/show/6421/inside-the-titan-supercomputer-299k-amd-x86-cores-and-186k-nvidia-gpu-cores/3

Supercomputers
- Large, expensive systems, usually housed in a single room, in which multiple processors are connected by a fast local network
- Suited for highly complex, real-time applications and simulation
- Pros: data can move between processors rapidly → all processors can work together on the same tasks
- Cons: expensive to build and maintain; do not scale well, e.g. adding more processors is challenging
http://www.top500.org/lists/2014/06/
http://www.wikihow.com/build-a-supercomputer

Distributed Computing
- Systems in which processors are not necessarily located in close proximity to one another, and can even be housed on different continents, but are connected via the Internet or other networks
- Pros: much less expensive than supercomputers
- Cons: lower speed than supercomputers

Example: Hadoop
- Ecosystem of tools for processing big data
- Simple computational model: MapReduce, a two-stage method for processing large amounts of data
  - Design an algorithm that operates on one chunk of the data in two stages (a Map and a Reduce stage)
  - MapReduce automatically distributes that algorithm to the cluster → hides complexity in the framework
http://hadoop.apache.org
http://architects.dzone.com/articles/how-hadoop-mapreduce-works
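
The two-stage model can be illustrated without a cluster. The following is a minimal single-machine sketch in plain Python, not Hadoop's actual API: the function names (`map_stage`, `shuffle`, `reduce_stage`, `run_mapreduce`) and the word-count task are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_stage(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key (in Hadoop the framework does this)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(key, values):
    """Reduce: combine all values that were emitted for one key."""
    return key, sum(values)


def run_mapreduce(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # here the chunks are simply processed one after another.
    mapped = chain.from_iterable(map_stage(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_stage(key, values) for key, values in grouped.items())


chunks = ["big data in science", "big science and e-science"]
word_counts = run_mapreduce(chunks)
print(word_counts)
```

Only `map_stage` and `reduce_stage` are problem-specific; distribution, grouping and fault tolerance are what a framework such as Hadoop would take over.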

Hadoop in e-science: Example: Astronomical Image Processing
- Large telescopes survey the sky over a prolonged period of time
- Large Synoptic Survey Telescope (LSST)
  - under construction
  - will capture 1/2 of the sky over 10 years
  - 30 TB of data every night
  - ~60 PB in 10 years
- Astronomers pick out faint objects for study by capturing multiple images of the same area and combining them (coaddition)
- Challenge: how to organize and process all the resulting data
http://www.lsst.org/lsst/
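
Why combining images helps can be shown with a toy example. The sketch below assumes the simplest possible form of coaddition, a pixel-wise mean of already aligned exposures; real pipelines also register, calibrate and weight the images. All names and numbers (`coadd`, `noisy_exposure`, the 3x3 image, the noise level) are hypothetical.

```python
import random


def coadd(images):
    """Pixel-wise mean of several equally sized, aligned images (lists of rows)."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]


def noisy_exposure(truth, sigma, rng):
    """Simulate one exposure: true sky brightness plus Gaussian noise."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in truth]


rng = random.Random(42)
truth = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],   # one faint source in the centre pixel
         [0.0, 0.0, 0.0]]

# In a single exposure the noise (sigma = 2) is larger than the source (1).
exposures = [noisy_exposure(truth, sigma=2.0, rng=rng) for _ in range(400)]
stacked = coadd(exposures)

print("centre pixel after stacking:", round(stacked[1][1], 2))
print("corner pixel after stacking:", round(stacked[0][0], 2))
```

Averaging N exposures reduces independent noise by roughly a factor of sqrt(N), so the faint source becomes distinguishable from the background. Each output pixel depends only on the same pixel in the inputs, which is why the task splits well across a cluster.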

Using Hadoop to help with image coaddition
http://escience.washington.edu/get-help-now/astronomical-image-processing-hadoop

Virtual Science Environments
- Not only HPC: sharing of knowledge and data is also becoming a requirement for scientific discovery
  - Provide useful mechanisms to facilitate this sharing
  - Preserve and organize research data
→ Virtual Science Environments: virtual environments in which researchers work together through ubiquitous, trusted and easy access to services for scientific data, computing and networking, enabled by e-infrastructures

Defining e-infrastructures
European e-Infrastructure Reflection Group (e-IRG): The term e-infrastructure refers to this new research environment in which all researchers, whether working in the context of their home institutions or in national or multinational scientific initiatives, have shared access to unique or distributed scientific facilities (including data, instruments, computing and communications), regardless of their type and location in the world.
http://www.e-irg.eu/about-e-irg.html

e-infrastructures - Goals
- Opening access to knowledge through reliable, distributed and participatory data e-infrastructures
- Cost-effective infrastructures for preservation and curation for re-use of data
- Persistent availability of information, and linking people and data through flexible and robust digital identifiers
- Interoperability for consistency of approaches on global data exchange (e.g. standards)
- Enabling trust through authentication and authorisation mechanisms
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/framework-for-action-in-h2020_en.pdf

Example: e-infrastructure OpenAIRE
- The European Open Access Data Infrastructure for Scholarly and Scientific Communication
- Functionality:
  - Harvesting and storing of information about publications from various repositories (OAI-PMH)
  - Enables searching for publications and related information (e.g. funding, ...)
  - Provides a list of OA repositories that can be used to store publications
  - Orphan repository
  - Shows statistics of stored data
https://www.openaire.eu
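
Harvesting via OAI-PMH comes down to HTTP requests with a `verb` parameter and XML responses. The sketch below builds a `ListRecords` request and parses a response using only the standard library. The repository URL and the sample record are made up, no network call is made, and this is not OpenAIRE's own harvester.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since then
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return (identifier, title) for every record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


# Hypothetical repository endpoint, for illustration only.
url = list_records_url("https://repository.example.org/oai", from_date="2015-01-01")

# Hand-written sample of what a (much longer) response could look like.
sample_response = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical open access paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

records = extract_titles(sample_response)
print(url)
print(records)
```

A real harvester would also follow the `resumptionToken` that repositories return when a result list is split over several responses.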

OpenAIRE - Applications

Example: e-infrastructures Austria 1/2
http://www.e-infrastructures.at

Example: e-infrastructures Austria 2/2

Take away message
- Big Science / e-science: data-driven, large scale science
- Supercomputers and distributed computing
- Virtual research environments
- e-infrastructures

Bibliometric Network Analysis

Bibliometrics
- Quantitative study of all kinds of bibliographic data
  - Patterns of authorship, publications, citations
  - E.g. citation analysis of research outputs/publications
- Assess research impact of individuals, groups, institutions
- Measuring by author (h-index), article (PLOS), or publication (Journal Impact Factor)
- Measure of output, not quality (quantitative, not qualitative!)
- Other measures could include funding received, number of patents, awards granted, or qualitative measures such as peer review
17/04/2015, Maynooth University
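
As an example of an author-level measure: the h-index is the largest number h such that the author has h publications with at least h citations each. A minimal sketch follows; the citation counts are made up.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical citation counts of one author's five publications.
print(h_index([10, 8, 5, 4, 3]))   # prints 4
print(h_index([25, 8, 5, 3, 3]))   # prints 3
```

The second author has one very highly cited paper but a lower h-index, which shows that the measure rewards sustained output rather than single hits.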

Why use Bibliometrics?
- Measure impact of research/publishing activity
  - CV, promotion, tenure, grants, feedback to funding bodies/industry/public
- Showcase individual/group/institutional research
  - Identify areas of research strengths/weaknesses
  - Inform research priorities
- Identify highest-impact or top-performing journals in a subject area
  - Where to publish, learning about a particular subject area, identifying emerging areas of research
- Identify the top researchers in a subject area
  - Collaborations/competitors
  - Recruitment
  - Learning about a subject area

Bibliometric Networks
- Represent scientific literature in the form of networks, based on bibliographic data
- Help to provide an overview of the structure of scientific literature, e.g. in a domain or with respect to a topic
- Applications
  - Identify main research areas within a field
  - Analyze relationships between research areas

Bibliometric Networks
- Co-authorship networks
- Citation networks
- Co-citation networks
- Co-occurrence maps
  - Keywords, extracted topics, ...

Co-authorship Networks
- Scientific collaboration network
- Nodes are authors of publications
- Link between authors if they co-authored a publication
- Collaboration networks are scale-free
- Co-authorship networks are affiliation networks
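
A co-authorship network can be derived directly from author lists: every publication links all pairs of its authors. This amounts to projecting the author-publication affiliation network onto the authors. The sketch below uses made-up names and plain Python; `coauthorship_edges` and `neighbors` are hypothetical helper names.

```python
from collections import Counter, defaultdict
from itertools import combinations


def coauthorship_edges(publications):
    """Count, for every pair of authors, how many publications they share."""
    edges = Counter()
    for authors in publications:
        # sorted(set(...)) gives each unordered pair one canonical form
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges):
    """Turn the weighted edge list into {author: set of co-authors}."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical author lists of three publications.
publications = [
    ["Ada", "Ben", "Cleo"],
    ["Ada", "Ben"],
    ["Cleo", "Dev"],
]
edges = coauthorship_edges(publications)
coauthors = neighbors(edges)

print(edges[("Ada", "Ben")])       # prints 2
print(sorted(coauthors["Cleo"]))   # prints ['Ada', 'Ben', 'Dev']
```

The edge weight (number of joint publications) can serve as a simple indicator of collaboration strength.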

Co-authorship networks: Example

Citation Networks
- Nodes are publications
- Directed link from one publication to another if the first cites the second
- Reveals how often articles were cited

Citation Networks
http://eduinf.eu/2012/03/15/co-citation-analysis-of-the-topic-social-network-analysis/

Co-Citation Networks
- Nodes are publications
  - Links between nodes if two publications were cited together in a paper
  - How often two articles were cited together by some third article
- OR: nodes are authors
  - Links between nodes if authors were cited together
  - To identify clusters of authors
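
Co-citation counts can be computed from reference lists: two publications are co-cited once for every paper whose reference list contains both. A minimal sketch with made-up paper identifiers follows; `cocitation_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Map each pair of cited publications to the number of papers citing both.

    `reference_lists` maps a citing paper to the publications it cites.
    """
    counts = Counter()
    for references in reference_lists.values():
        for a, b in combinations(sorted(set(references)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical citing papers P1..P3 and cited publications A..C.
reference_lists = {
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
}
cocitations = cocitation_counts(reference_lists)
print(cocitations)
```

Here A and B are co-cited twice (by P1 and P2), as are B and C (by P1 and P3), while A and C are co-cited once. Replacing publication identifiers by author names yields the author co-citation variant.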

Author co-citation network of 15 history & philosophy of science journals. Two authors are connected if they are cited together in some article, and connected more strongly if they are cited together frequently.
http://www.scottbot.net/hial/?p=38272

Mining in Scientific Networks
- Find influential researchers
- Find influential papers
- Investigate patterns of scientific collaboration
- ...

Centrality Measures
- Degree centrality equals the number of links (connections) a node has
  → In citation networks, papers with high in-degree centrality have a lot of citations
  → Widely used metric for measuring the scientific impact of a paper
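
For a citation network, in-degree is simply the citation count of a paper within the dataset. The sketch below assumes the network is given as a mapping from each paper to the papers it cites. The identifiers are made up, and dividing by n - 1 is one common normalisation, not the only one.

```python
def in_degree(citations):
    """Count incoming citations; `citations` maps a paper to the papers it cites."""
    counts = {paper: 0 for paper in citations}
    for cited_papers in citations.values():
        for cited in set(cited_papers):
            counts[cited] = counts.get(cited, 0) + 1
    return counts


def in_degree_centrality(citations):
    """In-degree divided by the maximum possible in-degree (n - 1)."""
    counts = in_degree(citations)
    n = len(counts)
    if n <= 1:
        return {paper: 0.0 for paper in counts}
    return {paper: count / (n - 1) for paper, count in counts.items()}


# Hypothetical citation network: P4 cites P1 and P3, and so on.
citations = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P1", "P2"],
    "P4": ["P1", "P3"],
}
citation_counts = in_degree(citations)
centrality = in_degree_centrality(citations)
print(citation_counts)
print(centrality)
```

P1 is cited by all three other papers and therefore reaches the maximum normalised value of 1.0.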

Centrality Measures
- Eigenvector centrality: extension of degree centrality
- Degree centrality awards one centrality point for every neighbor a node has
- However, not all neighbors are equally important
- In many cases, the importance of a node is increased by having connections to other nodes that are themselves important
- Eigenvector centrality: not only the number of neighbors is important, but also the importance of the neighbors
- Eigenvector centrality gives each node a score proportional to the sum of the scores of its neighbors
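
The "score proportional to the sum of the neighbors' scores" definition can be solved by repeated updating (power iteration). The sketch below is one possible implementation for a small undirected graph, not a reference implementation. The graph and node names are made up, and adding a node's own score in each step is a common trick that keeps the same solution but avoids oscillation.

```python
import math


def eigenvector_centrality(adjacency, iterations=1000, tol=1e-10):
    """Power iteration on an undirected graph given as {node: set of neighbors}."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # New score = own score + sum of the neighbors' current scores.
        new = {node: scores[node] + sum(scores[nb] for nb in nbs)
               for node, nbs in adjacency.items()}
        # Rescale so that the scores keep unit length.
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[node] - scores[node]) for node in adjacency) < tol:
            return new
        scores = new
    return scores


# Hypothetical network: H, A, B form a triangle; a chain H - C - X - Y hangs off H.
adjacency = {
    "H": {"A", "B", "C"},
    "A": {"H", "B"},
    "B": {"H", "A"},
    "C": {"H", "X"},
    "X": {"C", "Y"},
    "Y": {"X"},
}
scores = eigenvector_centrality(adjacency)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

C and X both have two neighbors and therefore the same degree, but C scores higher because one of its neighbors is the well-connected node H.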

Centrality Measures in Python
https://networkx.github.io/documentation/latest/reference/algorithms.centrality.html
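
A short usage sketch with NetworkX, assuming the package is installed. The graphs and node names are made up; `in_degree_centrality`, `degree_centrality` and `eigenvector_centrality` are functions from the NetworkX centrality module linked above.

```python
import networkx as nx

# Hypothetical citation network: an edge (u, v) means "u cites v".
citation_graph = nx.DiGraph([("P2", "P1"), ("P3", "P1"), ("P3", "P2"), ("P4", "P1")])
in_degree = nx.in_degree_centrality(citation_graph)

# Hypothetical co-authorship network (undirected).
coauthor_graph = nx.Graph([("Ada", "Ben"), ("Ada", "Cleo"),
                           ("Ben", "Cleo"), ("Cleo", "Dev")])
degree = nx.degree_centrality(coauthor_graph)
eigenvector = nx.eigenvector_centrality(coauthor_graph, max_iter=1000)

print("most cited paper:", max(in_degree, key=in_degree.get))
print("degree centrality:", degree)
print("eigenvector centrality:", eigenvector)
```

NetworkX normalises degree-based centralities by n - 1, so a node connected to all other nodes gets the value 1.0.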

Summary
- Big Science
- E-Science
- E-Infrastructure
- Bibliometrics
- Bibliometric Network Analysis

Thank you for your attention!