Application of machine learning and big data technologies in OpenAIRE system

Similar documents
Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

GDPR Data Discovery and Reporting

Innovatus Technologies

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

Big Data Hadoop Stack

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

OpenAIRE From Pilot to Service The Open Knowledge Infrastructure for Europe

OpenAIRE From Pilot to Service

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data

Processing of big data with Apache Spark

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

OpenAIRE Open Knowledge Infrastructure for Europe

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Big Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Preserving Websites Of Research & Development Projects

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Sempala. Interactive SPARQL Query Processing on Hadoop

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti

Introduction to Hadoop and MapReduce

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Oracle Big Data Connectors

New Features and Enhancements in Big Data Management 10.2

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

microsoft

Big Data Architect.

Big Data Hadoop Course Content

Certified Big Data and Hadoop Course Curriculum

Natalia Manola University of Athens ATHENA Research and Innovation Center. OpenAIRE. An Open Knowledge & Research Information Infrastructure

Exam Questions

Hadoop. Introduction / Overview

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Oracle Big Data Fundamentals Ed 1

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

Strategies for Incremental Updates on Hive

Big Data with Hadoop Ecosystem

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

Oracle Data Integrator 12c: Integration and Administration

OpenAIRE. Fostering the social and technical links that enable Open Science in Europe and beyond

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Social Network Analytics on Cray Urika-XA

Enabling Secure Hadoop Environments

Smart Data Catalog DATASHEET

The Reality of Qlik and Big Data. Chris Larsen Q3 2016

Modern Data Warehouse The New Approach to Azure BI

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

A Fast and High Throughput SQL Query System for Big Data

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

Open Access to Publications in H2020

Oracle Data Integrator 12c: Integration and Administration

Chase Wu New Jersey Institute of Technology

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

Big Data Analytics using Apache Hadoop and Spark with Scala

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

IBM BigInsights on Cloud

Microsoft Big Data and Hadoop

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Introduction to BigData, Hadoop:-

Tuning Enterprise Information Catalog Performance

Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures. Hiroshi Yamaguchi & Hiroyuki Adachi

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Distributed Computation Models

MapR Enterprise Hadoop

OpenINTEL an infrastructure for long-term, large-scale and high-performance active DNS measurements. Design and Analysis of Communication Systems

Cisco and Cloudera Deliver WorldClass Solutions for Powering the Enterprise Data Hub alerts, etc. Organizations need the right technology and infrastr

Hadoop Development Introduction

Configuring and Deploying Hadoop Cluster Deployment Templates

Oracle Big Data Fundamentals Ed 2

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

The OpenAIREplus Project

@Pentaho #BigDataWebSeries

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

HDP Security Overview

HDP Security Overview

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

HDInsight > Hadoop. October 12, 2017

Deliverable Final Data Management Plan

Large Scale Processing with Hadoop

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Tuning Intelligent Data Lake Performance

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Microsoft Exam

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

A Web Service for Scholarly Big Data Information Extraction

Certified Big Data Hadoop and Spark Scala Course Curriculum

Horizon Societies of Symbiotic Robot-Plant Bio-Hybrids as Social Architectural Artifacts. Deliverable D4.1

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Big Data Development HADOOP Training - Workshop. FEB 12 to (5 days) 9 am to 5 pm HOTEL DUBAI GRAND DUBAI

Introduction to Big-Data

GraphBuilder: A Scalable Graph ETL Framework

Distributed File Systems II

Transcription:

Application of machine learning and big data technologies in OpenAIRE system Warsztaty Orange z cyklu Centrum Badawczo Rozwojowe zaprasza Mateusz Kobos, ICM, Univeristy of Warsaw Warszawa, 2017-05-10

OpenAIRE system What does the system do? Gathers various scholarly information (publications, datasets, persons, fundings, and organizations) from publicly available sources (repositories, Current Research Information Systems). Presents the aggregated and de-duplicated view to the user through www.openaire.eu portal. Who are its users? Scientists the portal provides tools for dissemination and discovery of research results. Funding bodies and organizations the portal provides tools for measuring and refining funding investments in terms of their research impact. Who develops it? A consortium of European research organizations and founded by European Union.

Screenshot of the OpenAIRE portal 12 10 8 Column 1 Column 2 Column 3 6 4 2 0 Row 1 Row 2 Row 3 Row 4

Data processed by the OpenAIRE system Where does the data come from? Data providers that OpenAIRE continuously collects information from: ~800 They have to provide OpenAIRE-compatible API Types of repositories: institutional, thematic, data, journals, aggregators, etc. CORDA: a system with information about EU projects Others What data is stored in the system? Publication metadata: 17 millions Authors: 16 millions Full-text documents: 4 millions Dataset metadata: 3 millions Projects: 700 thousands Organizations: 60 thousands

Information Inference Service (IIS) Information Inference Service (IIS) is a part of the OpenAIRE system. It is developed by ICM in cooperation with partners from other countries from the consortium. Its goal is to do data/text mining of the gathered data available in OpenAIRE system. It s open source. The code is available at https://github.com/openaire/iis The development of IIS started in 2012.

IIS as a part of the OpenAIRE system

What functionality does IIS provide? IIS consists of a few data processing workflows. The workflows contain various data mining modules that work mostly on the content of documents. Their functionalities: Extract: references to datasets, references to projects, references to research communities, software links, protein database references, citation links, Infer metadata from the content of the PDF document (uses CERMINE) Classify documents Find similar documents Match citation links extracted from document content with actual documents Match author affiliations extracted from document content with actual organizations

How does IIS work? IIS is a Hadoop cluster application. Based on Apache Hadoop technologies: Oozie, MapReduce, Spark, Pig, Avro, Hive (for analytics). Using IIS is like calling a function with subsequent stages: The client starts IIS passing it parameters that define: IIS execution: what modules will be run, what data sets they will be run on, parameters of the modules. imports required data, processes the data using selected modules (this takes a few hours), exports produced data. IIS shuts down. IIS is stateless no information is kept between subsequent runs of IIS (apart from cache used internally)

A few numbers about IIS Hadoop cluster specification: Distribution: Cloudera CDH5 (v.5.9.0) 16 slave nodes, each one with identical specification. This sums up to: CPU: 384 cores, 768 threads RAM: 2048GB HDD: 384TB (HDFS) Duration of the longest processing workflow: 15 hours

Thank you for your attention!