CERN Open Data and Data Analysis Knowledge Preservation

Similar documents
Towards Reproducible Research Data Analyses in LHC Particle Physics

Open Data and Data Analysis Preservation Services for LHC Experiments

Open access to high-level data and analysis tools in the CMS experiment at the LHC

Helix Nebula The Science Cloud

Open data and scientific reproducibility

New strategies of the LHC experiments to meet the computing requirements of the HL-LHC era

Big Data Analytics and the LHC

Storage on the Lunatic Fringe. Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium

CERN Services for Long Term Data Preservation

Data Analysis in ATLAS. Graeme Stewart with thanks to Attila Krasznahorkay and Johannes Elmsheuser

The CMS Computing Model

Tier-2 structure in Poland. R. Gokieli Institute for Nuclear Studies, Warsaw M. Witek Institute of Nuclear Physics, Cracow

Physics Computing at CERN. Helge Meinhard CERN, IT Department OpenLab Student Lecture 27 July 2010

From raw data to new fundamental particles: The data management lifecycle at the Large Hadron Collider

Long Term Data Preservation for CDF at INFN-CNAF

CERN s Business Computing

Preparing for High-Luminosity LHC. Bob Jones CERN Bob.Jones <at> cern.ch

1. Introduction. Outline

Capturing and Analyzing User Behavior in Large Digital Libraries

Physics Computing at CERN. Helge Meinhard CERN, IT Department OpenLab Student Lecture 21 July 2011

Building a Digital Library Software

Horizon Societies of Symbiotic Robot-Plant Bio-Hybrids as Social Architectural Artifacts. Deliverable D4.1

150 million sensors deliver data. 40 million times per second

Persistent Identifier the data publishing perspective. Sünje Dallmeier-Tiessen, CERN 1

The creation of a Tier-1 Data Center for the ALICE experiment in the UNAM. Lukas Nellen ICN-UNAM

Perspectives on Open Data in Science Open Data in Science: Challenges & Opportunities for Europe

Storage and Storage Access

ALICE ANALYSIS PRESERVATION. Mihaela Gheata DASPOS/DPHEP7 workshop

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

Computing: new records broken! Data in Tier-0 vs time

Data services for LHC computing

Summary of the LHC Computing Review

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

Andrea Sciabà CERN, Switzerland

Insight: that s for NSA Decision making: that s for Google, Facebook. so they find the best way to push out adds and products

Grid Computing Activities at KIT

Visita delegazione ditte italiane

Design of the protodune raw data management infrastructure

Conference The Data Challenges of the LHC. Reda Tafirout, TRIUMF

Data Curation Profile Movement of Proteins

CERN European Organization for Nuclear Research, 1211 Geneva, CH

Dataverse and DataTags

Challenges and Evolution of the LHC Production Grid. April 13, 2011 Ian Fisk

Storage and I/O requirements of the LHC experiments

IT Challenges and Initiatives in Scientific Research

HEP Grid Activities in China

The evolving role of Tier2s in ATLAS with the new Computing and Data Distribution model

File Access Optimization with the Lustre Filesystem at Florida CMS T2

September Development of favorite collections & visualizing user search queries in CERN Document Server (CDS)

Invenio: A Modern Digital Library for Grey Literature

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

Big Data Analytics Tools. Applied to ATLAS Event Data

The LCG 3D Project. Maria Girone, CERN. The 23rd Open Grid Forum - OGF23 4th June 2008, Barcelona. CERN IT Department CH-1211 Genève 23 Switzerland

The Materials Data Facility

Storage Resource Sharing with CASTOR.

CernVM-FS beyond LHC computing

Belle & Belle II. Takanori Hara (KEK) 9 June, 2015 DPHEP Collaboration CERN

CMS - HLT Configuration Management System

Virtualizing a Batch. University Grid Center

Digital The Harold B. Lee Library

IEPSAS-Kosice: experiences in running LCG site

RADU POPESCU IMPROVING THE WRITE SCALABILITY OF THE CERNVM FILE SYSTEM WITH ERLANG/OTP

Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns

Using GitHub to open up your software project

NanoAODs Summer student report

Europe and its Open Science Cloud: the Italian perspective. Luciano Gaido Plan-E meeting, Poznan, April

LHCb Computing Strategy

WLCG Transfers Dashboard: a Unified Monitoring Tool for Heterogeneous Data Transfers.

Computing at Belle II

High-Energy Physics Data-Storage Challenges

Introduction to TIND. Guillaume Lastecoueres

Using S3 cloud storage with ROOT and CvmFS

Worldwide Production Distributed Data Management at the LHC. Brian Bockelman MSST 2010, 4 May 2010

Figure 1: cstcdie Grid Site architecture

Netherlands Institute for Radio Astronomy. May 18th, 2009 Hanno Holties

Building a Real-time Notification System

PoS(EGICF12-EMITC2)106

IJDC General Article

Development of DKB ETL module in case of data conversion

Considerations for a grid-based Physics Analysis Facility. Dietrich Liko

The LHC Computing Grid

Improving Packet Processing Performance of a Memory- Bounded Application

The LHC Computing Grid

RADAR A Repository for Long Tail Data

Data publication and discovery with Globus

UW-ATLAS Experiences with Condor

Introduction to Git and GitHub. Tools for collaboratively managing your source code.

Grid Computing: dealing with GB/s dataflows

arxiv: v1 [cs.dc] 12 May 2017

VI-SEEM Data Repository. Presented by: Panayiotis Charalambous

Evolution of Database Replication Technologies for WLCG

Rethinking the Data Model: The Drillbit Proof-of- Concept Library

CouchDB-based system for data management in a Grid environment Implementation and Experience

The CMS data quality monitoring software: experience and future prospects

Data Curation Profile Human Genomics


CERN Tape Archive (CTA) :

The High-Level Dataset-based Data Transfer System in BESDIRAC

Stephen J. Gowdy (CERN) 12 th September 2012 XLDB Conference FINDING THE HIGGS IN THE HAYSTACK(S)

Tools for Data Management. Research Data Management : Session 3 9 th June 2015

Transcription:

CERN Open Data and Data Analysis Knowledge Preservation Tibor Šimko Digital Library 2015 21 23 April 2015 Jasná, Slovakia @tiborsimko @inveniosoftware 1 / 26

1 Invenio @tiborsimko @inveniosoftware 2 / 26

What is Invenio? digital library and document repository software mature platform: first public release in 2002 rich data: articles, books, notes, photos, videos, software, data some Invenio-based services at CERN: co-developed by an international collaboration participating in EU FP7/H2020 projects @tiborsimko @inveniosoftware 3 / 26

Data behind plots @tiborsimko @inveniosoftware 4 / 26

Code you can cite automated GitHub Zenodo bridge push new release to GitHub automatic archival on Zenodo software preserved, minted with a DOI, made citable https://guides.github.com/activities/citable-code @tiborsimko @inveniosoftware 5 / 26

Code Data Paper link data (DATAVERSE) to code (ZENODO) to papers (INSPIRE) example: hep-ex/0011057, arxiv:1401.0080 @tiborsimko @inveniosoftware 6 / 26

2 Data Analysis @tiborsimko @inveniosoftware 7 / 26

CERN LHC Experiments @tiborsimko @inveniosoftware 8 / 26

Large Scale Solutions Primary site: 100k cores (10k nodes), 100k disks (50 PB), 21k NIC Grid: 13 Tier-1 sites, 155 Tier-2 sites, 10 Gbps optical fibre @tiborsimko @inveniosoftware 9 / 26

Preserve an Analysis? @tiborsimko @inveniosoftware 10 / 26

Big Data? data scale knowledge raw GB / sec calibration, conditioning reconstructed PB / year filtering, selection reduced TB / analysis user code, physics objects publication GB / analysis correlation, data behind plots... input filtering output... code Analysis Train @tiborsimko @inveniosoftware 11 / 26

Knowledge Capture @tiborsimko @inveniosoftware 12 / 26

System Architecture TWiki SVN GitHub SharePoint Analysis analysis-preservation.cern.ch file storage abstraction layer CADI CDS INSPIRE... AFS Box Ceph CASTOR Drive EOS S3 @tiborsimko @inveniosoftware 13 / 26

Knowledge Representation record format: extended MARC21 technical metadata: beyond bytes e.g. 256 computer file characteristics $a characteristics $e events $t text $b bytes $f files... knowledge metadata: semantics e.g. 505 formatted contents note CSV column information $t title $g miscellaneous internal format: JSON MARC21 JSON schema model EAD @tiborsimko @inveniosoftware 14 / 26

3 Open Data @tiborsimko @inveniosoftware 15 / 26

Opening Up Data policies: restricted embargo period open [...] Data with high abstraction, such as AOD, will be conditionally made publicly available after an embargo period of 5 years after publication for 10% of the data and 10 years for 100% of the data [...] ALICE Data Policy Challenges: audience: data miners citizen scientists high-school students general public computing: exploring in the browser specialised VMs @tiborsimko @inveniosoftware 16 / 26

CERN Open Data Portal @tiborsimko @inveniosoftware 17 / 26

Education @tiborsimko @inveniosoftware 18 / 26

Visualise detector events @tiborsimko @inveniosoftware 19 / 26

Basic histogramming @tiborsimko @inveniosoftware 20 / 26

Research @tiborsimko @inveniosoftware 21 / 26

CMS Primary Datasets 2010 @tiborsimko @inveniosoftware 22 / 26

CernVM Virtual Machine @tiborsimko @inveniosoftware 23 / 26

Open Data? Who cares? 82,000 distinct users visited the site 21,000 distinct users viewed data records 16,000 distinct users used event display 3,000 distinct users used histogramming @tiborsimko @inveniosoftware 24 / 26

7 Conclusions @tiborsimko @inveniosoftware 25 / 26

CERN (Open) Data Capturing and disseminating knowledge of data, code, platform, processes to enable future data reuse (Open) Data Analysis Preservation Framework http://opendata.cern.ch/ CERN IT J. Cowton, P. Fokianos, J. Kunčar, T. Smith, T. Šimko CERN Library S. Dallmeier-Tiessen, P. Herterich, L. Rueda ALICE M. Gheata, C. Grigoras ATLAS K. Cranmer, L. Heinrich, D. Rousseau, F. Socher CMS A. Calderon, A. Huffman, K. Lassila-Perini, T. McCauley, A. Rao, A. Rodriguez Marrero LHCb S. Amerio, B. Couturier, A. Trisovic CERN CernVM J. Blomer CERN EOS L. Mascetti DASPOS M. Hildreth DPHEP F. Berghaus @tiborsimko @inveniosoftware 26 / 26