Introduction to Big Data


Introduction to Big Data. Pradeep Sivakumar, Sr. HPC Specialist; David King, Sr. HPC Systems Engineer, Research Computing Services; Cunera Buys, E-Science Librarian, NU Library

Table of Contents: What is Big Data? Characteristics (the 3 Vs). Why is it relevant? Examples. Challenges: moving, processing, management.

What is Big Data? "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Doug Laney. 2001. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group (later Gartner)

Big Data Characteristics (the 3 Vs): Data Volume (PB scale), Data Velocity, Data Variety.

Why is it relevant? The model of generating/consuming data has changed. Progress and innovation are no longer hindered by the ability to collect data; the ability to manage, analyze, visualize, share, and discover knowledge from the collected data in a timely and scalable fashion is critical. The Expanding Digital Universe, IDC, 2007

Examples: Sloan Digital Sky Survey (1992-2008), "The Cosmic Genome Project". Conducted at Apache Point Observatory, New Mexico. Goal: map one quarter of the sky and create a systematic, three-dimensional picture of the universe. Two surveys in one: a photometric survey in 5 bands and a spectroscopic redshift survey. Data is public: 2.5 terapixels of images; 10 TB of raw data => 120 TB processed; 0.5 TB catalogs => 35 TB in the end. *Images credit Fermilab Visual Media Services

Human Genome Project (1990-2003). Sequenced the human genome in order to track down the genes responsible for inherited diseases. The cost of sequencing a human genome has gone down from $40 million (2003) to about $5,000 (2013). Spurred innovation of new visualization methods and more robust analysis tools. Genomics produces huge volumes of data: the human genome has about 3 billion base pairs and 20,000-25,000 genes, equivalent to roughly 100 GB of sequencing data. *Images credit U.S. Department of Energy Genomics Science Program
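A back-of-the-envelope check on the 100 GB figure (my arithmetic, not from the slides; the ~30x read coverage is a typical assumption): the finished sequence itself is only a few GB as text, and the bulk comes from the redundant, quality-annotated reads a sequencer actually emits:

```latex
\underbrace{3\times10^{9}}_{\text{bases}}\times
\underbrace{30}_{\text{coverage}}\times
\underbrace{2\ \text{B}}_{\text{base + quality score}}
\approx 1.8\times10^{11}\ \text{B}\approx\mathcal{O}(100\ \text{GB})
```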

Large Hadron Collider at CERN (*images credit CERN). Goal: smash protons moving at 99.999999% of the speed of light into each other; the proton beams collide at four points, home to the ALICE, ATLAS, CMS, and LHCb detectors. In July 2012, the LHC experiments recorded the first observations of the Higgs boson particle (CMS, ATLAS). The LHC produces 15 PB annually, and the data needs to be accessed and analyzed by thousands of scientists internationally (2,000 physicists in 31 countries). A grid infrastructure spread over the US and Europe coordinates the data analysis, involving networks connecting dozens of sites and thousands of systems.

Challenges in handling Big Data. Storage: cost of storage, scalability in throughput. Network: large datasets shared among an international community; making data immediately available to researchers, as if it were on a local area network. Data integrity: including multiple copies or some form of backup. Metadata, provenance, and ontologies.
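On the data-integrity point, a common tactic behind "multiple copies" is to checksum every replica and compare it against a digest recorded at write time. A minimal Python sketch, with hypothetical file paths and a three-copy layout assumed purely for illustration:

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so arbitrarily large replicas fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_replicas(replica_paths, expected_digest):
    """Return the subset of replicas whose checksum no longer matches."""
    return [p for p in replica_paths if sha256_of(p) != expected_digest]

if __name__ == "__main__":
    # Hypothetical layout: three copies of the same dataset file on different mounts.
    replicas = [Path(f"/data/replica{i}/run2012.dat") for i in range(3)]
    expected = sha256_of(replicas[0])  # in practice, stored in a catalog at write time
    bad = verify_replicas(replicas[1:], expected)
    print("corrupt replicas:", bad or "none")
```

Production systems such as HDFS typically run this kind of verification continuously in the background rather than on demand.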

Challenges in handling Big Data (contd). Open access: security issues; can we support controlled sharing? Very long-term data preservation: preserving datasets for an unlimited timespan. Technology: new architectures, algorithms, and techniques are needed, as are experts in using the new technology and dealing with Big Data.

Traditional Data Storage Models. Traditionally, data is stored on disk arrays (SAN): limited storage/compute capacity; throughput limited to the speed of the interface/disks; limited scalability; expensive. Thumb drives and portable drives: reliability concerns. Real-time data processing (the velocity vector) is not possible, and management and data integration become challenging.

Emerging Storage and Data Analysis Trends. Break the data apart and distribute it across multiple machines; computation is done locally, near the storage. Data replication is built in to the filesystem. (Diagram: a dataset split into Blocks 1-5, with replicas spread across data/compute nodes.)
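To make that concrete, here is a minimal Python sketch of the idea, not any real filesystem's API; the 64 MiB block size, three-way replication, node names, and round-robin placement are illustrative assumptions (HDFS, for instance, uses a rack-aware placement policy rather than plain round-robin):

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB, a common HDFS-era default
REPLICATION = 3                # copies per block, as in stock HDFS

def split_into_blocks(size_bytes, block_size=BLOCK_SIZE):
    """Yield (block_id, offset, length) tuples covering the whole dataset."""
    for block_id, offset in enumerate(range(0, size_bytes, block_size)):
        yield block_id, offset, min(block_size, size_bytes - offset)

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Round-robin placement: each block lands on `replication` nodes.

    Consecutive draws from the cycle are distinct as long as
    replication <= len(nodes)."""
    ring = itertools.cycle(nodes)
    return {block_id: [next(ring) for _ in range(replication)]
            for block_id, _, _ in blocks}

if __name__ == "__main__":
    nodes = [f"node{i:02d}" for i in range(1, 7)]        # 6 hypothetical data/compute nodes
    blocks = list(split_into_blocks(300 * 1024 * 1024))  # a 300 MiB dataset -> 5 blocks
    for block_id, owners in place_replicas(blocks, nodes).items():
        print(f"block {block_id}: {owners}")
```

With blocks laid out this way, a scheduler can ship computation to whichever node already holds a block's replica, which is the "computation near the storage" point above.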

Traditional Data Processing. Batch processing: generally not real time. Desktop computing: slow computation and throughput, generally limited networking. Relational databases (MySQL): performance issues at large data sizes; require structured data.

Emerging Big Data Processing Methods. Schema-less (NoSQL) databases. Processing the data near or on the data cluster: MapReduce, the massively parallel execution of jobs across large data clusters. Cloud storage/processing: Amazon Elastic MapReduce, Google Cloud Platform.
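As a sketch of the MapReduce style, here is the classic word count written as a Hadoop Streaming-style mapper/reducer pair in Python (a minimal sketch; the framework sorts the mapper's output by key before the reducer sees it, which is what the reducer below relies on):

```python
#!/usr/bin/env python3
"""Word count in the MapReduce style, runnable as a streaming
mapper or reducer: `wordcount.py map` / `wordcount.py reduce`."""
import sys

def mapper():
    # Emit ("word", 1) for every word on stdin; the shuffle phase sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

It can be tested locally with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`; on a cluster, the same two commands would be handed to the Hadoop Streaming jar as its -mapper and -reducer arguments.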

Processing Datasets using Hadoop. Hadoop is a framework that allows for the processing of large data sets (volume) of various data types (variety). Data is stored on commodity hard drives across multiple machines, which allows for scalability, reliability, high throughput, and low storage latency. Limitations include: optimized for moderate to large datasets; administrative overhead.

Hadoop

Traditional commodity networks: not designed for big data flows; firewalls/intrusion protection does not scale; latency becomes a factor; sharing of data is often unreliable.

Specialized Research Networks: isolation of data through the Science DMZ model; dedicated infrastructure; guaranteed throughput for real-time data; lower latency through dedicated fiber links.

A Case Study for Big Data: The Large Hadron Collider's Compact Muon Solenoid (CMS). (Detector diagram, highlighting the tracker.)

Why is the CMS project a good example? Open accessibility of data; volume of data; data transported for analysis in real time; long-term data retention; data is semi-structured.

Grid Infrastructure to Transport CMS Data (diagram): the online system feeds Tier 0 (the CERN Center) at ~Pbyte/s; Tier 0 feeds the Tier 1 centers (IN2P3, RAL, INFN, FNAL) at 10-40 Gbps; Tier 1 feeds the Tier 2 sites (Caltech, Florida, MIT, Nebraska, Purdue, UCSD, Wisconsin) at ~10 Gbps; Tier 2 feeds the Tier 3 sites (NU and other institutes) at 1 to 10 Gbps.

NU Tier-3 Implementation (diagram): connected to the Internet and research networks (Starlight) at 10 Gbit/s. Administrative nodes: ttgrid01 (services node), ttgrid02 (namenode), ttgrid03 (compute element), ttgrid04 (storage element). Login nodes (tier3.northwestern.edu): ttlogin01, ttlogin02. Worker nodes: ttnode0001 through ttnode0014, on an FDR InfiniBand switch.

Common Data Lifecycle Stages. From: Fary, Michael and Owen, Kim, Developing an Institutional Research Data Management Plan Service, Educause ACTI white paper, January 2013, http://net.educause.edu/ir/library/pdf/acti1301.pdf

FUNDER REQUIREMENTS

Data Management Plans. Information that should be provided: types of data to be produced; standards or descriptions that would be used with the data (metadata); how these data will be accessed and shared; policies and provisions for data sharing and reuse; provisions for archiving and preservation. Assistance for Data Management Plans: DMPTool (https://dmp.cdlib.org) and the NU Library Data Management web page (http://www.library.northwestern.edu/dmp).

Why share data? Why make it open? It clearly documents and provides evidence for research in conjunction with published results; meets copyright and ethical compliance requirements (e.g., HIPAA); increases the impact of research through data citation; preserves data for long-term access and prevents loss of data; describes and shares data with others to further new discoveries and research; prevents duplication of research; accelerates the pace of research; and promotes reproducibility of research.

Describe and deposit: metadata

Deposit on publication of article. Some journal publishers require or recommend that supporting data for articles be made publicly available. The Joint Data Archiving Policy (JDAP) requires data sharing in a public archive as a condition of publication: http://www.datadryad.org/pages/jdap. Journals that have adopted JDAP include Science, Nature, and Genetics. The author is usually responsible for making data available in a repository/archive; check the data archiving policies of journals before submitting articles.

Deposit and share via repository. Does your project have its own repository? Does your funder? If not, Databib (http://databib.org/) may be helpful for finding a repository. NU is starting conversations around local data archiving needs and would love to hear your input.

Questions?