Introduc)on to Big Data

Size: px
Start display at page:

Download "Introduc)on to Big Data"

Transcription

1 Introduc)on to Big Data Pradeep Sivakumar, Sr. HPC Specialist David King, Sr. HPC Systems Engineer Research Compu.ng Services Cunera Buys, e- science Librarian NU Library

2 Table of Contents What is Big Data? CharacterisCcs 3 V s Why is it relevant? Examples Challenges Moving Processing Management

3 What is Big Data? Big Data are high- volume, high- velocity, and/or high- variety informacon assets that require new forms of processing to enable enhanced decision making, insight discovery and process opcmizacon. Doug Lancy D Data Management: Controlling Data Volume, Velocity and Variety. Gartner Group

4 Data Velocity Big Data Characteris)cs (3 Vs) Data Volume PB Data Variety

5 Why is it relevant? The model of generacng/consuming data has changed Progress and innovacon no longer hindered by the ability to collect data Ability to manage, analyze, visualize, share, and discover knowledge from the collected data in a Cmely and scalable fashion is criccal The Expanding Digital Universe, IDC, 2007

6 Examples: Sloan Digital Sky Survey ( ) The Cosmic Genome Project Conducted at Apache Point Observatory, New Mexico Goal: Map one quarter of the sky and create a systemacc, three- dimensional picture of the universe Two surveys in one Photometric survey in 5 bands Spectroscopic redshi\ survey Data is public 2.5 Terapixels of images 10 TB of raw data => 120 TB processed 0.5 TB catalogs => 35 TB in the end *Images credit Fermilab Visual Media Services

7 Human Genome Project ( ) Sequence the human genome in order to track down the genes responsible for inherited diseases Cost of sequencing human genome has gone down $40 million (2003) $5000 (2013) Spurred innovacon of new visualizacon methods, more robust analysis tools Genomics produces huge volumes of data Human genome has 3 million base pairs, genes Equivalent to 100 GB *Images credit U.S Department of Energy Genomics Science Program

8 *Images credit CERN Large Hadron Collider at CERN Goal: Smash protons moving at % of the speed of light into each other, beams of protons collide in four points (ALICE, ATLAS, CMS, and LHCb) In November 2012, LHC recorded the first observacons of the Higgs boson parccle (CMS, ATLAS) LHC produces 15 PB annually. Data needs to be accessed and analyzed by thousands of sciencsts internaconally (2000 physicists, 31 countries) Grid infrastructure spread over US, Europe to coordinate data analysis involving networks conneccng dozens of sites and thousands of systems

9 Challenges in handling Big Data Storage Cost of storage, scalability in throughput Network Large datasets shared among an internaconal community Making data immediate to researchers by providing a local area network Data integrity Including mulcple copies or some form of backup Metadata, provenance, and ontologies

10 Challenges in handling Big Data (contd) Open access Security issues. Can we support controlled sharing? Very long term data preservacon Preserve datasets for an unlimited Cmespan Technology New architecture, algorithms, techniques are needed Experts in using the new technology and dealing with Big Data

11 Tradi)onal Data Storage Models TradiConally, data is stored on disk arrays (SAN) Limited storage/compute capacity Throughput is limited to speed of the interface/disks Scalability is limited Expensive Thumb drives and portable drives Reliability Real Cme data processing (velocity vector) is not possible Management, data integracon becomes challenging

12 Emerging Storage and Data Analysis Trends Break the data apart and distribute it across mulcple machines ComputaCon is done locally near the storage DataSet Block 1 Block 2 Block 3 Data replicacon is built- Block 4 in to the filesystem Block 5 Replica Data/Compute Nodes

13 Tradi)onal Data Processing Batch processing Generally not real Cme Desktop compucng Slow computacon and throughput Generally limited networking RelaConal Databases (MySQL) Performance issues at large data size Requires structured data

14 Emerging Big Data Processing Methods Schema- less databases NoSQL databases Processing the data near or on the data cluster MapReduce Massive parallel job execucon of jobs across large data clusters Cloud Storage/Processing Amazon ElasCc MapReduce Google Cloud Plajorm

15 Processing Datasets using Hadoop Framework that allows for processing of large data sets (Volume) of various data types (Variety) Data stored on commodity hard drives across mulcple machines Allows for scalability Reliability High throughput Low storage latency LimitaCons include OpCmized for moderate to large datasets AdministraCve overhead

16 Hadoop

17 Tradi)onal commodity networks Not designed for big data flows Firewalls/intrusion proteccon does not scale Latency becomes a factor Sharing of data is o\en unreliable

18 Specialized Research Networks IsolaCon of data through Science DMZ model Dedicated infrastructure Guaranteed throughput for real Cme data Less latency through dedicated fiber links

19 A Case Study for Big Data : The Large Hadron Collider Compact Muon Spectrometer Tracker

20 Why is the CMS project is a good example? Open accessibility of data Volume of data Data transported for analysis in real Cme Long term data retencon Data is semi- structured

21 Grid Infrastructure to Transport CMS Data Online system Tier 0 ~Pbyte/s CERN Center Tier Gbps IN2P3 Center RAL Center INFN Center FNAL Center Tier 2 Caltech ~10 Gbps Florida MIT Nebraska Purdue UCSD Wisconsin 1 to 10 Gbps Tier 3 NU InsCtute InsCtute InsCtute

22 NU Tier- 3 Implementa)on Internet and Research Networks (Starlight) 10Gbit/s Administrative Nodes Login Nodes tier3.northwestern.edu ttgrid01 Services Node ttgrid02 Namenode ttgrid03 Compute Element ttgrid04 Storage Element ttlogin01 login node ttlogin02 login node Worker Nodes FDR Infiniband Switch ttnode0001 through ttnode0014

23 Common Data Lifecycle Stages From: Fary, Michael and Owen, Kim, Developing an InsCtuConal Research Data Management Plan Service, Educause ACTI white paper, January 2013, hqp://net.educause.edu/ir/library/pdf/acti1301.pdf

24 FUNDER REQUIREMENTS

25 Data Management Plans InformaCon that should be provided: Types of data to be produced. Standards or descripcons that would be used with the data (metadata). How these data will be accessed and shared. Policies and provisions for data sharing and reuse. Provisions for archiving and preservacon. Assistance for Data Management Plans: DMPTool: hqps://dmp.cdlib.org NU Library Data Management Web Page:hqp:// hqp:// dmp

26 Why share data? Why make it open? Clearly documents and provides evidence for research in conjunccon with published results. Meet copyright and ethical compliance (i.e. HIPAA). Increases the impact of research through data citacon. Preserves data for long- term access and prevents loss of data. Describes and shares data with others to further new discoveries and research. Prevent duplicacon of research. Accelerates the pace of research. Promotes reproducibility of research.

27 Describe and deposit: metadata

28 Deposit on publica)on of ar)cle Some Journal publishers require or recommend that supporcng data for arccles be made publicly available. The Joint Data Archiving Policy (JDAP) requires data sharing in a public archive as a condicon of publicacon. hqp:// Journals that have adopted JDAP include: Science, Nature and GeneCcs The author is usually responsible for making data available in repository/ archive. Check data archiving policies of journals before submisng arccles.

29

30 Deposit and share via repository Does your project have its own repository? Does your funder? If not, databib (hqp://databib.org/) may be helpful for finding a repository. NU is starcng conversacons around local data archiving needs, and would love to hear your input.

31 Ques)ons?

The CMS Computing Model

The CMS Computing Model The CMS Computing Model Dorian Kcira California Institute of Technology SuperComputing 2009 November 14-20 2009, Portland, OR CERN s Large Hadron Collider 5000+ Physicists/Engineers 300+ Institutes 70+

More information

Compact Muon Solenoid: Cyberinfrastructure Solutions. Ken Bloom UNL Cyberinfrastructure Workshop -- August 15, 2005

Compact Muon Solenoid: Cyberinfrastructure Solutions. Ken Bloom UNL Cyberinfrastructure Workshop -- August 15, 2005 Compact Muon Solenoid: Cyberinfrastructure Solutions Ken Bloom UNL Cyberinfrastructure Workshop -- August 15, 2005 Computing Demands CMS must provide computing to handle huge data rates and sizes, and

More information

Data Transfers Between LHC Grid Sites Dorian Kcira

Data Transfers Between LHC Grid Sites Dorian Kcira Data Transfers Between LHC Grid Sites Dorian Kcira dkcira@caltech.edu Caltech High Energy Physics Group hep.caltech.edu/cms CERN Site: LHC and the Experiments Large Hadron Collider 27 km circumference

More information

CSCS CERN videoconference CFD applications

CSCS CERN videoconference CFD applications CSCS CERN videoconference CFD applications TS/CV/Detector Cooling - CFD Team CERN June 13 th 2006 Michele Battistin June 2006 CERN & CFD Presentation 1 TOPICS - Some feedback about already existing collaboration

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

From raw data to new fundamental particles: The data management lifecycle at the Large Hadron Collider

From raw data to new fundamental particles: The data management lifecycle at the Large Hadron Collider From raw data to new fundamental particles: The data management lifecycle at the Large Hadron Collider Andrew Washbrook School of Physics and Astronomy University of Edinburgh Dealing with Data Conference

More information

Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy. David Toback

Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy. David Toback Big Computing and the Mitchell Institute for Fundamental Physics and Astronomy Texas A&M Big Data Workshop October 2011 January 2015, Texas A&M University Research Topics Seminar 1 Outline Overview of

More information

High Throughput WAN Data Transfer with Hadoop-based Storage

High Throughput WAN Data Transfer with Hadoop-based Storage High Throughput WAN Data Transfer with Hadoop-based Storage A Amin 2, B Bockelman 4, J Letts 1, T Levshina 3, T Martin 1, H Pi 1, I Sfiligoi 1, M Thomas 2, F Wuerthwein 1 1 University of California, San

More information

Stephen J. Gowdy (CERN) 12 th September 2012 XLDB Conference FINDING THE HIGGS IN THE HAYSTACK(S)

Stephen J. Gowdy (CERN) 12 th September 2012 XLDB Conference FINDING THE HIGGS IN THE HAYSTACK(S) Stephen J. Gowdy (CERN) 12 th September 2012 XLDB Conference FINDING THE HIGGS IN THE HAYSTACK(S) Overview Large Hadron Collider (LHC) Compact Muon Solenoid (CMS) experiment The Challenge Worldwide LHC

More information

Worldwide Production Distributed Data Management at the LHC. Brian Bockelman MSST 2010, 4 May 2010

Worldwide Production Distributed Data Management at the LHC. Brian Bockelman MSST 2010, 4 May 2010 Worldwide Production Distributed Data Management at the LHC Brian Bockelman MSST 2010, 4 May 2010 At the LHC http://op-webtools.web.cern.ch/opwebtools/vistar/vistars.php?usr=lhc1 Gratuitous detector pictures:

More information

Challenges and Evolution of the LHC Production Grid. April 13, 2011 Ian Fisk

Challenges and Evolution of the LHC Production Grid. April 13, 2011 Ian Fisk Challenges and Evolution of the LHC Production Grid April 13, 2011 Ian Fisk 1 Evolution Uni x ALICE Remote Access PD2P/ Popularity Tier-2 Tier-2 Uni u Open Lab m Tier-2 Science Uni x Grid Uni z USA Tier-2

More information

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez Scientific data processing at global scale The LHC Computing Grid Chengdu (China), July 5th 2011 Who I am 2 Computing science background Working in the field of computing for high-energy physics since

More information

The creation of a Tier-1 Data Center for the ALICE experiment in the UNAM. Lukas Nellen ICN-UNAM

The creation of a Tier-1 Data Center for the ALICE experiment in the UNAM. Lukas Nellen ICN-UNAM The creation of a Tier-1 Data Center for the ALICE experiment in the UNAM Lukas Nellen ICN-UNAM lukas@nucleares.unam.mx 3rd BigData BigNetworks Conference Puerto Vallarta April 23, 2015 Who Am I? ALICE

More information

Big Data with Hadoop Ecosystem

Big Data with Hadoop Ecosystem Diógenes Pires Big Data with Hadoop Ecosystem Hands-on (HBase, MySql and Hive + Power BI) Internet Live http://www.internetlivestats.com/ Introduction Business Intelligence Business Intelligence Process

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

High-Energy Physics Data-Storage Challenges

High-Energy Physics Data-Storage Challenges High-Energy Physics Data-Storage Challenges Richard P. Mount SLAC SC2003 Experimental HENP Understanding the quantum world requires: Repeated measurement billions of collisions Large (500 2000 physicist)

More information

Conference The Data Challenges of the LHC. Reda Tafirout, TRIUMF

Conference The Data Challenges of the LHC. Reda Tafirout, TRIUMF Conference 2017 The Data Challenges of the LHC Reda Tafirout, TRIUMF Outline LHC Science goals, tools and data Worldwide LHC Computing Grid Collaboration & Scale Key challenges Networking ATLAS experiment

More information

Storage on the Lunatic Fringe. Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium

Storage on the Lunatic Fringe. Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium Storage on the Lunatic Fringe Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium tmruwart@dtc.umn.edu Orientation Who are the lunatics? What are their requirements?

More information

New strategies of the LHC experiments to meet the computing requirements of the HL-LHC era

New strategies of the LHC experiments to meet the computing requirements of the HL-LHC era to meet the computing requirements of the HL-LHC era NPI AS CR Prague/Rez E-mail: adamova@ujf.cas.cz Maarten Litmaath CERN E-mail: Maarten.Litmaath@cern.ch The performance of the Large Hadron Collider

More information

CSE6331: Cloud Computing

CSE6331: Cloud Computing CSE6331: Cloud Computing Leonidas Fegaras University of Texas at Arlington c 2019 by Leonidas Fegaras Cloud Computing Fundamentals Based on: J. Freire s class notes on Big Data http://vgc.poly.edu/~juliana/courses/bigdata2016/

More information

THE EMC ISILON STORY. Big Data In The Enterprise. Deya Bassiouni Isilon Regional Sales Manager Emerging Africa, Egypt & Lebanon.

THE EMC ISILON STORY. Big Data In The Enterprise. Deya Bassiouni Isilon Regional Sales Manager Emerging Africa, Egypt & Lebanon. THE EMC ISILON STORY Big Data In The Enterprise Deya Bassiouni Isilon Regional Sales Manager Emerging Africa, Egypt & Lebanon August, 2012 1 Big Data In The Enterprise Isilon Overview Isilon Technology

More information

ATLAS Experiment and GCE

ATLAS Experiment and GCE ATLAS Experiment and GCE Google IO Conference San Francisco, CA Sergey Panitkin (BNL) and Andrew Hanushevsky (SLAC), for the ATLAS Collaboration ATLAS Experiment The ATLAS is one of the six particle detectors

More information

CERN openlab II. CERN openlab and. Sverre Jarp CERN openlab CTO 16 September 2008

CERN openlab II. CERN openlab and. Sverre Jarp CERN openlab CTO 16 September 2008 CERN openlab II CERN openlab and Intel: Today and Tomorrow Sverre Jarp CERN openlab CTO 16 September 2008 Overview of CERN 2 CERN is the world's largest particle physics centre What is CERN? Particle physics

More information

Grid Computing Activities at KIT

Grid Computing Activities at KIT Grid Computing Activities at KIT Meeting between NCP and KIT, 21.09.2015 Manuel Giffels Karlsruhe Institute of Technology Institute of Experimental Nuclear Physics & Steinbuch Center for Computing Courtesy

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

IllustraCve Example of Distributed Analysis in ATLAS Spanish Tier2 and Tier3

IllustraCve Example of Distributed Analysis in ATLAS Spanish Tier2 and Tier3 IllustraCve Example of Distributed Analysis in ATLAS Spanish Tier2 and Tier3 S. González, E. Oliver, M. Villaplana, A. Fernández, M. Kaci, A. Lamas, J. Salt, J. Sánchez PCI2010 Workshop Rabat, 5 th 7 th

More information

Monitoring system for geographically distributed datacenters based on Openstack. Gioacchino Vino

Monitoring system for geographically distributed datacenters based on Openstack. Gioacchino Vino Monitoring system for geographically distributed datacenters based on Openstack Gioacchino Vino Tutor: Dott. Domenico Elia Tutor: Dott. Giacinto Donvito Borsa di studio GARR Orio Carlini 2016-2017 INFN

More information

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan Storage Virtualization Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan Storage Virtualization In computer science, storage virtualization uses virtualization to enable better functionality

More information

CERN and Scientific Computing

CERN and Scientific Computing CERN and Scientific Computing Massimo Lamanna CERN Information Technology Department Experiment Support Group 1960: 26 GeV proton in the 32 cm CERN hydrogen bubble chamber 1960: IBM 709 at the Geneva airport

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

Virtualizing a Batch. University Grid Center

Virtualizing a Batch. University Grid Center Virtualizing a Batch Queuing System at a University Grid Center Volker Büge (1,2), Yves Kemp (1), Günter Quast (1), Oliver Oberst (1), Marcel Kunze (2) (1) University of Karlsruhe (2) Forschungszentrum

More information

Grid Computing: dealing with GB/s dataflows

Grid Computing: dealing with GB/s dataflows Grid Computing: dealing with GB/s dataflows Jan Just Keijser, Nikhef janjust@nikhef.nl David Groep, NIKHEF 21 March 2011 Graphics: Real Time Monitor, Gidon Moont, Imperial College London, see http://gridportal.hep.ph.ic.ac.uk/rtm/

More information

IEPSAS-Kosice: experiences in running LCG site

IEPSAS-Kosice: experiences in running LCG site IEPSAS-Kosice: experiences in running LCG site Marian Babik 1, Dusan Bruncko 2, Tomas Daranyi 1, Ladislav Hluchy 1 and Pavol Strizenec 2 1 Department of Parallel and Distributed Computing, Institute of

More information

CouchDB-based system for data management in a Grid environment Implementation and Experience

CouchDB-based system for data management in a Grid environment Implementation and Experience CouchDB-based system for data management in a Grid environment Implementation and Experience Hassen Riahi IT/SDC, CERN Outline Context Problematic and strategy System architecture Integration and deployment

More information

CMS Grid Computing at TAMU Performance, Monitoring and Current Status of the Brazos Cluster

CMS Grid Computing at TAMU Performance, Monitoring and Current Status of the Brazos Cluster CMS Grid Computing at TAMU Performance, Monitoring and Current Status of the Brazos Cluster Vaikunth Thukral Department of Physics and Astronomy Texas A&M University 1 Outline Grid Computing with CMS:

More information

CC-IN2P3: A High Performance Data Center for Research

CC-IN2P3: A High Performance Data Center for Research April 15 th, 2011 CC-IN2P3: A High Performance Data Center for Research Toward a partnership with DELL Dominique Boutigny Agenda Welcome Introduction to CC-IN2P3 Visit of the computer room Lunch Discussion

More information

ISTITUTO NAZIONALE DI FISICA NUCLEARE

ISTITUTO NAZIONALE DI FISICA NUCLEARE ISTITUTO NAZIONALE DI FISICA NUCLEARE Sezione di Perugia INFN/TC-05/10 July 4, 2005 DESIGN, IMPLEMENTATION AND CONFIGURATION OF A GRID SITE WITH A PRIVATE NETWORK ARCHITECTURE Leonello Servoli 1,2!, Mirko

More information

Experience of the WLCG data management system from the first two years of the LHC data taking

Experience of the WLCG data management system from the first two years of the LHC data taking Experience of the WLCG data management system from the first two years of the LHC data taking 1 Nuclear Physics Institute, Czech Academy of Sciences Rez near Prague, CZ 25068, Czech Republic E-mail: adamova@ujf.cas.cz

More information

Big Data Analytics and the LHC

Big Data Analytics and the LHC Big Data Analytics and the LHC Maria Girone CERN openlab CTO Computing Frontiers 2016, Como, May 2016 DOI: 10.5281/zenodo.45449, CC-BY-SA, images courtesy of CERN 2 3 xx 4 Big bang in the laboratory We

More information

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Unleash Your Data Center s Hidden Power September 16, 2014 Molly Rector CMO, EVP Product Management & WW Marketing

More information

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018 Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/

More information

Isilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team

Isilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team Isilon: Raising The Bar On Performance & Archive Use Cases John Har Solutions Product Manager Unstructured Data Storage Team What we ll cover in this session Isilon Overview Streaming workflows High ops/s

More information

BUSINESS DATA LAKE FADI FAKHOURI, SR. SYSTEMS ENGINEER, ISILON SPECIALIST. Copyright 2016 EMC Corporation. All rights reserved.

BUSINESS DATA LAKE FADI FAKHOURI, SR. SYSTEMS ENGINEER, ISILON SPECIALIST. Copyright 2016 EMC Corporation. All rights reserved. BUSINESS DATA LAKE FADI FAKHOURI, SR. SYSTEMS ENGINEER, ISILON SPECIALIST 1 UNSTRUCTURED DATA GROWTH 75% 78% 80% 2015 71 EB 2016 106 EB 2017 133 EB Total Capacity Shipped, Worldwide % of Unstructured Data

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

BlueDBM: An Appliance for Big Data Analytics*

BlueDBM: An Appliance for Big Data Analytics* BlueDBM: An Appliance for Big Data Analytics* Arvind *[ISCA, 2015] Sang-Woo Jun, Ming Liu, Sungjin Lee, Shuotao Xu, Arvind (MIT) and Jamey Hicks, John Ankcorn, Myron King(Quanta) BigData@CSAIL Annual Meeting

More information

Hadoop, Yarn and Beyond

Hadoop, Yarn and Beyond Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets

More information

Insight: that s for NSA Decision making: that s for Google, Facebook. so they find the best way to push out adds and products

Insight: that s for NSA Decision making: that s for Google, Facebook. so they find the best way to push out adds and products What is big data? Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

More information

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer

More information

Striped Data Server for Scalable Parallel Data Analysis

Striped Data Server for Scalable Parallel Data Analysis Journal of Physics: Conference Series PAPER OPEN ACCESS Striped Data Server for Scalable Parallel Data Analysis To cite this article: Jin Chang et al 2018 J. Phys.: Conf. Ser. 1085 042035 View the article

More information

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development Jeremy Fischer Indiana University 9 September 2014 Citation: Fischer, J.L. 2014. ACCI Recommendations on Long Term

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today s growing database

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

Batch Services at CERN: Status and Future Evolution

Batch Services at CERN: Status and Future Evolution Batch Services at CERN: Status and Future Evolution Helge Meinhard, CERN-IT Platform and Engineering Services Group Leader HTCondor Week 20 May 2015 20-May-2015 CERN batch status and evolution - Helge

More information

Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391

Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391 Hadoop محبوبه دادخواه کارگاه ساالنه آزمایشگاه فناوری وب زمستان 1391 Outline Big Data Big Data Examples Challenges with traditional storage NoSQL Hadoop HDFS MapReduce Architecture 2 Big Data In information

More information

Overview. About CERN 2 / 11

Overview. About CERN 2 / 11 Overview CERN wanted to upgrade the data monitoring system of one of its Large Hadron Collider experiments called ALICE (A La rge Ion Collider Experiment) to ensure the experiment s high efficiency. They

More information

THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY

THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY INTRODUCTION Driven by the need to remain competitive and differentiate themselves, organizations are undergoing digital transformations and becoming increasingly

More information

Data Management Plan Generic Template Zach S. Henderson Library

Data Management Plan Generic Template Zach S. Henderson Library Data Management Plan Generic Template Zach S. Henderson Library Use this Template to prepare a generic data management plan (DMP). This template does not correspond to any particular grant funder s DMP

More information

Top Trends in DBMS & DW

Top Trends in DBMS & DW Oracle Top Trends in DBMS & DW Noel Yuhanna Principal Analyst Forrester Research Trend #1: Proliferation of data Data doubles every 18-24 months for critical Apps, for some its every 6 months Terabyte

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY Journal of Analysis and Computation (JAC) (An International Peer Reviewed Journal), www.ijaconline.com, ISSN 0973-2861 International Conference on Emerging Trends in IOT & Machine Learning, 2018 TOOLS

More information

Storage Resource Sharing with CASTOR.

Storage Resource Sharing with CASTOR. Storage Resource Sharing with CASTOR Olof Barring, Benjamin Couturier, Jean-Damien Durand, Emil Knezo, Sebastien Ponce (CERN) Vitali Motyakov (IHEP) ben.couturier@cern.ch 16/4/2004 Storage Resource Sharing

More information

High Performance and Cloud Computing (HPCC) for Bioinformatics

High Performance and Cloud Computing (HPCC) for Bioinformatics High Performance and Cloud Computing (HPCC) for Bioinformatics King Jordan Georgia Tech January 13, 2016 Adopted From BIOS-ICGEB HPCC for Bioinformatics 1 Outline High performance computing (HPC) Cloud

More information

Towards Network Awareness in LHC Computing

Towards Network Awareness in LHC Computing Towards Network Awareness in LHC Computing CMS ALICE CERN Atlas LHCb LHC Run1: Discovery of a New Boson LHC Run2: Beyond the Standard Model Gateway to a New Era Artur Barczyk / Caltech Internet2 Technology

More information

Understanding the T2 traffic in CMS during Run-1

Understanding the T2 traffic in CMS during Run-1 Journal of Physics: Conference Series PAPER OPEN ACCESS Understanding the T2 traffic in CMS during Run-1 To cite this article: Wildish T and 2015 J. Phys.: Conf. Ser. 664 032034 View the article online

More information

arxiv: v1 [cs.dc] 20 Jul 2015

arxiv: v1 [cs.dc] 20 Jul 2015 Designing Computing System Architecture and Models for the HL-LHC era arxiv:1507.07430v1 [cs.dc] 20 Jul 2015 L Bauerdick 1, B Bockelman 2, P Elmer 3, S Gowdy 1, M Tadel 4 and F Würthwein 4 1 Fermilab,

More information

5 Fundamental Strategies for Building a Data-centered Data Center

5 Fundamental Strategies for Building a Data-centered Data Center 5 Fundamental Strategies for Building a Data-centered Data Center June 3, 2014 Ken Krupa, Chief Field Architect Gary Vidal, Solutions Specialist Last generation Reference Data Unstructured OLTP Warehouse

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Invenio: A Modern Digital Library for Grey Literature

Invenio: A Modern Digital Library for Grey Literature Invenio: A Modern Digital Library for Grey Literature Jérôme Caffaro, CERN Samuele Kaplun, CERN November 25, 2010 Abstract Grey literature has historically played a key role for researchers in the field

More information

Online data storage service strategy for the CERN computer Centre G. Cancio, D. Duellmann, M. Lamanna, A. Pace CERN, Geneva, Switzerland

Online data storage service strategy for the CERN computer Centre G. Cancio, D. Duellmann, M. Lamanna, A. Pace CERN, Geneva, Switzerland Online data storage service strategy for the CERN computer Centre G. Cancio, D. Duellmann, M. Lamanna, A. Pace CERN, Geneva, Switzerland Abstract. The Data and Storage Services group at CERN is conducting

More information

STATUS OF PLANS TO USE CONTAINERS IN THE WORLDWIDE LHC COMPUTING GRID

STATUS OF PLANS TO USE CONTAINERS IN THE WORLDWIDE LHC COMPUTING GRID The WLCG Motivation and benefits Container engines Experiments status and plans Security considerations Summary and outlook STATUS OF PLANS TO USE CONTAINERS IN THE WORLDWIDE LHC COMPUTING GRID SWISS EXPERIENCE

More information

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture

Highly Scalable, Non-RDMA NVMe Fabric. Bob Hansen,, VP System Architecture A Cost Effective,, High g Performance,, Highly Scalable, Non-RDMA NVMe Fabric Bob Hansen,, VP System Architecture bob@apeirondata.com Storage Developers Conference, September 2015 Agenda 3 rd Platform

More information

Big Data and Object Storage

Big Data and Object Storage Big Data and Object Storage or where to store the cold and small data? Sven Bauernfeind Computacenter AG & Co. ohg, Consultancy Germany 28.02.2018 Munich Volume, Variety & Velocity + Analytics Velocity

More information

A Survey on Big Data

A Survey on Big Data A Survey on Big Data D.Prudhvi 1, D.Jaswitha 2, B. Mounika 3, Monika Bagal 4 1 2 3 4 B.Tech Final Year, CSE, Dadi Institute of Engineering & Technology,Andhra Pradesh,INDIA ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

When, Where & Why to Use NoSQL?

When, Where & Why to Use NoSQL? When, Where & Why to Use NoSQL? 1 Big data is becoming a big challenge for enterprises. Many organizations have built environments for transactional data with Relational Database Management Systems (RDBMS),

More information

The ATLAS EventIndex: Full chain deployment and first operation

The ATLAS EventIndex: Full chain deployment and first operation The ATLAS EventIndex: Full chain deployment and first operation Álvaro Fernández Casaní Instituto de Física Corpuscular () Universitat de València CSIC On behalf of the ATLAS Collaboration 1 Outline ATLAS

More information

Lessons in Building a Distributed Query Planner. Ozgun Erdogan PGCon 2016

Lessons in Building a Distributed Query Planner. Ozgun Erdogan PGCon 2016 Lessons in Building a Distributed Query Planner Ozgun Erdogan PGCon 2016 Talk Outline 1. IntroducCon 2. Key insight in distributed planning 3. Distributed logical plans 4. Distributed physical plans 5.

More information

Using the In-Memory Columnar Store to Perform Real-Time Analysis of CERN Data. Maaike Limper Emil Pilecki Manuel Martín Márquez

Using the In-Memory Columnar Store to Perform Real-Time Analysis of CERN Data. Maaike Limper Emil Pilecki Manuel Martín Márquez Using the In-Memory Columnar Store to Perform Real-Time Analysis of CERN Data Maaike Limper Emil Pilecki Manuel Martín Márquez About the speakers Maaike Limper Physicist and project leader Manuel Martín

More information

Developing a Research Data Policy

Developing a Research Data Policy Developing a Research Data Policy Core Elements of the Content of a Research Data Management Policy This document may be useful for defining research data, explaining what RDM is, illustrating workflows,

More information

Gigabyte Bandwidth Enables Global Co-Laboratories

Gigabyte Bandwidth Enables Global Co-Laboratories Gigabyte Bandwidth Enables Global Co-Laboratories Prof. Harvey Newman, Caltech Jim Gray, Microsoft Presented at Windows Hardware Engineering Conference Seattle, WA, 2 May 2004 Credits: This represents

More information

The Materials Data Facility

The Materials Data Facility The Materials Data Facility Ben Blaiszik (blaiszik@uchicago.edu), Kyle Chard (chard@uchicago.edu) Ian Foster (foster@uchicago.edu) materialsdatafacility.org What is MDF? We aim to make it simple for materials

More information

Cluster Setup and Distributed File System

Cluster Setup and Distributed File System Cluster Setup and Distributed File System R&D Storage for the R&D Storage Group People Involved Gaetano Capasso - INFN-Naples Domenico Del Prete INFN-Naples Diacono Domenico INFN-Bari Donvito Giacinto

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Data Movement & Tiering with DMF 7

Data Movement & Tiering with DMF 7 Data Movement & Tiering with DMF 7 Kirill Malkin Director of Engineering April 2019 Why Move or Tier Data? We wish we could keep everything in DRAM, but It s volatile It s expensive Data in Memory 2 Why

More information

Grid Computing: dealing with GB/s dataflows

Grid Computing: dealing with GB/s dataflows Grid Computing: dealing with GB/s dataflows Jan Just Keijser, Nikhef janjust@nikhef.nl David Groep, NIKHEF 3 May 2012 Graphics: Real Time Monitor, Gidon Moont, Imperial College London, see http://gridportal.hep.ph.ic.ac.uk/rtm/

More information

Opportunities A Realistic Study of Costs Associated

Opportunities A Realistic Study of Costs Associated e-fiscal Summer Workshop Opportunities A Realistic Study of Costs Associated X to Datacenter Installation and Operation in a Research Institute can we do EVEN better? Samos, 3rd July 2012 Jesús Marco de

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

From Internet Data Centers to Data Centers in the Cloud

From Internet Data Centers to Data Centers in the Cloud From Internet Data Centers to Data Centers in the Cloud This case study is a short extract from a keynote address given to the Doctoral Symposium at Middleware 2009 by Lucy Cherkasova of HP Research Labs

More information

Big Data and Cloud Computing

Big Data and Cloud Computing Big Data and Cloud Computing Presented at Faculty of Computer Science University of Murcia Presenter: Muhammad Fahim, PhD Department of Computer Eng. Istanbul S. Zaim University, Istanbul, Turkey About

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Introduction to Grid Computing

Introduction to Grid Computing Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able

More information

CS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald

CS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald CS 6240: Parallel Data Processing in MapReduce: Module 1 Mirek Riedewald Why Parallel Processing? Answer 1: Big Data 2 How Much Information? Source: http://www2.sims.berkeley.edu/research/projects/ho w-much-info-2003/execsum.htm

More information

Storage and I/O requirements of the LHC experiments

Storage and I/O requirements of the LHC experiments Storage and I/O requirements of the LHC experiments Sverre Jarp CERN openlab, IT Dept where the Web was born 22 June 2006 OpenFabrics Workshop, Paris 1 Briefly about CERN 22 June 2006 OpenFabrics Workshop,

More information

HPC Growing Pains. IT Lessons Learned from the Biomedical Data Deluge

HPC Growing Pains. IT Lessons Learned from the Biomedical Data Deluge HPC Growing Pains IT Lessons Learned from the Biomedical Data Deluge John L. Wofford Center for Computational Biology & Bioinformatics Columbia University What is? Internationally recognized biomedical

More information

THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY

THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY INTRODUCTION Driven by the need to remain competitive and differentiate themselves, organizations are undergoing digital transformations and becoming increasingly

More information

Long Term Data Preservation for CDF at INFN-CNAF

Long Term Data Preservation for CDF at INFN-CNAF Long Term Data Preservation for CDF at INFN-CNAF S. Amerio 1, L. Chiarelli 2, L. dell Agnello 3, D. De Girolamo 3, D. Gregori 3, M. Pezzi 3, A. Prosperini 3, P. Ricci 3, F. Rosso 3, and S. Zani 3 1 University

More information

The LHC Computing Grid

The LHC Computing Grid The LHC Computing Grid Visit of Finnish IT Centre for Science CSC Board Members Finland Tuesday 19 th May 2009 Frédéric Hemmer IT Department Head The LHC and Detectors Outline Computing Challenges Current

More information

Writing a Data Management Plan A guide for the perplexed

Writing a Data Management Plan A guide for the perplexed March 29, 2012 Writing a Data Management Plan A guide for the perplexed Agenda Rationale and Motivations for Data Management Plans Data and data structures Metadata and provenance Provisions for privacy,

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Perspectives on Open Data in Science Open Data in Science: Challenges & Opportunities for Europe

Perspectives on Open Data in Science Open Data in Science: Challenges & Opportunities for Europe Perspectives on Open Data in Science Open Data in Science: Challenges & Opportunities for Europe Stephane Berghmans, DVM PhD 31 January 2018 9 When talking about data, we talk about All forms of research

More information