Introduction to Big Data
Pradeep Sivakumar, Sr. HPC Specialist
David King, Sr. HPC Systems Engineer
Research Computing Services
Cunera Buys, e-Science Librarian, NU Library
Table of Contents
What is Big Data? Characteristics (3 Vs)
Why is it relevant? Examples
Challenges: moving, processing, management
What is Big Data?
Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.
Doug Laney. 2001. 3-D Data Management: Controlling Data Volume, Velocity and Variety. Gartner Group
Big Data Characteristics (3 Vs)
Data Volume (PB scale), Data Velocity, Data Variety
Why is it relevant?
The model of generating/consuming data has changed.
Progress and innovation are no longer hindered by the ability to collect data.
The ability to manage, analyze, visualize, share, and discover knowledge from the collected data in a timely and scalable fashion is critical.
The Expanding Digital Universe, IDC, 2007
Examples: Sloan Digital Sky Survey (1992-2008)
The Cosmic Genome Project, conducted at Apache Point Observatory, New Mexico
Goal: map one quarter of the sky and create a systematic, three-dimensional picture of the universe
Two surveys in one: a photometric survey in 5 bands and a spectroscopic redshift survey
Data is public: 2.5 terapixels of images; 10 TB of raw data => 120 TB processed; 0.5 TB catalogs => 35 TB in the end
*Images credit Fermilab Visual Media Services
Human Genome Project (1990-2003)
Sequence the human genome in order to track down the genes responsible for inherited diseases
Cost of sequencing a human genome has gone down: $40 million (2003) to $5,000 (2013)
Spurred innovation of new visualization methods and more robust analysis tools
Genomics produces huge volumes of data: the human genome has 3 billion base pairs and 20,000-25,000 genes, equivalent to 100 GB
*Images credit U.S. Department of Energy Genomics Science Program
Large Hadron Collider at CERN
Goal: smash protons moving at 99.999999% of the speed of light into each other; beams of protons collide at four points (ALICE, ATLAS, CMS, and LHCb)
In November 2012, the LHC recorded the first observations of the Higgs boson particle (CMS, ATLAS)
The LHC produces 15 PB annually; the data needs to be accessed and analyzed by thousands of scientists internationally (2000 physicists, 31 countries)
Grid infrastructure spread over the US and Europe coordinates data analysis involving networks connecting dozens of sites and thousands of systems
*Images credit CERN
Challenges in handling Big Data
Storage: cost of storage, scalability in throughput
Network: large datasets shared among an international community; making data immediate to researchers by providing a local area network
Data integrity: including multiple copies or some form of backup
Metadata, provenance, and ontologies
Challenges in handling Big Data (cont'd)
Open access: security issues; can we support controlled sharing?
Very long-term data preservation: preserve datasets for an unlimited timespan
Technology: new architectures, algorithms, and techniques are needed, along with experts in using the new technology and dealing with Big Data
Traditional Data Storage Models
Traditionally, data is stored on disk arrays (SAN)
Limited storage/compute capacity; throughput is limited to the speed of the interface/disks; scalability is limited; expensive
Thumb drives and portable drives: reliability concerns
Real-time data processing (the velocity vector) is not possible
Management and data integration become challenging
Emerging Storage and Data Analysis Trends
Break the data apart and distribute it across multiple machines
Computation is done locally, near the storage
Data replication is built in to the filesystem
[Diagram: a dataset split into blocks, with replicas of each block spread across data/compute nodes]
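A minimal sketch of this idea follows (plain Python, not HDFS itself; the block size, node names, and round-robin placement policy are illustrative assumptions):

```python
# Illustrative sketch: split a dataset into fixed-size blocks and place each
# block on several nodes, mimicking how a distributed filesystem replicates data.
# Block size, node list, and placement policy are assumptions for this example.
from itertools import cycle

BLOCK_SIZE = 64 * 1024 * 1024           # 64 MB blocks (HDFS-like, illustrative)
REPLICATION = 3                          # keep 3 copies of every block
NODES = ["node01", "node02", "node03", "node04", "node05"]

def place_blocks(dataset_size_bytes):
    """Return a mapping of block id -> list of nodes holding a replica."""
    num_blocks = -(-dataset_size_bytes // BLOCK_SIZE)    # ceiling division
    node_cycle = cycle(NODES)
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [next(node_cycle) for _ in range(REPLICATION)]
    return placement

if __name__ == "__main__":
    # A 300 MB dataset becomes 5 blocks, each stored on 3 of the 5 nodes,
    # so computation can run on whichever node already holds a local copy.
    for block, nodes in place_blocks(300 * 1024 * 1024).items():
        print(f"block {block}: replicas on {nodes}")
```

Because each block lives on several nodes, a job can be scheduled on a machine that already has the data, and the loss of one node does not lose the dataset.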
Traditional Data Processing
Batch processing: generally not real time
Desktop computing: slow computation and throughput; generally limited networking
Relational databases (MySQL): performance issues at large data sizes; requires structured data
Emerging Big Data Processing Methods
Schema-less databases: NoSQL databases
Processing the data near or on the data cluster: MapReduce, massively parallel execution of jobs across large data clusters
Cloud storage/processing: Amazon Elastic MapReduce, Google Cloud Platform
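A minimal in-process sketch of the MapReduce model (plain Python; the word-count task and function names are illustrative assumptions, not any particular framework's API): map each record to key/value pairs, group the pairs by key (the "shuffle"), then reduce each group.

```python
# Illustrative MapReduce word count run in a single process: map each record to
# (key, value) pairs, group values by key, then reduce each group to a result.
from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) for every word in one input record (a line of text)."""
    for word in record.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    """Combine all counts emitted for one word."""
    return word, sum(counts)

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:                # map
        for key, value in map_phase(record):
            groups[key].append(value)     # shuffle: group values by key
    return dict(reduce_phase(k, v) for k, v in groups.items())   # reduce

if __name__ == "__main__":
    lines = ["big data needs new tools", "new data new methods"]
    print(run_mapreduce(lines))    # {'big': 1, 'data': 2, 'new': 3, ...}
```

On a real cluster the map and reduce calls run in parallel on the nodes that hold the data blocks, which is what makes the model scale.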
Processing Datasets using Hadoop
A framework that allows for processing of large data sets (Volume) of various data types (Variety)
Data stored on commodity hard drives across multiple machines, allowing for scalability, reliability, high throughput, and low storage latency
Limitations include: optimized for moderate to large datasets; administrative overhead
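One common way to run such a job on a Hadoop cluster from Python is the Hadoop Streaming interface, where the mapper and reducer are ordinary scripts reading stdin and writing stdout. A hedged sketch follows; the script name, HDFS paths, and the streaming-jar location are assumptions that vary by installation.

```python
# streaming_wordcount.py -- word count written in the Hadoop Streaming style:
# the mapper and reducer are plain scripts that read stdin and write stdout.
# Script name, HDFS paths, and jar location below are illustrative assumptions.
import sys

def mapper():
    # Emit "word<TAB>1" for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

def reducer():
    # Streaming delivers mapper output sorted by key, so equal words are adjacent.
    current_word, running_total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print(f"{current_word}\t{running_total}")
            running_total = 0
        current_word = word
        running_total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{running_total}")

# Illustrative submission (exact jar path and HDFS paths depend on the cluster):
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#       -mapper "python3 streaming_wordcount.py map" \
#       -reducer "python3 streaming_wordcount.py reduce"
if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if mode == "map" else reducer()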
Hadoop
Traditional commodity networks
Not designed for big data flows
Firewalls/intrusion protection does not scale
Latency becomes a factor
Sharing of data is often unreliable
Specialized Research Networks
Isolation of data through the Science DMZ model
Dedicated infrastructure
Guaranteed throughput for real-time data
Less latency through dedicated fiber links
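A back-of-envelope calculation shows why dedicated high-bandwidth links matter (the dataset size and link speeds below are chosen only as an illustration, and real transfers also lose time to protocol overhead and congestion):

```python
# Rough transfer-time estimate: how long moving a dataset takes at a given
# sustained link speed, ignoring protocol overhead and contention.
def transfer_hours(dataset_terabytes, link_gbps):
    bits = dataset_terabytes * 1e12 * 8          # TB -> bits (decimal units)
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600

if __name__ == "__main__":
    for speed in (1, 10, 40):                    # Gbps, typical research-link speeds
        # Moving 100 TB: ~222 h at 1 Gbps, ~22 h at 10 Gbps, ~5.6 h at 40 Gbps
        print(f"100 TB at {speed:>2} Gbps ~ {transfer_hours(100, speed):6.1f} hours")
```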
A Case Study for Big Data: The Large Hadron Collider's Compact Muon Solenoid (CMS)
[Image: the CMS tracker]
Why is the CMS project a good example?
Open accessibility of data
Volume of data
Data transported for analysis in real time
Long-term data retention
Data is semi-structured
Grid Infrastructure to Transport CMS Data
Tier 0: CERN Center, fed by the online system at ~PByte/s
Tier 1 (10-40 Gbps): IN2P3 Center, RAL Center, INFN Center, FNAL Center
Tier 2 (~10 Gbps): Caltech, Florida, MIT, Nebraska, Purdue, UCSD, Wisconsin
Tier 3 (1-10 Gbps): NU and other institutes
NU Tier-3 Implementation
Connected to the Internet and research networks (Starlight) at 10 Gbit/s
Administrative nodes: ttgrid01 (services node), ttgrid02 (namenode), ttgrid03 (compute element), ttgrid04 (storage element)
Login nodes (tier3.northwestern.edu): ttlogin01, ttlogin02
Worker nodes: ttnode0001 through ttnode0014, connected by an FDR InfiniBand switch
Common Data Lifecycle Stages
From: Fary, Michael and Owen, Kim. Developing an Institutional Research Data Management Plan Service. EDUCAUSE ACTI white paper, January 2013. http://net.educause.edu/ir/library/pdf/acti1301.pdf
FUNDER REQUIREMENTS
Data Management Plans
Information that should be provided: types of data to be produced; standards or descriptions that would be used with the data (metadata); how these data will be accessed and shared; policies and provisions for data sharing and reuse; provisions for archiving and preservation.
Assistance for Data Management Plans:
DMPTool: https://dmp.cdlib.org
NU Library Data Management web page: http://www.library.northwestern.edu/dmp
Why share data? Why make it open?
Clearly documents and provides evidence for research in conjunction with published results.
Meets copyright and ethical compliance requirements (e.g., HIPAA).
Increases the impact of research through data citation.
Preserves data for long-term access and prevents loss of data.
Describes and shares data with others to further new discoveries and research.
Prevents duplication of research.
Accelerates the pace of research.
Promotes reproducibility of research.
Describe and deposit: metadata
Deposit on publication of article
Some journal publishers require or recommend that supporting data for articles be made publicly available.
The Joint Data Archiving Policy (JDAP) requires data sharing in a public archive as a condition of publication. http://www.datadryad.org/pages/jdap
Journals that have adopted JDAP include Science, Nature, and Genetics.
The author is usually responsible for making data available in a repository/archive.
Check the data archiving policies of journals before submitting articles.
Deposit and share via repository
Does your project have its own repository? Does your funder?
If not, Databib (http://databib.org/) may be helpful for finding a repository.
NU is starting conversations around local data archiving needs and would love to hear your input.
Questions?