Co-existence: Can Big Data and Big Computation Co-exist on the Same Systems?


1 Co-existence: Can Big Data and Big Computation Co-exist on the Same Systems?
Dr. William Kramer
National Center for Supercomputing Applications, University of Illinois

2 Where these views come from
- Large scale simulation: CFD, MD, Fusion, Materials, Chemistry, Climate, Weather, Seismic, Structures
- Large scale experimental systems
  - High Energy and Nuclear Physics: LHC (CMS, STAR), SNO
  - Astronomy: SNF/SNAP, CMB, DES, LSST
  - Genomics: genomics automation, analysis and workflow
- Self Organizing Networks
- Networks and CyberInfrastructures: NASnet, ESnet, Open Science Grid, XSEDE
- Cyber Protection and Security: intrusion detection, other
- Resiliency: large system state and response
- Design and Implementation of HSMs: NAStore, HPSS
- National Aerospace System: Free Flight, AATT

3 Types of Big Data
- Computer Processed Semi-Structured Data
  - Been doing this a long time and reasonably well
- Structured Observational Data
  - Been doing this a long time and reasonably well
- Unstructured Observational Data
  - Capabilities now allow us to consider doing this at unprecedented scale

4 Computer Processed Semi-Structured Data
Examples
- Simulation
- Coordinated data assimilation with analysis
- Traditional business processing
Characteristics
- Structured file-based I/O
  - Many to many, many to few, few to many
  - Claimed to be about a few big files; in reality there are many small files
  - Format is application specific, investigator specific and sometimes domain specific
- Parallel file systems used
  - Performance, reliability, management
- Uses levels of storage devices
  - On-line disk, tape
- Significant amounts of the data are published via copy-and-post methods
  - PCMDI, QCD lattices, protein databases

5 Structured Observational Data
Examples
- HEP/NP: LHC, CMS, SNO
- Astronomy: DES, LSST, SKA, Supernova, CMB
- EOS
Characteristics
- Structured file-based I/O
- Domain-specific metadata structure
  - Custom, database
- Parallel and non-parallel file systems used, sometimes not
- Uses levels of storage devices
  - On-line disk, tape
- Globally accessible
- Much of the data is shared in a distributed hierarchy
  - Tier 0, 1, 2, 3
- Mechanisms for automatic discovery and retrieval

6 Unstructured Observational Data
Examples
- Textual: tweets, documents, genomic sequence segments, log files
- Images: YouTube, surveillance videos, images
- Combined: medical records
- Other: manufacturing and vehicle control systems
Characteristics
- Often minimal metadata initially
  - Significant background processing to improve organization and retrieval
- Hadoop or other custom filesystems
- Asynchronous creation
- Small atomic units, mostly randomly accessed
- Storage system
  - Typically only on-line, but that may change
  - Mostly local storage on nodes: have to schedule work on nodes with the data, or move the data
- Coordination via reading and writing files
- Much of the data is served after simple searches via portals and browsers
- Mechanisms for automatic discovery and retrieval
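
The "schedule work on nodes with data or move the data" choice on this slide can be illustrated with a minimal sketch. Everything named here (`place_task`, the node names, the single shared link bandwidth) is a hypothetical illustration, not an interface of Hadoop or any HPC scheduler: a task prefers a free node that already holds a replica of its data block, and otherwise pays a transfer cost proportional to block size over bandwidth.

```python
def place_task(block_replicas, free_nodes, block_bytes, link_bytes_per_s):
    """Pick a node for one task and estimate the data-movement time.

    block_replicas: set of node names holding a copy of the input block
    free_nodes: set of node names that can accept work now
    Returns (node, transfer_seconds). All names are hypothetical.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")

    # Prefer a free node that already stores the block: no data movement.
    local = sorted(n for n in free_nodes if n in block_replicas)
    if local:
        return local[0], 0.0

    # Otherwise move the data: cost is block size over link bandwidth.
    node = sorted(free_nodes)[0]
    return node, block_bytes / link_bytes_per_s


if __name__ == "__main__":
    # Block lives on n1 and n3; only n2 and n3 are free, so n3 is chosen.
    print(place_task({"n1", "n3"}, {"n2", "n3"}, 128 * 2**20, 1e9))
```

The sketch ignores contention on the link and on the storage nodes, which is the part a bandwidth-aware scheduler would have to model.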


7 Example Cost Comparison
- From the Xyratex white paper "Map/Reduce on Lustre - Hadoop Performance in HPC Environments" by Nathan Rutman
- Can HPC file systems do better than HDFS?
- Clusters sized for similar performance on MapReduce tests
  - TestDFSIO, Hadoop Sort, read/write, etc.
- Based partially on the observation that storing 3 copies is more expensive than RAID storage + controllers/OSTs
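
The capacity side of the "3 copies versus RAID" argument can be worked through with a short sketch. The 8+2 RAID geometry and the 100 TB figure are assumptions chosen for illustration only; they are not taken from the white paper, and the sketch counts raw disk capacity alone, not the controller or OST costs that the slide also mentions.

```python
def raw_capacity_replicated(usable_tb, copies=3):
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies


def raw_capacity_raid(usable_tb, data_disks=8, parity_disks=2):
    """Raw capacity needed for a parity RAID group (assumed 8+2 layout)."""
    return usable_tb * (data_disks + parity_disks) / data_disks


if __name__ == "__main__":
    usable = 100.0  # hypothetical usable capacity in TB
    print(raw_capacity_replicated(usable))  # 300.0
    print(raw_capacity_raid(usable))        # 125.0
```

Under these assumptions, triple replication needs 2.4 times the raw disk of the parity layout for the same usable capacity; whether that outweighs the added controller hardware depends on prices the slide does not give.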

8 Resource Management
- Resource scheduling functions on large scale systems have the features needed to schedule work as people expect for Big Data
  - Scheduling decisions are based on a culture that is made up of users, providers, stakeholders, etc.
  - Queuing theory determines the tradeoffs of utilization vs response time
  - In Big Data, batch background work is done for what we perceive as interactive query response
    - E.g. crawling and indexing web pages, image preparation, weather products
- Changes required
  - Need to schedule bandwidth, not processors
  - How many units need to be scheduled as a unit
  - For Big Data, need to move computation to the data or move the data
  - Can do at least space sharing within a system
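
The utilization versus response time tradeoff can be made concrete with the simplest queuing model. The slide does not name a model; the sketch below assumes an M/M/1 queue (one server, random arrivals and service times), where mean response time is the mean service time divided by one minus utilization. The function name is hypothetical.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time (wait + service) of an M/M/1 queue.

    Assumes a single server with Poisson arrivals and exponential
    service times; utilization must be in [0, 1).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Response time grows sharply as utilization approaches 1.
    for rho in (0.5, 0.8, 0.9, 0.95):
        print(f"utilization {rho:.2f}: {mm1_response_time(1.0, rho):.1f}x service time")
```

This is why a system run for maximum utilization cannot also promise interactive response, and why background batch work is used to prepare what later looks like an interactive query.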

9 Architecture
- What architectural changes are needed for extreme computing storage systems to make them better suited for BD?
  - Better small scale atomic I/O
  - Solid state storage?
  - A new storage repository, non-POSIX?
  - Seamless storage hierarchies
- What operational changes are needed to support new storage architectures?
  - Yes: the critical resource is bandwidth, not CPU
- Looking at future technologies, what future architectures are possible?
  - Interconnect is the most essential
  - Processor technology can be whatever it is
  - Energy efficient memory

10 What do we need to investigate
- Software layers that interface the MapReduce programming framework to HPC file systems, AND software layers that run parallel POSIX I/O on HDFS implementations
- What lessons from HPC parallelism can be applied to Big Data applications
  - Replace workflow communication by files with communication by memory
- Create robust time-to-solution performance evaluation suites that can be used to explore claims of price performance of architectures and implementations
  - Representing all three types of data use, and know which use models require them
- Need to manage non-traditional resources, e.g. bandwidth rather than processors
- Need to manage time to solution rather than time to start
- Change mind share
  - Traditional HPC = maximizing CPU use
  - Big Data = CPU is not an important resource
- Understand the role(s) of virtualization and virtual machines
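
The distinction between time to start and time to solution can be sketched as follows. The decomposition (queue wait plus data movement plus compute) and all numbers are assumptions for illustration; `JobEstimate` is a hypothetical helper, not part of any scheduler.

```python
from dataclasses import dataclass


@dataclass
class JobEstimate:
    """Hypothetical per-job estimate on one system."""
    queue_wait_s: float     # time until the job starts
    data_bytes: float       # data that must be read or staged
    io_bytes_per_s: float   # sustained bandwidth available to the job
    compute_s: float        # pure computation time

    def time_to_start(self):
        return self.queue_wait_s

    def time_to_solution(self):
        # Assumed model: wait, then move the data, then compute.
        io_s = self.data_bytes / self.io_bytes_per_s
        return self.queue_wait_s + io_s + self.compute_s


if __name__ == "__main__":
    # System A starts the job quickly but has 10x less bandwidth.
    a = JobEstimate(queue_wait_s=60, data_bytes=1e12, io_bytes_per_s=1e9, compute_s=600)
    b = JobEstimate(queue_wait_s=600, data_bytes=1e12, io_bytes_per_s=1e10, compute_s=600)
    print(a.time_to_start(), a.time_to_solution())  # 60 1660.0
    print(b.time_to_start(), b.time_to_solution())  # 600 1300.0
```

In this made-up case the system that starts the job ten times sooner finishes later, because bandwidth rather than processor availability dominates the total.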

11 Acknowledgements
This work is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI ) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign, its National Center for Supercomputing Applications, Cray, and the Great Lakes Consortium for Petascale Computation. The work described is achievable through the efforts of the many others on different teams.
