Data Centric Computing
Piyush Chaudhary
HPC Solutions Development
<piyushc@us.ibm.com>
SPXXL/SCICOMP Summer 2011
Agenda
- What is Data Centric Computing?
- What is Driving Data Centric Computing?
- Puzzle vs. Mystery
- Characteristics of Data Centric Computing Workloads
- Hardware Changes Needed to Handle Data Centric Computing Workloads
5/11/2011
What is Data Centric Computing?
"Data-centric computing concerns the acquisition, processing, analysis, storage, and query of data sets and streams." (M. Gokhale et al.)
- Not a new concept, but the growth in data has brought renewed focus and specialization
- A rose by any other name: Data Intensive Computing, Data Mining, Data Warehousing, Analytics, Deep Q&A, Big Data
How is Data Centric Computing different from traditional compute centric computing?
- In the traditional model, data is moved through the storage hierarchies to the compute resources as needed; in the data centric model, computation is done where the data lives
- Operations per byte are typically low in the data centric model compared to the compute centric model
- Data Centric Computing has a shallow, persistent storage hierarchy
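The contrast between the two models can be sketched in a few lines. This is an illustrative sketch only (the node names and partitioning are invented, not from the talk): both functions compute the same result, but the data centric version moves one partial result per node instead of every byte of data.

```python
# Data partitioned across nodes, each holding a local shard (illustrative).
DATA_NODES = {
    "node0": [1, 4, 9],
    "node1": [16, 25],
}

def compute_centric_sum():
    # Traditional model: pull every shard across the storage hierarchy
    # to a central compute resource, then operate on it there.
    central = []
    for shard in DATA_NODES.values():
        central.extend(shard)       # data movement dominates
    return sum(central)

def data_centric_sum():
    # Data centric model: ship the (small) function to each node,
    # compute where the data lives, and move only the partial results.
    partials = [sum(shard) for shard in DATA_NODES.values()]
    return sum(partials)            # only one number per node moves

assert compute_centric_sum() == data_centric_sum() == 55
```

The same function-shipping idea underlies MapReduce-style systems, where scheduling work onto the node that already holds the data is the main lever for efficiency.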
What is Driving Data Centric Computing?
- Business data is doubling every 1.2 years*
- Companies that adopt data-driven decision making achieve a 5-6% improvement in productivity beyond what can be explained by other factors; this difference is enough to separate winners from losers in most industries
  - Based on research by Erik Brynjolfsson (Sloan School of Management, MIT), Lorin Hitt (Wharton School, University of Pennsylvania), and Heekyung Kim (MIT)
  - First quantitative evidence behind the productivity growth anecdotes
- Data Centric Computing has been instrumental to the success of companies like Google, Facebook, and many more
- To help companies find meaningful patterns by sifting through business data, companies like IBM, Oracle, SAP, and Microsoft have collectively spent over $25B buying up specialist companies in the field
  - IBM alone has spent $14B on 25 companies that focus on data analytics
  - IBM employs over 8,000 consultants and 200 mathematicians focused on analytics
  - IBM expects this business to grow to $16B by 2015
* Lohr, S., "When there's no such thing as too much information," The New York Times, April 23, 2011
Puzzle vs. Mystery*
Puzzle:
- A critical piece of data is missing
- Need to add to the data collection
- Need to develop systems capable of ingesting large amounts of data, then summarizing and correlating it
Mystery:
- All the data is available; in fact, there may be too much
- Requires judgment and assessment of uncertainty
- Need to develop expert systems that analyze the available information, then categorize, rank, and correlate it
* Gladwell, M., "Open Secrets: Enron, intelligence, and the perils of too much information," The New Yorker, January 8, 2007. URL: http://www.gladwell.com/2007/2007_01_08_a_secrets.html
Characteristics of Data Centric Computing Workloads
Characteristics:
- Pattern matching in unstructured data
  - Real time or forensic
  - Exact or approximate matching
  - Text, video, speech, web, mixed
- Record processing in structured data
  - Database requirements
- Analysis / computation
  - Graph assembly and analysis
  - Correlation and scoring
  - Sorting
  - Optimization
Requirements:
- Pattern matching in unstructured data
  - Low ops per byte
  - Random access
  - Integer dominated
- Record processing in structured data
  - Coherency
  - Locking
- Analysis / computation (relative to HPC)
  - Mixed integer / floating point
  - Low computational load per byte
  - Parallelism in algorithms and data difficult to identify
  - Low locality in algorithms and data
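The "low ops per byte" point above can be made concrete with back-of-the-envelope arithmetic. This sketch (the kernel choices and sizes are assumptions for illustration, not from the talk) contrasts a data-centric scan, which does roughly one comparison per byte touched, with a compute-centric dense matrix multiply, whose operation count grows much faster than its data size:

```python
def scan_ops_per_byte(n_bytes):
    # A substring/pattern scan touches each byte roughly once:
    # about one compare per byte, regardless of data size.
    ops = n_bytes
    return ops / n_bytes  # ~1 op/byte

def matmul_ops_per_byte(n):
    # Dense n x n matrix multiply on 8-byte floats:
    # 2*n^3 flops over 3*n^2*8 bytes of operands and result.
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * 8
    return flops / bytes_moved  # grows as n/12

print(scan_ops_per_byte(10**9))   # ~1 op/byte no matter the size
print(matmul_ops_per_byte(4096))  # hundreds of ops/byte at this size
```

The scan's ratio is fixed near one, so its performance is bounded by how fast bytes can be delivered; the matmul's ratio grows with problem size, which is why caches and deep memory hierarchies pay off for compute-centric work but not for data-centric scans.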
Hardware Changes Required to Handle Data Centric Computing Workloads
Note: for a class of data centric computing workloads, the current trajectory of systems will be sufficient
Processor:
- Fast integer operations
- Efficient vector integer operations
Memory:
- Low latency and high bandwidth for small random accesses
- Intelligent prefetching with explicit pattern following
- Partial cache line fetch
Network:
- Low latency and high bandwidth for small random messages
- Resilience in the face of multiple link failures (availability and performance)
- Similar features needed for external connectivity
Storage:
- Biggest area of concern; needs the most innovation
- Storage class memory will be key to meeting the challenges
System Architecture
- Moving data through storage, memory, and cache hierarchies is very inefficient for data intensive workloads, since the typical operations per byte are low
- Disk latency and bandwidth trajectories are not on track to support data intensive computing workloads
- We need to rethink how systems are built to support these workloads:
  - Embed compute resources with storage/memory: 3D packaging with memory and compute units in the same stack
  - Provide near-line storage based on SCM
  - Balance power and performance to match workload-specific needs
A Note on Storage Class Memory (SCM)*
- The gap between disk latency and the performance of the rest of the system, already six orders of magnitude, continues to widen
- Although the areal density of disk platters continues to improve, albeit at a slower rate, the bit error rate per gigabyte is not keeping up and is in fact getting worse
- Similarly, IOP rates are not rising in proportion to the areal density increase
- The cost of disks continues to fall, but also at a slower rate
- Power reduction for spinning disks is approaching its limit, and the power consumption of the storage subsystem, as a percentage of the total system, is increasing
- SCM technologies promise to address all these issues by enabling compact, robust storage systems with greatly improved cost/performance ratios compared to today's state-of-the-art systems
* Freitas, R. F.; Wilcke, W. W., "Storage-class memory: The next storage system technology," IBM Journal of Research and Development, vol. 52, no. 4.5, pp. 439-447, July 2008. doi:10.1147/rd.524.0439. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5388608&isnumber=5388602
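The "six orders of magnitude" gap can be sanity-checked with a few commonly cited device latencies from the era. The specific numbers below are assumptions for illustration, not figures from the talk; the gap comes out between roughly five and seven orders of magnitude depending on whether DRAM or a CPU cycle is taken as the reference point:

```python
import math

# Rough device latencies circa 2011 (assumed values, not from the talk).
disk_seek_s   = 5e-3    # ~5 ms: one random disk access (seek + rotation)
dram_access_s = 70e-9   # ~70 ns: one DRAM access
cpu_cycle_s   = 0.5e-9  # ~0.5 ns: one cycle of a 2 GHz core

# Orders of magnitude separating disk from the rest of the system.
gap_vs_dram = math.log10(disk_seek_s / dram_access_s)  # ~4.9
gap_vs_cpu  = math.log10(disk_seek_s / cpu_cycle_s)    # ~7.0

print(f"disk vs DRAM: {gap_vs_dram:.1f} orders; "
      f"disk vs CPU cycle: {gap_vs_cpu:.1f} orders")
```

SCM technologies target precisely this middle ground: persistent media whose access latency sits within a few orders of magnitude of DRAM rather than of disk.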