I am a Data Nerd and so are YOU!
Not This Type of Nerd
Data Nerd Coffee Talk

We saw Cloudera as the lone open source champion of Hadoop and the EMC/Greenplum/MapR initiative as a more closed and proprietary long shot.

We are the only real-time data store that combines fine-grained security controls, scalability to the 10s of petabytes, flexible schemas for unstructured and semi-structured information, and diverse analytical capabilities.

I am working on a framework to allow construction of large-scale analytical queries on unstructured data.

The IoT and big data are clearly growing apace, and are set to transform many areas of everyday life. But which particular sectors are likely to feel the IoT/big data disruption first?
What is Big Data?
Big data is like teenage sex:
  Everyone talks about it,
  Nobody really knows how to do it,
  Everyone thinks everyone else is doing it,
  So everyone claims they are doing it
- Dan Ariely
The Four V's of BIG DATA

Volume: SCALE OF DATA
  40 ZETTABYTES of data will be created by 2020, an increase of 300 times from 2005
  6 BILLION PEOPLE have cell phones
  2.5 QUINTILLION BYTES of data are created each day
  Most companies in the US have at least 100 TERABYTES of data stored

Variety: DIFFERENT FORMS OF DATA
  Global healthcare data size is estimated to be 150 EXABYTES
  400 MILLION TWEETS are sent per day by about 200 million monthly users
  By 2015, it's anticipated there will be 420 MILLION WEARABLE, WIRELESS HEALTH MONITORS
  4 BILLION+ HOURS OF VIDEO are watched on YouTube each month

Velocity: STREAMING DATA
  NYSE captures 1 TB OF TRADE INFORMATION during each trading session
  Modern cars have close to 100 SENSORS that monitor fuel levels, tire pressure, etc.

Veracity: UNCERTAINTY OF DATA
  1 IN 3 BUSINESS LEADERS don't trust the information they use to make decisions
  Poor data quality costs the US economy around $3.1 TRILLION A YEAR
  27% OF RESPONDENTS were unsure of how much of their data was inaccurate

Problems that are unsolvable using traditional tools
Source: McKinsey Global Institute
Data Example: Structured Relational
Data Example: Structured XML/HL7
Data Example: Unstructured Log File
Data Example: Unstructured Tweet
Data Example: Unstructured Sensor
Data Example: Unstructured Image
Summary of Common Clinical Data

ICD
  Availability: High
  Recall: Medium
  Precision: Medium
  Format: Structured
  Pros: Easy to work with, a good approximation of disease
  Cons: Disease code often used for screening, therefore disease might not be there

CPT
  Availability: High
  Recall: Poor
  Precision: High
  Format: Structured
  Pros: Easy to work with, high precision
  Cons: Missing data

Lab
  Availability: High
  Recall: Medium
  Precision: High
  Format: Mostly structured
  Pros: High data validity
  Cons: Data normalization and ranges

Medication
  Availability: Medium
  Recall: Inpatient: High; Outpatient: Variable
  Precision: Inpatient: High; Outpatient: Variable
  Format: Structured and unstructured
  Pros: High data validity
  Cons: Prescribed, not necessarily taken

Clinical Notes
  Availability: Medium
  Recall: Medium
  Precision: Medium high
  Format: Unstructured
  Pros: More details about doctors' thoughts
  Cons: Difficult to process

Source: Joshua Denny: Mining Electronic Health Records
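The recall and precision ratings above can be made concrete with a small calculation. The sketch below is an illustration only and is not taken from the source: the function name and the example counts are hypothetical, and it assumes a case definition (for example, "patient has a given ICD code") has been checked against chart review.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Return (precision, recall) for a case definition checked against chart review.

    precision: of the patients the definition flagged, the fraction who truly have the disease
    recall:    of the patients who truly have the disease, the fraction the definition flagged
    """
    flagged = true_pos + false_pos
    diseased = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / diseased if diseased else 0.0
    return precision, recall


# Hypothetical counts: 80 flagged patients confirmed, 20 flagged in error,
# 40 true cases that the definition missed.
precision, recall = precision_recall(true_pos=80, false_pos=20, false_neg=40)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A screening code that is recorded without the disease being present adds false positives and lowers precision; data that was never recorded adds false negatives and lowers recall.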
Hadoop

Hadoop Distributed File System (HDFS)
  Commodity hardware
  Files stored as blocks
  Reliability achieved through replication (clusters)

Master Node [few]
  JobTracker: distributes Map/Reduce tasks
  NameNode: coordinates metadata

Worker Node [many]: processing power
  TaskTracker: receives Map/Reduce tasks
  DataNode: stores and replicates data across the cluster

Open-source framework
  Allows for multiple programming languages
  Parallel processing
  Task-based programming logic
  Map/Reduce (aka divide and conquer)
  No data caching

Components will fail at a high rate (and that's OK)
Data will be contained in a relatively small number of big files
Data files are write-once
Lots of streaming reads
Higher sustained throughput across large amounts of data

Distributions & Programming Tools
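The Map/Reduce "divide and conquer" idea can be sketched in a few lines. This is a single-process illustration in plain Python, not Hadoop's API: the function names are hypothetical, and a real cluster would run the map and reduce steps as parallel tasks on many worker nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big nerd"]))
```

Because each line is mapped independently and each key is reduced independently, the work can be split across machines without the tasks needing to talk to each other.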
Hadoop in the Enterprise

Data sources
  SCIENCE: Medical imaging, sensor data, genome sequencing, EHR
  INDUSTRY: Financial, pharma, manufacturing, insurance, fraud, retail
  LEGACY: Claims, sales, customer behavior, product databases, accounting
  SYSTEM DATA: Log files, health feeds, activity streams, network messages, web analytics

File formats: XML, CSV, EDI, LOG, SQL, TXT, JSON, BIN, JPEG, OBJ, KLM

[Diagram labels: Environment; Enterprise; Create Map; Reduce; Commodity Server Cloud; Import; Hadoop Distributed File System (HDFS); RDBMS; Dashboards; Business Intelligence; Applications; 1 High Volume; 2 MapReduce Process; 3 Data Flows; Consumer Results]

Source: www.ebizq.net/blogs/enterprise
Data Science

Big Data needs Data Science but Data Science doesn't need Big Data
- Carla Gentry aka @data_nerd
Dr. Ashenfelter

Wine quality = 12.145 + 0.00117 * winter rainfall + 0.0614 * average growing season temperature - 0.00386 * harvest rainfall
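As a worked example, the regression can be evaluated directly. The sketch below assumes the formula as it is usually quoted, with wine quality on the left-hand side and harvest rainfall entering with a negative coefficient. The function name and the input values are hypothetical, and the slide does not state units, so none are assumed here.

```python
def wine_quality(winter_rainfall: float,
                 growing_season_temp: float,
                 harvest_rainfall: float) -> float:
    """Evaluate the wine-quality regression using the coefficients shown on the slide.

    Units are not given on the slide; inputs must use whatever units
    the original regression was fitted on.
    """
    return (12.145
            + 0.00117 * winter_rainfall
            + 0.0614 * growing_season_temp
            - 0.00386 * harvest_rainfall)


# Hypothetical inputs, chosen only to show the arithmetic:
# 12.145 + 0.702 + 1.0438 - 0.386
print(round(wine_quality(600, 17, 100), 4))
```

Three weather measurements and four coefficients are all the model needs, which is the slide's point: the data set is small, but the analysis is still data science.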
Analytic Approaches

[Quadrant chart. Vertical axis: Analytic Complexity, from "Accurate historical observations" up to "Predictive & real-time analytic capabilities". Horizontal axis: Size of data, from "Small amount of data or samples (MB/GB)" to "Large amount of data (TB/PB)".]

Advanced Analytics (small data, high complexity)
  Smaller data sets using advanced techniques to analyze impact of future scenarios

Big-Data Analytics (large data, high complexity)
  Can fuse different data types on a massive scale, resulting in predictive and real-time analysis capabilities

Basic Analytics (small data, low complexity)
  Relies on historical observations to help avoid past mistakes and duplicate past successes

Big-Data Computing (large data, low complexity)
  Data become more consolidated while analytic workflows are more streamlined and automated

Source: Booz Allen Hamilton
Matthew 5:5 (Ultra-Revised Standard Version)
And the Data Nerds Shall Inherit the Earth
Lucas M Tramontozzi SCI Solutions ltramontozz@scisolutions.com 202-669-4715 @bigdatadaddy