I am a Data Nerd and so are YOU!



Not This Type of Nerd

Data Nerd Coffee Talk
We saw Cloudera as the lone open source champion of Hadoop and the EMC/Greenplum/MapR initiative as a more closed and proprietary long shot.
We are the only real-time data store that combines fine-grained security controls, scalability to the 10s of petabytes, flexible schemas for unstructured and semi-structured information, and diverse analytical capabilities.
I am working on a framework to allow construction of large-scale analytical queries on unstructured data.
The IoT and big data are clearly growing apace, and are set to transform many areas of everyday life. But which particular sectors are likely to feel the IoT/big data disruption first?

What is Big Data?

"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." - Dan Ariely

The Four V's of BIG DATA
Problems that are unsolvable using traditional tools

Volume: SCALE OF DATA
40 ZETTABYTES of data will be created by 2020, an increase of 300 times from 2005
6 BILLION PEOPLE have cell phones
2.5 QUINTILLION BYTES of data are created each day
Most companies in the US have at least 100 TERABYTES of data stored

Variety: DIFFERENT FORMS OF DATA
Global healthcare data size is estimated to be 150 EXABYTES
400 MILLION TWEETS are sent per day by about 200 million monthly users
By 2015, it's anticipated there will be 420 MILLION WEARABLE, WIRELESS HEALTH MONITORS
4 BILLION+ HOURS OF VIDEO are watched on YouTube each month

Velocity: STREAMING DATA
NYSE captures 1 TB OF TRADE INFORMATION during each trading session
Modern cars have close to 100 SENSORS that monitor fuel levels, tire pressure, etc.

Veracity: UNCERTAINTY OF DATA
1 IN 3 BUSINESS LEADERS don't trust the information they use to make decisions
Poor data quality costs the US economy around $3.1 TRILLION A YEAR
27% OF RESPONDENTS were unsure of how much of their data was inaccurate

Source: McKinsey Global Institute

Data Example: Structured Relational

Data Example: Structured XML/HL7

Data Example: Unstructured Log File

Data Example: Unstructured Tweet

Data Example: Unstructured Sensor

Data Example: Unstructured Image

Summary of Common Clinical Data

ICD
  Availability: High
  Recall: Medium
  Precision: Medium
  Format: Structured
  Pros: Easy to work with, a good approximation of disease
  Cons: Disease code often used for screening, therefore disease might not be there

CPT
  Availability: High
  Recall: Poor
  Precision: High
  Format: Structured
  Pros: Easy to work with, high precision
  Cons: Missing data

Lab
  Availability: High
  Recall: Medium
  Precision: High
  Format: Mostly structured
  Pros: High data validity
  Cons: Data normalization and ranges

Medication
  Availability: Medium
  Recall: Inpatient: High; Outpatient: Variable
  Precision: Inpatient: High; Outpatient: Variable
  Format: Structured and unstructured
  Pros: High data validity
  Cons: Prescribed not necessarily taken

Clinical Notes
  Availability: Medium
  Recall: Medium
  Precision: Medium high
  Format: Unstructured
  Pros: More details about doctors' thoughts
  Cons: Difficult to process

Source: Joshua Denny: Mining Electronic Health Records
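The recall and precision ratings in the clinical data summary can be made concrete with a small calculation. The sketch below is illustrative only: the patient IDs and the function name `precision_recall` are hypothetical and do not come from the slides. It treats "flagged" as the patients a data source (for example, an ICD code) marks as having a disease, and "truly positive" as the patients who actually have it.

```python
def precision_recall(flagged, truly_positive):
    """Return (precision, recall) for a set of flagged patient IDs.

    precision = flagged patients who really have the disease / all flagged
    recall    = flagged patients who really have the disease / all who have it
    """
    flagged = set(flagged)
    truly_positive = set(truly_positive)
    true_positives = len(flagged & truly_positive)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_positive) if truly_positive else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical example: a disease code flags patients 1-4,
    # but chart review shows patients 2-6 actually have the disease.
    precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the code is right for 3 of the 4 patients it flags, and it finds 3 of the 5 patients who have the disease. That is the sense in which a source can be precise while still missing cases.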

Hadoop

Hadoop Distributed File System (HDFS)
  Commodity hardware
  Files stored as blocks
  Reliability achieved through replication (clusters)
  No data caching

Master Node [few]
  JobTracker: distributes Map/Reduce tasks
  NameNode: coordinates metadata

Worker Node: processing power [many]
  TaskTracker: receives Map/Reduce tasks
  DataNode: stores and replicates data across the cluster

Open-Source Framework
  Allows for multiple programming languages
  Parallel processing
  Task-based programming logic
  Map/Reduce (aka divide and conquer)

Components will fail at a high rate (and that's OK)
Data will be contained in a relatively small number of big files
Data files are write-once
Lots of streaming reads
Higher sustained throughput across large amounts of data

Distributions & Programming Tools
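The Map/Reduce "divide and conquer" idea can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made for the example. On a real cluster the map and reduce calls would run as separate tasks on many worker nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = [
        "big data needs data science",
        "data science does not need big data",
    ]
    print(word_count(sample))
```

Each map call looks at only one line and each reduce call at only one key. That independence is what lets a framework spread the tasks across many machines and rerun a task when a component fails.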

Hadoop in the Enterprise

Data sources
  SCIENCE: Medical imaging, sensor data, genome sequencing, EHR
  INDUSTRY: Financial, pharma, manufacturing, insurance, fraud, retail
  LEGACY: Claims, sales, customer behavior, product databases, accounting
  SYSTEM DATA: Log files, health feeds, activity streams, network messages, web analytics

File formats: XML, CSV, EDI, LOG, SQL, TXT, JSON, BIN, JPEG, OBJ, KLM

Flow
  1 High Volume Data Flows
  2 MapReduce Process
  3 Consumer Results

Diagram labels: Enterprise Environment, Create Map, Reduce, Commodity Server Cloud, Hadoop Distributed File System (HDFS), Import, RDBMS, Dashboards, Business Intelligence, Applications

Source: www.ebizq.net/blogs/enterprise

Data Science
"Big Data needs Data Science but Data Science doesn't need Big Data" - Carla Gentry aka @data_nerd

Dr. Ashenfelter
Wine quality = 12.145 + 0.00117 * winter rainfall + 0.0614 * average growing season temperature - 0.00386 * harvest rainfall
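As a worked example, the Ashenfelter equation can be written as a small function. The coefficients are those on the slide. The harvest-rainfall term is subtracted here, as in the commonly quoted form of the equation. The function name `wine_quality`, the argument names, and the input values are hypothetical, and the slide does not state units, so the numbers below only illustrate the arithmetic.

```python
def wine_quality(winter_rainfall, growing_season_temp, harvest_rainfall):
    """Evaluate the linear wine-quality equation from the slide.

    Units are not given on the slide; the caller must supply values in
    whatever units the coefficients were fitted with.
    """
    return (
        12.145
        + 0.00117 * winter_rainfall
        + 0.0614 * growing_season_temp
        - 0.00386 * harvest_rainfall  # rain at harvest lowers the score
    )


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the calculation.
    score = wine_quality(
        winter_rainfall=600, growing_season_temp=17, harvest_rainfall=100
    )
    print(round(score, 4))
```

With these inputs the terms are 12.145 + 0.702 + 1.0438 - 0.386. More winter rain and a warmer growing season raise the score, while rain at harvest lowers it.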

Analytic Approaches

Axes: analytic complexity (from accurate historical observations to predictive and real-time analytic capabilities) against size of data (from a small amount of data or samples, MB/GB, to a large amount of data, TB/PB)

Advanced Analytics (small data, high complexity): Smaller data sets using advanced techniques to analyze the impact of future scenarios
Big-Data Analytics (large data, high complexity): Can fuse different data types on a massive scale, resulting in predictive and real-time analysis capabilities
Basic Analytics (small data, low complexity): Relies on historical observations to help avoid past mistakes and duplicate past successes
Big-Data Computing (large data, low complexity): Data become more consolidated while analytic workflows are more streamlined and automated

Source: Booz Allen Hamilton

Matthew 5:5 (Ultra-Revised Standard Version) And the Data Nerds Shall Inherit the Earth

Lucas M Tramontozzi SCI Solutions ltramontozz@scisolutions.com 202-669-4715 @bigdatadaddy
