
Design Document
Product Intelligence on Medical Devices
Version 1.1
(A Project sponsored by HadoopExpress.com, a Net Serpents enterprise)
Copyright 2016 Net Serpents LLC, 2001 Route 46, Parsippany NJ 07054. All rights reserved.
This document contains material protected under international and federal copyright laws and treaties. Any unauthorized reprint or use of this material is strictly prohibited.

Contents
1. Introduction
   1.1 Scope
   1.2 Overview
2. General Description
   2.1 POC Perspective
   2.2 Big Data Components Used
   2.3 General Constraints
   2.4 Assumption
3. Design Details
   3.1 Technology Architecture
   3.2 Database Schema
   3.3 R Shiny User Interface
4. Reusable Components

1. Introduction

1.1 Scope

This design document gives an overview of the architectural design and the prediction model to be implemented for Product Intelligence on Medical Devices, using the MAUDE (Manufacturer and User Facility Device Experience) data available on the FDA site. It also shows how big data components are used to store the raw data, preprocess it, and build a predictive model. Product Intelligence on Medical Devices shall also support informed decision making based on the forecasting implemented as part of this scope.

2. General Description

2.1 Proof of Concept Perspective

The datasets present in the local file system are moved to HDFS. Preprocessing, filtering and cleaning of the data are done through shell scripting in HDFS, and the cleaned datasets are moved to HIVE. The final dataset is prepared in HIVE and comprises the following columns:

MDR Report Key
Device Event Key
Brand Name
Generic Name
Manufacturer Name
Manufacturer City

Manufacturer State Code
Manufacturer Country Code
Problem Description
Number of Devices in Event
Number of Patients in Event
Adverse Event Flag
Date of Event
Event Location
Remedial Action
Event Type

2.2 Big Data Components Used

1. Apache Hadoop Distributed File System (HDFS), a Java-based file system that provides scalable and reliable data storage.
2. Apache HIVE, a database query interface for Apache Hadoop that works like SQL.
3. RStudio, an open source tool for statistical computing and graphics.

2.3 General Constraints

The process is to be automated and made as scalable as possible, for reusability.

2.4 Assumption

The project is based on the idea of post-market surveillance of medical devices, and the goal is to make this a reality using Big Data and Analytics capabilities.

3. Design Details

3.1 Technology Architecture

The MAUDE data from the FDA site has been downloaded to the local file system using shell scripts and moved onto HDFS. Once the raw data is available in HDFS, HIVE tables are created on top of the HDFS files and the data is loaded into those tables. Integration of the tables is done in HIVE, along with filtering and preprocessing of all table data. Once the final processed data is available in HIVE, R streaming is used to bring the final table data into R for model building and visualization.

3.2 Database Schema

In total, four datasets are downloaded, preprocessed and stored in HIVE.
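The shell-based filtering and cleaning step described above can be sketched as follows. The file names, field positions and sample records here are illustrative assumptions, not the project's actual layout:

```shell
#!/bin/sh
# Sketch of the shell cleaning step, run on a tiny hand-made sample instead
# of a real MAUDE extract. Assumed layout: key|event_key|adverse_flag|event_type.
cat > sample_master.txt <<'EOF'
1000|E1|Y|IN
1001|E2|N|M
1002||Y|D
EOF

# Drop rows with an empty event key, keeping only the report key and flag.
awk -F'|' '$2 != "" { print $1 "|" $3 }' sample_master.txt > cleaned.txt

cat cleaned.txt
```

On the cluster, a file cleaned this way would then be pushed to HDFS (e.g. with `hdfs dfs -put`) before the HIVE tables are built over it.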

Notes:
foidevxxxx relates to device data for year xxxx
mdrfoithru2015 relates to master event data
The final data table is to be derived from the above two, plus foi_devproblem and deviceproblemcodes
The MDR Report Key should be used to join the tables

3.2.1 Source Files

3.2.1.1 MDRFOITHRU2015: This file contains the following master event data, in the order presented. Bold columns are to be extracted.

1. MDR_REPORT_KEY
2. EVENT_KEY

3. REPORT_NUMBER
4. REPORT_SOURCE_CODE
5. MANUFACTURER_LINK_FLAG_
6. NUMBER_DEVICES_IN_EVENT
7. NUMBER_PATIENTS_IN_EVENT
8. DATE_RECEIVED
9. ADVERSE_EVENT_FLAG
10. PRODUCT_PROBLEM_FLAG
11. DATE_REPORT
12. DATE_OF_EVENT
13. REPROCESSED_AND_REUSED_FLAG
14. REPORTER_OCCUPATION_CODE
15. HEALTH_PROFESSIONAL
16. INITIAL_REPORT_TO_FDA
17. DATE_FACILITY_AWARE
18. REPORT_DATE
19. REPORT_TO_FDA
20. DATE_REPORT_TO_FDA
21. EVENT_LOCATION
22. DATE_REPORT_TO_MANUFACTURER
23. MANUFACTURER_CONTACT_T_NAME
24. MANUFACTURER_CONTACT_F_NAME
25. MANUFACTURER_CONTACT_L_NAME
26. MANUFACTURER_CONTACT_STREET_1
27. MANUFACTURER_CONTACT_STREET_2
28. MANUFACTURER_CONTACT_CITY
29. MANUFACTURER_CONTACT_STATE
30. MANUFACTURER_CONTACT_ZIP_CODE
31. MANUFACTURER_CONTACT_ZIP_EXT
32. MANUFACTURER_CONTACT_COUNTRY
33. MANUFACTURER_CONTACT_POSTAL
34. MANUFACTURER_CONTACT_AREA_CODE
35. MANUFACTURER_CONTACT_EXCHANGE
36. MANUFACTURER_CONTACT_PHONE_NO
37. MANUFACTURER_CONTACT_EXTENSION
38. MANUFACTURER_CONTACT_PCOUNTRY
39. MANUFACTURER_CONTACT_PCITY
40. MANUFACTURER_CONTACT_PLOCAL
41. MANUFACTURER_G1_NAME
42. MANUFACTURER_G1_STREET_1
43. MANUFACTURER_G1_STREET_2
44. MANUFACTURER_G1_CITY
45. MANUFACTURER_G1_STATE_CODE
46. MANUFACTURER_G1_ZIP_CODE
47. MANUFACTURER_G1_ZIP_CODE_EXT
48. MANUFACTURER_G1_COUNTRY_CODE
49. MANUFACTURER_G1_POSTAL_CODE
50. DATE_MANUFACTURER_RECEIVED
51. DEVICE_DATE_OF_MANUFACTURE
52. SINGLE_USE_FLAG
53. REMEDIAL_ACTION

54. PREVIOUS_USE_CODE
55. REMOVAL_CORRECTION_NUMBER
56. EVENT_TYPE
57. DISTRIBUTOR_NAME
58. DISTRIBUTOR_ADDRESS_1
59. DISTRIBUTOR_ADDRESS_2
60. DISTRIBUTOR_CITY
61. DISTRIBUTOR_STATE_CODE
62. DISTRIBUTOR_ZIP_CODE
63. DISTRIBUTOR_ZIP_CODE_EXT
64. REPORT_TO_MANUFACTURER
65. MANUFACTURER_NAME
66. MANUFACTURER_ADDRESS_1
67. MANUFACTURER_ADDRESS_2
68. MANUFACTURER_CITY
69. MANUFACTURER_STATE_CODE
70. MANUFACTURER_ZIP_CODE
71. MANUFACTURER_ZIP_CODE_EXT
72. MANUFACTURER_COUNTRY_CODE
73. MANUFACTURER_POSTAL_CODE
74. TYPE_OF_REPORT
75. SOURCE_TYPE
76. DATE_ADDED
77. DATE_CHANGED

3.2.1.2 foidevxxxx.zip (Device Data), where xxxx is the year, contains the following 45 fields, delimited by pipe (|), one record per line:

1. MDR Report Key
2. Device Event Key
3. Implant Flag -- D6, new added; 2006
4. Date Removed Flag -- D7, new added; 2006; if flag is M or Y, print Date
   U = Unknown
   A = Not available
   I = No information at this time
   M = Month and year provided only, day defaults to 01
   Y = Year provided only, day defaulted to 01, month defaulted to January
5. Device Sequence No -- from device report table
6. Date Received (from mdr_document table)

SECTION D
7. Brand Name (D1)
8. Generic Name (D2)
9. Manufacturer Name (D3)
10. Manufacturer Address 1 (D3)
11. Manufacturer Address 2 (D3)
12. Manufacturer City (D3)
13. Manufacturer State Code (D3)
14. Manufacturer Zip Code (D3)
15. Manufacturer Zip Code ext (D3)
16. Manufacturer Country Code (D3)

17. Manufacturer Postal Code (D3)
18. Expiration Date of Device (D4)
19. Model Number (D4)
20. Catalog Number (D4)
21. Lot Number (D4)
22. Other ID Number (D4)
23. Device Operator (D5)
24. Device Availability (D10)
    Y = Yes
    N = No
    R = Device was returned to manufacturer
    * = No answer provided
25. Date Returned to Manufacturer (D10)
26. Device Report Product Code
27. Device Age (F9)
28. Device Evaluated by Manufacturer (H3)
    Y = Yes
    N = No
    R = Device not returned to manufacturer
    * = No answer provided

BASELINE SECTION (for records prior to 2009)
29. Baseline brand name
30. Baseline generic name
31. Baseline model no
32. Baseline catalog no
33. Baseline other id no
34. Baseline device family
35. Baseline shelf life contained in label
    Y = Yes
    N = No
    A = Not applicable
    * = No answer provided
36. Baseline shelf life in months
37. Baseline PMA flag
38. Baseline PMA no
39. Baseline 510(k) flag
40. Baseline 510(k) no
41. Baseline preamendment
42. Baseline transitional
43. Baseline 510(k) exempt flag
44. Baseline date first marketed
45. Baseline date ceased marketing

3.2.1.3 foidevproblem (device problem data) contains the following 2 fields, delimited by pipe (|), one record per line:

1. MDR Report Key
2. Device Problem Code -- (F10) new added; 2006

3.2.1.4 DEVICEPROBLEMCODES contains the following 2 fields, delimited by pipe (|), one record per line:
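In the actual project the MDR Report Key join runs in HIVE; as a minimal local sketch of the same idea, coreutils `join` can merge two toy pipe-delimited files on that key (all keys and values below are made up):

```shell
#!/bin/sh
# Local sketch of joining device data to device problem data on MDR Report Key.
# Toy data; the real files have many more fields.
cat > dev.txt <<'EOF'
1000|BrandA
1001|BrandB
EOF

cat > prob.txt <<'EOF'
1000|2993
1001|1546
EOF

# join requires both inputs sorted on the join key (field 1).
sort -t'|' -k1,1 dev.txt > dev.sorted
sort -t'|' -k1,1 prob.txt > prob.sorted

# Output: key|brand|problem_code
join -t'|' dev.sorted prob.sorted > joined.txt
cat joined.txt
```

The same key could then be used to look up Problem Description in DEVICEPROBLEMCODES, which is what the HIVE integration step does at scale.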

1. Device Problem Code
2. Problem Description

3.3 R Shiny User Interface

The user interface is built using an R Shiny web application with a very simple, plain layout showcasing the visualization, classification tree and forecasted data. Screenshots are provided below to demonstrate the user interface.

Visualization

To visualize the number of events reported in each state, year wise, a thematic (choropleth) map is used. The shading in the map represents the range of events that occurred in each state.
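The per-state, per-year counts that drive the choropleth shading can be sketched with a quick awk aggregation over a toy extract. The column layout assumed here (state in field 2, year in field 3) is for illustration only:

```shell
#!/bin/sh
# Sketch: count events per state and year from a pipe-delimited extract.
# Toy records: key|state|year.
cat > events.txt <<'EOF'
1000|NJ|2014
1001|NJ|2014
1002|CA|2015
EOF

# Accumulate counts keyed on state|year, then sort for stable output.
awk -F'|' '{ c[$2 "|" $3]++ } END { for (k in c) print k "|" c[k] }' \
    events.txt | sort > counts.txt

cat counts.txt
```

In the project itself this aggregation happens in HIVE/R before the counts are handed to the Shiny map.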

Forecasting

4. Reusable Components

Shell Scripting
To crawl MAUDE data from the FDA site and load the data into the Hadoop environment.

Big Data Storage and Preprocessing
HDFS: To store the raw data that was downloaded, and the preprocessed data that is integrated and used for reporting as well as model building.
HIVE: HIVE tables were created corresponding to the master data, device data, device problem codes data and device problem data. Using the common key, the tables have been merged together to form the final datasets used for model building. In the process of preparing the final datasets, custom UDFs were also created in HIVE to get the data into the required format.
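A minimal sketch of how one of the HIVE tables over the pipe-delimited HDFS files might be declared is shown below. The table name, column subset and HDFS path are hypothetical; on the cluster the generated file would be submitted with `hive -f`:

```shell
#!/bin/sh
# Generate an illustrative HIVE DDL file for an external table over the
# raw pipe-delimited data. Names and path are assumptions, not the
# project's actual schema.
cat > create_tables.hql <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS mdr_master (
  mdr_report_key STRING,
  event_key STRING,
  adverse_event_flag STRING,
  event_type STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/maude/mdr_master';
EOF

cat create_tables.hql
# On the cluster (not run here): hive -f create_tables.hql
```

Declaring the table EXTERNAL keeps the raw HDFS files intact if the table is dropped, which suits a reusable ingestion pipeline.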

Analytics
R: For building the classification tree and forecasting model.
R Web Interface using Shiny: To create an interactive, lightweight web application accessible on intranet / corporate LANs.