Design Document. Version 1.1. (A Project sponsored by HadoopExpress.com, a Net Serpents enterprise) All rights reserved


Jeremy Little

Design Document
Product Intelligence on Medical Devices
Version 1.1
(A Project sponsored by HadoopExpress.com, a Net Serpents enterprise)

Copyright 2016 Net Serpents LLC, 2001 Route 46, Parsippany NJ. All rights reserved.
This document contains material protected under international and federal copyright laws and treaties. Any unauthorized reprint or use of material is strictly prohibited.

Contents
1. Introduction
1.1 Scope
1.2 Overview
2. General Description
2.1 POC Perspective
2.2 Tools Used
2.3 General Constraints
2.4 Assumption
3. Design Details
3.1 Technology Architecture
3.2 Database Schema
3.3 R Shiny User Interface
4. Reusable Components

1. Introduction

1.1 Scope

This design document gives an overview of the architectural design and the prediction model to be implemented for product intelligence on medical devices, using the MAUDE (Manufacturer and User Facility Device Experience) data available on the FDA site. It also shows how big data components are used to store the raw data, preprocess it and build a predictive model. Product Intelligence on Medical Devices shall also support informed decision making, based on the forecasting implemented as part of this scope.

2. General Description

2.1 Proof of Concept Perspective

The datasets present in the local file system are moved to HDFS. Preprocessing, filtering and cleaning of the data are done through shell scripts on the files in HDFS, and the cleaned datasets are moved to Hive. The final dataset is prepared in Hive and comprises the following columns:

MDR Report Key
Device Event Key
Brand Name
Generic Name
Manufacturer Name
Manufacturer City
Manufacturer State Code
Manufacturer Country Code
Problem Description
Number of Devices in Event
Number of Patients in Event
Adverse Event Flag
Date of Event
Event Location
Remedial Action
Event Type

2.2 Big Data Components Used

1. Apache Hadoop Distributed File System (HDFS), a Java-based file system that provides scalable and reliable data storage.
2. Apache Hive, a query interface on top of Apache Hadoop with an SQL-like query language.
3. RStudio, an open source tool for statistical computing and graphics with R.

2.3 General Constraints

The process is to be automated and made as scalable as possible so that it can be reused.

2.4 Assumption

The project is based on the idea of post-market surveillance of medical devices, and the goal is to make this a reality using big data and analytics capabilities.

3. Design Details

3.1 Technology Architecture

The MAUDE data from the FDA site is downloaded to the local file system using shell scripts and moved onto HDFS. Once the raw data is available in HDFS, Hive tables are created on top of the HDFS files and the data is loaded into them. The tables are integrated in Hive, along with filtering and preprocessing of the data in all tables. Once the final processed data is available in Hive, the final table is brought into R using R Streaming for model building and visualization.

3.2 Database Schema

A total of four datasets are downloaded, preprocessed and stored in Hive.

Notes:
foidevxxxx relates to device data for year xxxx.
mdrfoithru2015 relates to master event data.
The final data table is to be derived from the above two plus foidevproblem and deviceproblemcodes.
MDR Report Key should be used to join the tables.

Source Files

MDRFOITHRU2015: This file contains the following master event data fields, in the order presented. The bold columns are to be extracted.

1. MDR_REPORT_KEY
2. EVENT_KEY

3. REPORT_NUMBER
4. REPORT_SOURCE_CODE
5. MANUFACTURER_LINK_FLAG_
6. NUMBER_DEVICES_IN_EVENT
7. NUMBER_PATIENTS_IN_EVENT
8. DATE_RECEIVED
9. ADVERSE_EVENT_FLAG
10. PRODUCT_PROBLEM_FLAG
11. DATE_REPORT
12. DATE_OF_EVENT
13. REPROCESSED_AND_REUSED_FLAG
14. REPORTER_OCCUPATION_CODE
15. HEALTH_PROFESSIONAL
16. INITIAL_REPORT_TO_FDA
17. DATE_FACILITY_AWARE
18. REPORT_DATE
19. REPORT_TO_FDA
20. DATE_REPORT_TO_FDA
21. EVENT_LOCATION
22. DATE_REPORT_TO_MANUFACTURER
23. MANUFACTURER_CONTACT_T_NAME
24. MANUFACTURER_CONTACT_F_NAME
25. MANUFACTURER_CONTACT_L_NAME
26. MANUFACTURER_CONTACT_STREET_1
27. MANUFACTURER_CONTACT_STREET_2
28. MANUFACTURER_CONTACT_CITY
29. MANUFACTURER_CONTACT_STATE
30. MANUFACTURER_CONTACT_ZIP_CODE
31. MANUFACTURER_CONTACT_ZIP_EXT
32. MANUFACTURER_CONTACT_COUNTRY
33. MANUFACTURER_CONTACT_POSTAL
34. MANUFACTURER_CONTACT_AREA_CODE
35. MANUFACTURER_CONTACT_EXCHANGE
36. MANUFACTURER_CONTACT_PHONE_NO
37. MANUFACTURER_CONTACT_EXTENSION
38. MANUFACTURER_CONTACT_PCOUNTRY
39. MANUFACTURER_CONTACT_PCITY
40. MANUFACTURER_CONTACT_PLOCAL
41. MANUFACTURER_G1_NAME
42. MANUFACTURER_G1_STREET_1
43. MANUFACTURER_G1_STREET_2
44. MANUFACTURER_G1_CITY
45. MANUFACTURER_G1_STATE_CODE
46. MANUFACTURER_G1_ZIP_CODE
47. MANUFACTURER_G1_ZIP_CODE_EXT
48. MANUFACTURER_G1_COUNTRY_CODE
49. MANUFACTURER_G1_POSTAL_CODE
50. DATE_MANUFACTURER_RECEIVED
51. DEVICE_DATE_OF_MANUFACTURE
52. SINGLE_USE_FLAG
53. REMEDIAL_ACTION

54. PREVIOUS_USE_CODE
55. REMOVAL_CORRECTION_NUMBER
56. EVENT_TYPE
57. DISTRIBUTOR_NAME
58. DISTRIBUTOR_ADDRESS_1
59. DISTRIBUTOR_ADDRESS_2
60. DISTRIBUTOR_CITY
61. DISTRIBUTOR_STATE_CODE
62. DISTRIBUTOR_ZIP_CODE
63. DISTRIBUTOR_ZIP_CODE_EXT
64. REPORT_TO_MANUFACTURER
65. MANUFACTURER_NAME
66. MANUFACTURER_ADDRESS_1
67. MANUFACTURER_ADDRESS_2
68. MANUFACTURER_CITY
69. MANUFACTURER_STATE_CODE
70. MANUFACTURER_ZIP_CODE
71. MANUFACTURER_ZIP_CODE_EXT
72. MANUFACTURER_COUNTRY_CODE
73. MANUFACTURER_POSTAL_CODE
74. TYPE_OF_REPORT
75. SOURCE_TYPE
76. DATE_ADDED
77. DATE_CHANGED

foidevxxxx.zip (Device Data), where xxxx is the year, contains the following 45 fields, delimited by pipe (|), one record per line:

1. MDR Report Key
2. Device Event Key
3. Implant Flag -- D6, newly added
4. Date Removed Flag -- D7, newly added; 2006; if flag is M or Y, print date
   U = Unknown
   A = Not available
   I = No information at this time
   M = Month and year provided only, day defaults to 01
   Y = Year provided only, day defaults to 01, month defaults to January
5. Device Sequence No -- from device report table
6. Date Received (from mdr_document table)

SECTION-D
7. Brand Name (D1)
8. Generic Name (D2)
9. Manufacturer Name (D3)
10. Manufacturer Address 1 (D3)
11. Manufacturer Address 2 (D3)
12. Manufacturer City (D3)
13. Manufacturer State Code (D3)
14. Manufacturer Zip Code (D3)
15. Manufacturer Zip Code ext (D3)
16. Manufacturer Country Code (D3)

17. Manufacturer Postal Code (D3)
18. Expiration Date of Device (D4)
19. Model Number (D4)
20. Catalog Number (D4)
21. Lot Number (D4)
22. Other ID Number (D4)
23. Device Operator (D5)
24. Device Availability (D10)
    Y = Yes
    N = No
    R = Device was returned to manufacturer
    * = No answer provided
25. Date Returned to Manufacturer (D10)
26. Device Report Product Code
27. Device Age (F9)
28. Device Evaluated by Manufacturer (H3)
    Y = Yes
    N = No
    R = Device not returned to manufacturer
    * = No answer provided

BASELINE SECTION (for records prior to 2009)
29. Baseline brand name
30. Baseline generic name
31. Baseline model no
32. Baseline catalog no
33. Baseline other id no
34. Baseline device family
35. Baseline shelf life contained in label
    Y = Yes
    N = No
    A = Not applicable
    * = No answer provided
36. Baseline shelf life in months
37. Baseline PMA flag
38. Baseline PMA no
39. Baseline 510(k) flag
40. Baseline 510(k) no
41. Baseline preamendment
42. Baseline transitional
43. Baseline 510(k) exempt flag
44. Baseline date first marketed
45. Baseline date ceased marketing

foidevproblem (device problem data) contains the following 2 fields, delimited by pipe (|), one record per line:

1. MDR Report Key
2. Device Problem Code -- (F10) newly added

DEVICEPROBLEMCODES contains the following 2 fields, delimited by pipe (|), one record per line:

1. Device Problem Code
2. Problem Description

3.3 R Shiny User Interface

The user interface is built as an R Shiny web application with a simple, plain layout that shows the visualization, the classification tree and the forecasted data. Screenshots are provided below to demonstrate the user interface.

Visualization

To visualize the number of events reported in each state, year by year, a thematic map (choropleth) is used. The shading in the map represents the range in which the number of events for each state falls.
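The data preparation behind such a map can be sketched as follows. This is a minimal illustration in Python, not the project's R Shiny code: it counts events per state and year and assigns each count to a shading range. The record keys `state` and `date`, the `MM/DD/YYYY` date format and the break values are all assumptions made for the example.

```python
from collections import Counter


def events_per_state_year(records):
    """Count events per (state, year).

    Each record is a dict with hypothetical keys 'state' and 'date';
    the date is assumed to end in a four-digit year (e.g. MM/DD/YYYY).
    """
    counts = Counter()
    for record in records:
        state = record.get("state", "").strip().upper()
        date = record.get("date", "").strip()
        if not state or len(date) < 4:
            continue  # skip records without a usable state or date
        counts[(state, date[-4:])] += 1
    return counts


def shade_bin(count, breaks):
    """Return the index of the shading range that `count` falls into.

    `breaks` holds ascending, inclusive upper bounds; counts above the
    last bound fall into one extra, darkest range.
    """
    for index, upper in enumerate(breaks):
        if count <= upper:
            return index
    return len(breaks)


# Made-up sample records, for illustration only.
sample = [
    {"state": "NJ", "date": "01/15/2014"},
    {"state": "nj", "date": "03/02/2014"},
    {"state": "CA", "date": "07/04/2015"},
    {"state": "", "date": "07/04/2015"},
]
counts = events_per_state_year(sample)
shades = {key: shade_bin(n, [1, 10, 100]) for key, n in counts.items()}
```

Each (state, year) pair ends up with a range index that a mapping layer could translate into a colour; the number of ranges and their bounds would be chosen from the actual data.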

Forecasting

4. Reusable Components

Shell Scripting
To crawl the MAUDE data from the FDA site and load the data into the Hadoop environment.

Big Data Storage and Preprocessing
HDFS: To store the raw data that is downloaded and the preprocessed data that is integrated and used for reporting as well as model building.
Hive: Hive tables are created for the master data, device data, device problem codes data and device problem data. Using the common key, the tables are merged to form the final datasets used for model building. In the process of preparing the final datasets, custom UDFs are also created in Hive to get the data into the required format.
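The merge on the common key can be illustrated with a small sketch. It is written in Python rather than Hive purely for illustration and is not the project's actual query. It assumes pipe-delimited input as described in section 3.2, uses a reduced set of columns, and treats the join type as an assumption: device records without a master record are dropped, and records without a problem code are kept with an empty description. The sample values are made up.

```python
import csv
import io


def read_pipe(text, columns):
    """Parse pipe-delimited text into a list of dicts keyed by `columns`."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


def build_final(master, devices, dev_problems, problem_codes):
    """Join the four datasets on MDR_REPORT_KEY and the device problem code."""
    master_by_key = {m["MDR_REPORT_KEY"]: m for m in master}
    desc_by_code = {
        p["DEVICE_PROBLEM_CODE"]: p["PROBLEM_DESCRIPTION"] for p in problem_codes
    }
    codes_by_key = {}
    for dp in dev_problems:
        codes_by_key.setdefault(dp["MDR_REPORT_KEY"], []).append(
            dp["DEVICE_PROBLEM_CODE"]
        )

    final = []
    for device in devices:
        key = device["MDR_REPORT_KEY"]
        event = master_by_key.get(key)
        if event is None:
            continue  # assumption: drop devices with no master event record
        # One output row per problem code; a single row if there is none.
        for code in codes_by_key.get(key, [None]):
            row = {**event, **device}
            row["PROBLEM_DESCRIPTION"] = desc_by_code.get(code, "")
            final.append(row)
    return final


# Made-up sample data with a reduced column set, for illustration only.
master = read_pipe("100|M\n101|D\n", ["MDR_REPORT_KEY", "EVENT_TYPE"])
devices = read_pipe(
    "100|PumpX\n101|StentY\n102|Orphan\n", ["MDR_REPORT_KEY", "BRAND_NAME"]
)
dev_problems = read_pipe(
    "100|1069\n100|2993\n", ["MDR_REPORT_KEY", "DEVICE_PROBLEM_CODE"]
)
problem_codes = read_pipe(
    "1069|Break\n2993|Leak\n", ["DEVICE_PROBLEM_CODE", "PROBLEM_DESCRIPTION"]
)
rows = build_final(master, devices, dev_problems, problem_codes)
```

In Hive the same result would come from joining the tables on the MDR Report Key and the device problem code; the dictionary lookups here only mimic that behaviour on in-memory data.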

Analytics
R: For building a classification tree and a forecasting model.
R Web Interface using Shiny: To create an interactive, lightweight web application that can be accessed on intranets / corporate LANs.
