
Towards Computational Extraction of Potential Drug-Drug Interaction Information from Drug Product Labeling Tables

Steven T. DeMarco 1, Nikola Milošević, MSc 2, Joshua Le 1, Jodi Schneider, PhD 1, Richard D. Boyce, PhD 1
1 University of Pittsburgh, Pittsburgh, PA, USA; 2 University of Manchester, Manchester, UK

Abstract

Structured Product Labels (SPLs) are mandated to provide information about potential drug-drug interactions (PDDIs). A major limitation of SPLs is that the information is provided as unstructured text and tables. Extracting, storing, processing, and annotating this information into an indexed knowledge base will allow increased accessibility to important prescribing information. In this paper we report on an analysis of the feasibility of automatically extracting PDDI information from tables found within the Drug Interactions section of SPLs. 1,161 SPLs (3.9% of prescription labeling) had a total of 1,530 tables. These tables had 340 headers that we grouped into 8 categories. Functional and structural analyses, and MetaMap annotation, were completed for 50% of the 1,530 tables. We observed that most tables were diversely structured, and that the most frequent semantic types aligned well with the 8 table header categories. The results provide a starting point for developing heuristics to extract PDDIs.

Introduction

Many thousands of people are harmed each year by exposure to two or more drugs for which there exists a known potential interaction [1]. Unfortunately, there is currently no single complete source of information for these potential drug-drug interactions (PDDIs) [2]. In the United States, one important information source is drug product labeling, which is required by law to contain information regarding clinically significant interactions (cf. CFR (c)) [3].
All drug product labels in the United States are freely available through the National Library of Medicine's DailyMed website [4] in a standard format called Structured Product Labeling (SPL) [5]. While easy to access, a major limitation of SPLs is that information regarding PDDIs is provided as unstructured text and diversely formatted tables. A way to computationally extract PDDI information from the label into an indexed knowledge base would enable more convenient access to this information. Moreover, PDDI information extracted from the SPL could be more easily linked to other sources of information to provide a more complete picture of the mechanisms, risk factors, clinical implications, and management options of each PDDI.

In this paper we report on an analysis of the feasibility of automatically extracting PDDI information from tables found within the Drug Interactions section of SPLs. In the remainder of this paper, we introduce background on text mining of tables and drug-related information, describe the methods for our analysis of SPL tables, and present the analysis results and conclusions.

Background

Text mining of information within published tables is an active area of computer science research. Hurst was among the first to propose a text mining model of tables containing five elements: graphical, physical, functional, structural and semantic [6]. Research has focused on detecting tables in various types of documents and assigning functional areas (or conducting functional analysis) of tables. Functional analysis aims to classify table cells by their function and can be carried out using heuristic rules or machine learning algorithms. After the functional areas are found, the relationships between cells can be resolved during structural analysis, which is usually done by applying heuristic rules [6,13,15].

Applying text mining to biomedical tables is an emerging area. To extract gene mutation names, Wong et al. scraped PubMed's stand-alone HTML table documents (avoiding the need for table detection) and used heuristics to detect and classify table headers [16]. They then used the Mismatch Repair Database as a gold standard with a machine-learning-based Named Entity Recognition system to identify which table vector contained at least one mutation mention. The BioTextSearch project indexed tables from PubMedCentral, storing complete tables and images in an Apache Lucene-based search engine; their table search heavily weighted captions and headers to improve table retrieval compared to text search [17,18].

Within the drug informatics domain, the SPLICER system [19] was successfully applied to extracting adverse drug events from tables and text written in the Adverse Reactions section of SPLs. Other efforts focused on side effects and drug indications. For example, the SIDER database uses named entity recognition to extract side effects and indications from product labeling, including SPLs [23]. More recently, starting with full-text papers from the Journal of Oncology, Xu & Wang extracted drug-side effect relationships, which they compared to the SIDER database. They used Support Vector Machines to classify tables as side-effect-related or not, and then used a dictionary-based approach to extract drugs and side effects based on manually curated lexicons [24].

Methods

Descriptive Analysis of Table Content

To develop a thorough understanding of the distribution of rows, columns and table headers within SPL tables, we first conducted a descriptive analysis. We wrote a Python script [25] to automatically download the Drug Interaction section for all US SPLs from the LinkedSPLs [26] resource, and to separate HTML tables from unstructured text content. All extracted HTML tables were merged into a single HTML page that we could view in a web browser to get an overall sense of the kinds of information provided by the tables. A second Python script extracted all table headers, which we then manually organized into distinct categories. These categories were then used to describe the distribution of categorical assignments given to unique table headers.

Table Content Extraction and Storage for Convenient Access

The descriptive analysis made it apparent that tables with similar table headers contained similar information. This information regularly included drug mentions, clinical recommendations, and dosing instructions or adjustments. The next step was to enable computational access to data provided in the individual cells of each table. Doing so would enable semantic annotation of the table content, which could then be used to extract PDDI information.
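As an illustration of what computational access to individual cells can look like, the following is a minimal sketch that uses only the Python standard library. It is not the script used in this work: the class and function names are hypothetical, nested tables are not handled, and it assumes that header cells are marked with <th> or enclosed in <thead>.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect every <table> as a list of rows of (text, is_header) pairs."""

    def __init__(self):
        super().__init__()
        self.tables = []         # one list of rows per table
        self._row = None         # row currently being filled
        self._cell = None        # text fragments of the current cell
        self._is_header = False  # header flag of the current cell
        self._in_thead = False   # True while inside <thead>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "thead":
            self._in_thead = True
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            # Assumption: <th> cells and cells inside <thead> are headers.
            self._is_header = tag == "th" or self._in_thead

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            text = " ".join("".join(self._cell).split())
            self._row.append((text, self._is_header))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def header_texts(table):
    """Return the text of every header cell in one extracted table."""
    return [text for row in table for text, is_header in row if is_header]
```

Calling header_texts on each extracted table yields the kind of header list that can then be grouped into categories by hand.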
We modified an existing software tool called Table Annotator [27] to extract the contents of each SPL table into its relational schema, which is designed to store each table, cell, and semantic annotation. The schema also supports storing the results of the functional and structural analyses over the table data. After modifying Table Annotator, we downloaded SPLs for all prescription drug products as of January 1, 2016 from DailyMed. The full set of SPLs was reduced to the subset that our descriptive analysis identified as having table content in the Drug Interaction section. These SPLs were used as input to Table Annotator, which parsed and analyzed the table content, assigned functional roles and structural relationships to individual cells, and annotated the contents of each cell.

Functional Analysis

A functional analysis determines each cell's function within each table. Cells are identified as a table header, row header, super-row, or data cell. Table Annotator implements an algorithm that determines these cell functions based on a set of heuristic rules [27]. Briefly, table headers are typically found within the first row of the table and are identified by the <thead> tag. These table headers provide insight into the data contained within the corresponding column. Similarly, each row header classifies the information contained in the corresponding row. Some row headers are further classified as super-rows, which group related rows together. Each data cell corresponds to an individual data item that has a relationship with other functional groups, such as table headers and row headers. An example of the output of a typical functional analysis can be seen in Figure 1.
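The cell roles described above can be illustrated with a small sketch. This is not Table Annotator's algorithm: the rules below are simplifying assumptions made for illustration (the first column of the body holds row headers, and a body row whose only filled cell is the first one is a super-row), and the function names are hypothetical.

```python
from collections import Counter


def classify_cells(header_rows, body_rows):
    """Map (row, column) positions to a set of functional roles.

    header_rows: rows found inside <thead>; body_rows: all remaining rows.
    Each row is a list of cell strings. Empty cells receive no role.
    """
    roles = {}
    row_index = 0
    for row in header_rows:
        for col, text in enumerate(row):
            if text.strip():
                roles[(row_index, col)] = {"table_header"}
        row_index += 1
    for row in body_rows:
        filled = [col for col, text in enumerate(row) if text.strip()]
        if len(row) > 1 and filled == [0]:
            # Assumed rule: a lone first cell groups the rows that follow.
            # A super-row is also a row header, so it carries two roles.
            roles[(row_index, 0)] = {"row_header", "super_row"}
        else:
            for col in filled:
                role = "row_header" if col == 0 else "data_cell"
                roles[(row_index, col)] = {role}
        row_index += 1
    return roles


def count_roles(roles):
    """Count role assignments across all classified cells."""
    return Counter(role for cell_roles in roles.values() for role in cell_roles)
```

Because a super-row cell is counted both as a row header and as a super-row, the number of role assignments can exceed the number of classified cells.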

Figure 1. Functional analysis on the table Efficacy by Disease Site [28].

Structural Analysis

A structural analysis determines the relationships between cells. Tables fall into four types based on their dimensionality and structure, and each type requires a different parsing strategy. The four table types are list tables, matrix tables, super-row tables and multi-tables.

List tables are one-dimensional tables that present a simple list of data cells. These data cells are involved in a relationship only with the header above them.

Matrix tables are simple two-dimensional tables that contain data cells within a matrix. These data cells are involved in relationships with their corresponding table header and row header.

Super-row tables are multi-dimensional tables, where dimensions are defined by the number of table headers, row headers and super-rows. Data cells within this matrix are involved in relationships with the table headers above the corresponding cells, as well as the corresponding row headers and super-row headers.

Multi-tables comprise multiple tables merged together. These tables can be a challenge during computational analysis because only table headers within the first row are formally labeled with a <thead> tag, while the rest of the headers are recognized as cells between rows containing a horizontal line. An example of a table containing problematic headers can be seen in Figure 2.
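The relationships described for list, matrix and super-row tables can be sketched as follows. This is an illustrative simplification rather than the rules implemented in Table Annotator: the function names are hypothetical, a single header row is assumed, a super-row is assumed to be a body row whose only filled cell is the first one, and multi-tables are not handled.

```python
def resolve_relationships(header, body):
    """Link each data cell to the headers it depends on.

    header: list of column header strings (a single header row is assumed).
    body: list of rows, each a list of cell strings.
    Returns one record per non-empty data cell.
    """
    is_list = len(header) == 1
    records = []
    super_row = None
    for row in body:
        cells = [text.strip() for text in row]
        filled = [text for text in cells if text]
        if not filled:
            continue
        if not is_list and len(filled) == 1 and cells[0]:
            # Assumed rule: a lone first cell starts a new group of rows.
            super_row = cells[0]
            continue
        row_header = None if is_list else cells[0]
        first_data_col = 0 if is_list else 1
        for col in range(first_data_col, len(cells)):
            if cells[col]:
                records.append({
                    "value": cells[col],
                    "column_header": header[col] if col < len(header) else None,
                    "row_header": row_header,
                    "super_row": super_row,
                })
    return records


def table_type(header, records):
    """Name the table type implied by the resolved relationships."""
    if len(header) == 1:
        return "list table"
    if any(record["super_row"] for record in records):
        return "super-row table"
    return "matrix table"
```

Each record carries everything needed to interpret a data cell on its own, which is why super-row tables need one more lookup than matrix tables.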

Figure 2. Drug-Thyroidal Axis Interactions is an example of a table that would be classified as a multi-table [29].

Annotation of Cell Content

After Table Annotator performed a functional and structural analysis of the input SPLs, the system was used to annotate the cell content of each loaded table. Internally, Table Annotator uses Marvin, a semantic text annotation tool, to annotate cell content [30]. We configured Marvin to use the Unified Medical Language System's (UMLS) MetaMap program to identify named entities within the table cells [31,32]. Table Annotator stored the MetaMap annotations as Concept Unique Identifiers (CUIs) linked to data from specific table cells [33]. The UMLS Semantic Network provides a semantic type for each CUI, such as Pharmacologic Substance, Clinical Attribute or Therapeutic or Preventative Procedure. We queried the Table Annotator schema for the CUIs associated with each MetaMap annotation, and then queried our local implementation of the UMLS database (2014AB) to retrieve the corresponding semantic types and describe their distribution.

Results

Descriptive Analysis of Table Content

Tables 1 and 2 summarize our descriptive analysis of table content. Out of 29,964 prescription drug SPLs, only 1,161 (3.9%) had tables present in the Drug Interaction section. These 1,161 SPLs included a total of 1,530 tables. From these tables, we identified 340 unique table headers, which we then grouped into 8 categories. We computationally identified the most prominent categories as Drug Class or Drug Name, Effect on Drug, and Recommendation or Comment.

Table 1. Basic statistics for 1,530 HTML tables.

Property                      Count
Number of Tables              1,530
Number of Distinct Headers    340
Number of Categories          8

Property    Minimum    Median    Mean    Maximum
Columns
Rows

Table 2. Number of distinct headers per category for 1,530 HTML tables.

Category Name                       Count    Proportion of Unique Headers
Drug Class or Drug Name             101      29.71%
Effect on Drug                      38       11.
Interaction Properties
Interacting Substance
Interacting Substance Properties
Miscellaneous
Recommendation or Comment
Sample Size

Table Content Extraction and Storage for Convenient Access

Because of technical difficulties (described below) that time constraints prevented us from addressing, Table Annotator loaded and completed the functional, structural, and annotation tasks on only 772 (50%) of the 1,530 tables. The results below apply to that subset of the tables.

Functional Analysis

In the functional analysis, Table Annotator identified 35,407 cells, which were classified into 43,179 cell types. The classifications were distributed as 7,558 table headers, 11,790 row headers, 20,256 data cells and 3,575 super-rows. Row headers can be further classified as super-rows (see Figure 1), which accounts for the offset between the number of cells and the number of cell types.

Structural Analysis

In the structural analysis, Table Annotator classified the 772 tables as 298 matrix tables and 474 super-row tables.

Annotation of Cell Content

The total number of distinct headers was 179, which fell into eight categories. As with the full set of tables, the most prominent categories for these table headers were Drug Class or Drug Name, Effect on Drug, and Recommendation or Comment. Figure 3 and Table 3 show an example of MetaMap-annotated content. In total, MetaMap identified 332,504 named entities, which mapped to 107 semantic types. The two most frequent semantic types were Pharmacologic Substance and Organic Chemical. Tables 4 to 6 summarize the results of the Table Annotator processing for the tables for which MetaMap annotation was completed.

Figure 3. An example of MetaMap-annotated content (underlined). This row is an excerpt from Established and Other Potentially Significant Drug Interactions [34]. Each of these annotations has a corresponding semantic type that is outlined in Table 3.

Table 3. The MetaMap annotations for the table excerpt shown in Figure 3. Each annotation can belong to multiple semantic types. Drugs or drug classes corresponded to Pharmacologic Substance and Organic Chemical semantic types, whereas content within the clinical recommendation corresponded to Functional, Quantitative or Qualitative Concept semantic types.

MetaMap Annotation              Corresponding Semantic Type(s)
corticosteroid (C , C )         Pharmacologic Substance (T121), Hormone (T125), Organic Chemical (T109)
dexamethasone (C )              Pharmacologic Substance (T121), Organic Chemical (T109)
delavirdine (C )                Pharmacologic Substance (T121), Organic Chemical (T109)
use with caution (C )           Functional Concept (T169)
rescriptor (C )                 Pharmacologic Substance (T121), Organic Chemical (T109)
less (C , C )                   Quantitative Concept (T081), Qualitative Concept (T080)
effective (C , C )              Quantitative Concept (T081), Qualitative Concept (T080)
due to (C )                     Functional Concept (T169)
decreased (C , C , C )          Quantitative Concept (T081), Qualitative Concept (T080)
patients (C )                   Patient or Disabled Group (T101)
take (C )                       Health Care Activity (T058)
agent (C , C , C )              Chemical Viewed Functionally (T120), Pharmacologic Substance (T121), Intellectual Product (T170)
concentrations plasma (C )      Quantitative Concept (T081)

Table 4. Basic statistics for the 772 tables processed by MetaMap named entity recognition with Table Annotator.

Property                      Count
Number of Tables              772
Number of Distinct Headers    179
Number of Categories          8

Property               Minimum    Median    Mean    Maximum
Columns
Rows
MetaMap Annotations
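A distribution of semantic types such as the one reported in Table 6 amounts to a tally over annotations like those in Table 3. The sketch below shows that tally in Python. The mapping is transcribed from Table 3; the function name and the data layout are hypothetical, and the actual analysis retrieved semantic types from a local UMLS database rather than from a hard-coded dictionary.

```python
from collections import Counter

# Semantic types per annotation, transcribed from Table 3.
TABLE_3_ANNOTATIONS = {
    "corticosteroid": ["Pharmacologic Substance (T121)", "Hormone (T125)",
                       "Organic Chemical (T109)"],
    "dexamethasone": ["Pharmacologic Substance (T121)", "Organic Chemical (T109)"],
    "delavirdine": ["Pharmacologic Substance (T121)", "Organic Chemical (T109)"],
    "use with caution": ["Functional Concept (T169)"],
    "rescriptor": ["Pharmacologic Substance (T121)", "Organic Chemical (T109)"],
    "less": ["Quantitative Concept (T081)", "Qualitative Concept (T080)"],
    "effective": ["Quantitative Concept (T081)", "Qualitative Concept (T080)"],
    "due to": ["Functional Concept (T169)"],
    "decreased": ["Quantitative Concept (T081)", "Qualitative Concept (T080)"],
    "patients": ["Patient or Disabled Group (T101)"],
    "take": ["Health Care Activity (T058)"],
    "agent": ["Chemical Viewed Functionally (T120)",
              "Pharmacologic Substance (T121)", "Intellectual Product (T170)"],
    "concentrations plasma": ["Quantitative Concept (T081)"],
}


def semantic_type_distribution(annotations):
    """Return (count, proportion) per semantic type over all assignments."""
    counts = Counter(
        semantic_type
        for types in annotations.values()
        for semantic_type in types
    )
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.items()}


distribution = semantic_type_distribution(TABLE_3_ANNOTATIONS)
```

For this single table row, Pharmacologic Substance is the most frequent semantic type, with 5 of the 23 assignments.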

Table 5. Number of distinct headers per category for the 772 tables processed by MetaMap named entity recognition with Table Annotator.

Category Name                       Count    Proportion of Unique Headers
Drug Class or Drug Name
Effect on Drug
Interaction Properties
Interacting Substance
Interacting Substance Properties
Miscellaneous
Recommendation or Comment
Sample Size

Table 6. Distribution of most frequent semantic types processed by MetaMap named entity recognition with Table Annotator.

Semantic Type                                    Total Count    Proportion of Total Annotations
Pharmacologic Substance (T121)                   60,
Organic Chemical (T109)                          38,
Qualitative Concept (T080)                       35,
Quantitative Concept (T081)                      34,
Functional Concept (T169)                        32,
Spatial Concept (T082)                           15,
Intellectual Product (T170)                      11,
Clinical Attribute (T201)                        10,
Finding (T033)                                   9,
Therapeutic or Preventative Procedure (T061)     9,

Conclusion

To the best of our knowledge, no other project has attempted to extract PDDI information from tables in SPLs. We found that Table Annotator could load and conduct a functional and structural analysis of all Drug Interaction section tables. With respect to semantic annotations, while we ran into difficulties applying MetaMap to the cell content of many of the tables, we think the problem will be addressable in future work. MetaMap has strict requirements for the strings that it receives for processing, and the raw cell content of SPL tables frequently violated these requirements, causing the MetaMap server to crash. Some additional preprocessing of the data sent to MetaMap from Table Annotator should fix this issue. Moving forward, we plan to complete an in-depth analysis of the causes of the MetaMap server crashes, preprocess the problematic SPLs, and fully process the 1,530 tables with Table Annotator.

The current analysis describes the functional and structural characteristics of a substantial subset of all tables in the Drug Interactions section. It also identified patterns of semantic types within the table cells for that subset. The structural analysis showed that the majority of tables analyzed were super-row tables, confirming our initial observation that many of these tables contain diverse and difficult-to-process formatting. Our functional analysis found that data cells were the most frequent cell type, with row headers as the second most common.

We observed that the most frequent semantic types aligned well with our categorical assignments for table headers. The most frequent table header categories were Drug Class or Drug Name, Effect on Drug, and Recommendation or Comment. The most frequent semantic types (Table 6) contain annotations that correspond to these categorical header assignments. For example, the Pharmacologic Substance and Organic Chemical semantic types correspond with the Drug Class or Drug Name table header category. The Clinical Attribute, Finding and Therapeutic or Preventative Procedure semantic types correspond with the Recommendation or Comment table header category.

Based on the results, we are confident that it is feasible to construct scalable rules for extracting PDDI information from tables found within the Drug Interactions section of SPLs. This can be done using table header categorical assignments and the semantic types of corresponding annotations. By grouping similar tables based upon the pattern of both the semantic types and the table headers, we can begin to extract the individual pieces of PDDI information. The first step would be to test whether the header categories for a table align with the semantic types from each cell annotation. This would confirm that the data cells beneath a table header contain information relevant to that header. The second step would be to test for features that indicate specific kinds of information within each of the columns defined by the table header categories. For example, if the cells within a column with a Drug Class or Drug Name header contain an entity tagged with the Pharmacologic Substance and Organic Chemical semantic types, then each entity is likely referring to a drug that interacts with the drug that the SPL is about. Similarly, if the cells within a column with the Effect on Drug header type include entities tagged with a Functional, Quantitative or Qualitative Concept semantic type, then the cell likely provides information on the clinical effect (e.g., effect direction, magnitude, or severity). Other heuristics can be imagined and, taken together, will serve as a starting point for developing heuristics to extract PDDIs from most of the diverse PDDI tables found in SPLs.
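The first step proposed above, testing whether a header category aligns with the semantic types of the cells beneath it, could be prototyped along the following lines. This is a sketch under stated assumptions, not an implemented or evaluated heuristic: the category-to-type mapping is taken from the correspondences named in the text, while the function names and the 0.5 threshold are hypothetical.

```python
# Semantic types expected under each header category, following the
# correspondences described in the text. Other categories are left out.
EXPECTED_TYPES = {
    "Drug Class or Drug Name": {"Pharmacologic Substance", "Organic Chemical"},
    "Effect on Drug": {"Functional Concept", "Quantitative Concept",
                       "Qualitative Concept"},
    "Recommendation or Comment": {"Clinical Attribute", "Finding",
                                  "Therapeutic or Preventative Procedure"},
}


def alignment_score(category, column_cells):
    """Fraction of cells carrying at least one expected semantic type.

    column_cells: one collection of semantic type names per data cell.
    Returns 0.0 for unknown categories or empty columns.
    """
    expected = EXPECTED_TYPES.get(category, set())
    if not expected or not column_cells:
        return 0.0
    hits = sum(1 for cell_types in column_cells if expected & set(cell_types))
    return hits / len(column_cells)


def column_aligns(category, column_cells, threshold=0.5):
    """Apply a hypothetical cut-off to the alignment score."""
    return alignment_score(category, column_cells) >= threshold
```

A column that passes this check would then be a candidate for the second step, where category-specific features are examined.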
Acknowledgements

This work was partially supported by the National Library of Medicine grant Addressing gaps in clinically useful evidence on drug-drug interactions (1R01LM ) and a training grant 5T15LM from the National Library of Medicine and National Institute of Dental and Craniofacial Research.

References

1. Magro L, Moretti U, Leone R. Epidemiology and characteristics of adverse drug reactions caused by drug-drug interactions. Expert Opin Drug Saf Jan;11(1):
2. Ayvaz S, Horn J, Hassanzadeh O, Zhu Q, Stan J, Tatonetti NP, et al. Toward a complete dataset of drug-drug interaction information from publicly available sources. J Biomed Inform Jun;55:
3. CFR - Code of Federal Regulations Title 21 [Internet]. Apr 1, Available from:
4. DailyMed - About [Internet]. [cited 2016 Mar 7]. Available from:
5. FDA. Indexing Structured Product Labeling [Internet]. Rockville, MD: Federal Drug Administration; 2008 Jun. Report No.: ucm Available from:
6. Hurst MF. The Interpretation of Tables in Texts [Internet] [PhD]. University of Edinburgh; 2000 [cited 2016 Mar 7]. Available from:
7. Thomas Kieninger AD. The T-Recs Table Recognition and Analysis System. In: Document Analysis Systems: Theory and Practice. Nagano, Japan: Springer; p.
8. Ng HT, Lim CY, Koo JLT. Learning to Recognize Tables in Free Text. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics; 1999 [cited 2016 Mar 7]. p. (ACL 99). Available from:
9. Son JW, Lee JA, Park SB, Song HJ, Lee SJ, Park SY. Discriminating Meaningful Web Tables from Decorative Tables Using a Composite Kernel. In: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, p.
10. Yildiz B, Kaiser K, Miksch S. pdf2table: A Method to Extract Table Information from PDF Files. In: Indian International Conference on Artificial Intelligence. Tumkur, Karnataka, India; p.
11. Hu J, Kashi RS, Lopresti DP, Wilfong G. Table structure recognition and its evaluation. In: Proc SPIE, Document Recognition and Retrieval VIII [Internet] [cited 2016 Mar 7]. p. Available from:
12. Chavan MM, Shirgave SK. A Methodology for Extracting Head Contents from Meaningful Tables in Web Pages. In: 2011 International Conference on Communication Systems and Network Technologies (CSNT) p.
13. Wei X, Croft B, Mccallum A. Table Extraction for Answer Retrieval. Inf Retr Nov;9(5):
14. Tengli A, Yang Y, Ma NL. Learning Table Extraction from Examples. In: COLING 04: Proceedings of the 20th International Conference on Computational Linguistics [Internet]. Stroudsburg, PA, USA: Association for Computational Linguistics; 2004 [cited 2016 Mar 7]. Available from:
15. Milosevic N, Gregson C, Hernandez R, Nenadic G. Extracting patient data from tables in clinical literature: Case study on extraction of BMI, weight and number of patients. In: Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies [Internet]. Rome, Italy; 2016 [cited 2016 Mar 7]. p. Available from: on_extraction_of_bmi_weight_and_number_of_patients
16. Wong W, Martinez D, Cavedon L. Extraction of named entities from tables in gene mutation literature. In: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing [Internet]. Association for Computational Linguistics; 2009 [cited 2016 Mar 9]. p. Available from:
17. Hearst MA, Divoli A, Guturu H, Ksikes A, Nakov P, Wooldridge MA, et al. BioText Search Engine: beyond abstract search. Bioinformatics Aug 15;23(16):
18. Divoli A, Wooldridge MA, Hearst MA. Full Text and Figure Display Improves Bioscience Literature Search. PLOS ONE Apr 14;5(4):e
19. Duke J, Friedlin J, Li X. Consistency in the safety labeling of bioequivalent medications. Pharmacoepidemiol Drug Saf Mar 1;22(3):
20. Fung KW, Jao CS, Demner-Fushman D. Extracting drug indication information from structured product labels using natural language processing. J Am Med Inform Assoc May 1;20(3):
21. Khare R, Li J, Lu Z. LabeledIn: Cataloging labeled indications for human drugs. J Biomed Inform Dec;52:
22. Boyce R, Gardner G, Harkema H. Using Natural Language Processing to Extract Drug-Drug Interaction Information from Package Inserts. In: BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing [Internet]. Montréal, Canada: Association for Computational Linguistics; p. Available from:
23. Kuhn M, Letunic I, Jensen LJ, Bork P. The SIDER database of drugs and side effects. Nucleic Acids Res Jan 4;44(Database issue):d
24. Xu R, Wang Q. Large-scale automatic extraction of side effects associated with targeted anticancer drugs from full-text oncological articles. J Biomed Inform Jun;55:
25. Python [Internet]. Python.org. [cited 2016 Mar 9]. Available from:
26. dbmi-pitt/ner-pddi-table-parsing [Internet]. GitHub. [cited 2016 Mar 10]. Available from:
27. TableAnnotator [Internet]. GitHub. [cited 2016 Mar 7]. Available from:
28. DailyMed - LETROZOLE- letrozole tablet [Internet]. [cited 2016 Mar 7]. Available from:
29. DailyMed - LEVOTHYROXINE SODIUM - levothyroxine sodium tablet [Internet]. [cited 2016 Mar 7]. Available from:
30. Milosevic N. Marvin: Semantic annotation using multiple knowledge sources. ArXiv Cs [Internet] Feb 1 [cited 2016 Mar 7]; Available from:
31. UMLS Quick Start Guide [Internet]. [cited 2016 Mar 7]. Available from:
32. MetaMap - A Tool For Recognizing UMLS Concepts in Text [Internet]. [cited 2016 Mar 7]. Available from:
33. McInnes BT, Pedersen T, Carlis J. Using UMLS Concept Unique Identifiers (CUIs) for Word Sense Disambiguation in the Biomedical Domain. AMIA Annu Symp Proc. 2007;2007:
34. DailyMed - RESCRIPTOR- delavirdine mesylate tablet [Internet]. [cited 2016 Mar 7]. Available from:
