Development of interactive statistical modules and workflows for exploration of diverse sets of crop phenotyping data


Southern Cross University Theses, 2017
Development of interactive statistical modules and workflows for exploration of diverse sets of crop phenotyping data
Sadaf Naz, Southern Cross University

Publication details: Naz, S 2017, 'Development of interactive statistical modules and workflows for exploration of diverse sets of crop phenotyping data', MSc thesis, Southern Cross University, Lismore, NSW. Copyright Crops For the Future (CFF) 2017.

epublications@scu is an electronic repository administered by Southern Cross University Library. Its goal is to capture and preserve the intellectual output of Southern Cross University authors and researchers, and to increase visibility and impact through open access to researchers around the world. For further information please contact epubs@scu.edu.au.

DEVELOPMENT OF INTERACTIVE STATISTICAL MODULES AND WORKFLOWS FOR EXPLORATION OF DIVERSE SETS OF CROP PHENOTYPING DATA
SADAF NAZ
Master's thesis
Supervisor: Dr. Abdul Baten
Co-supervisor: Prof. Graham King
September


Table of Contents

ABSTRACT
SUMMARY
CHAPTER 1: INTRODUCTION
  Role of data repositories for plant research and breeding
  Scope of existing databases relevant to crop genetics and pre-breeding R&D
  Plant genetic resources and description of experimental materials
  Phenotypic trait and trial data
  Data quality, meta-data and consistency
  Data access, querying, and retrieval
  Downstream analysis of crop genetics and related data sets
  Offline analysis tools
  Online analysis tools
  Summary and conclusion
  Research gaps
  Aims
  Objectives
  Rationale
  Experimental approach
CHAPTER 2: DATA PREPARATION
  Introduction
  InterStoreDB
  CropStoreDB
  Materials and methods
  Data processing issues and solutions
  Results
  Overview
  2.3.2 Object entity
  Technical Implementation
  Success measurement
  Discussion
CHAPTER 3: R SHINY APPLICATIONS
  Introduction
  Materials and Methods
  CS_PhenEXPLORER CropStoreDB trait phenotypic data exploration
  CS_PhenNAVIGATOR CropStoreDB trait phenotypic data navigation and distribution
  CS_DATACOMP CropStoreDB trait phenotypic data comparison
  CS_DATAVISAN CropStoreDB trait phenotypic data pivot table, visualisation and analysis
  Results
  CS_PhenEXPLORER
  Features of CS_PhenEXPLORER
  Functionalities of CS_PhenEXPLORER
  CS_PhenNAVIGATOR
  Features of CS_PhenNAVIGATOR
  Functionalities of CS_PhenNAVIGATOR
  CS_DATACOMP
  Features of CS_DATACOMP
  Functionalities of CS_DATACOMP
  CS_DATAVISAN
  Features of CS_DATAVISAN
  Functionalities of CS_DATAVISAN
  Analysis tool interface for Shiny apps integration
  Discussion
CHAPTER 4: CONCLUSIONS AND FUTURE WORK
  Conclusions
  Future work
REFERENCES
SUPPLEMENTARY MATERIALS

List of tables

Table 1.1: Genetic resources and description of experimental materials.
Table 1.2: Summary of a few of the current capabilities of statistical analysis methods for crop phenotypic data.
Table 2.1: Examples of crop genetics plant database resources, management and data exchange systems.
Table 2.2: Description of population data entities from CropStoreDB. Source: [URL 1]
Table 2.3: Description of trait data entities from CropStoreDB. Source: [URL 1]
Table 2.4: Number of retrieved rows, and table and index creation time, to get the PhenDATA table.
Table 2.5: Other data preparation tasks, affected rows and processing time.
Table 3.1: A list of crop plant R&D databases that offer phenotypic data analysis tools.
Table 3.2: Explanation of concepts frequently used to build Shiny applications.
Table 3.3: A snapshot of the first 10 entries of selected columns of the PhenDATA table fetched from the MySQL CropStoreDB database [URL 1]. Source code: S1.10
Table 3.4: Master table data description for each of the design factors country, plant_population, project_descriptor, species, trial_year, and the variate descriptor_name. Note: CHN, GBR and PRT are China, Great Britain and Portugal respectively.
Table 3.5: List of user interactive features of CS_PhenEXPLORER provided as additional inputs.
Table 3.6: List of user interactive features of the Graph options tab of CS_PhenEXPLORER.
Table 3.7: List of controls, save plot and data download features of CS_PhenEXPLORER.
Table 3.8: List of controls, save plot and data download features of CS_PhenNAVIGATOR.
Table 3.9: List of controls, save plot and data download features of CS_DATACOMP.
Table 3.10: List of controls, save plot and data download features of CS_DATAVISAN.

List of figures

Figure 2.1: Generic data flow framework for developing a crop genetics database at the local level and sharing the data at the global level. Redrawn from Ningthoujam et al. (2012). Acronyms used for global repositories associated with plant DBs: EBI, European Bioinformatics Institute; NCBI, National Center for Biotechnology Information; TAIR, The Arabidopsis Information Resource; GBIF, Global Biodiversity Information Facility; NBII, National Biological Information Infrastructure.

Figure 2.2: InterStoreDB databases schema. Source: [URL 10]

Figure 2.3: System architecture of InterStoreDB with three core databases, showing the division of captured and associated meta-data along with additional interfaces providing navigable links between databases. Source: (Love et al. 2012)

Figure 2.4: The entity relationship diagram for the CropStoreDB database. Source: [URL 10]

Figure 2.5: Different sections superimposed on the entity relationship diagram for the CropStoreDB database. Source: [URL 10]

Figure 2.6: CropStore interface of CropStoreDB within the Brassica implementation. Source: [URL 11]

Figure 2.7: The trait phenotypic data section consists of population and trait data. Source: [URL 10]

Figure 2.8: Use case diagram to construct the master table (the PhenDATA table).

Figure 2.9: Table names are represented in shaded regions, whereas primary key-foreign key relationships are used to join Table1, Table2, Table3 and Table4. Table12, Table34 and Table1234 are joined using common columns. From left to right, follow the sequence to construct the tables (Table1, Table2, Table3, Table4, Table12 and Table34). Table1234 is the master table.

Figure 3.1: A generic structure of a Shiny UI along with a few basic features. The full user interface display is a fluid page. The light grey area is the sidebar panel and the rest is the main panel. The sidebar panel is usually for inputs, whereas the main panel is for outputs. Tab set panels can be used in both the sidebar and main panels. Within a tab set panel more than one tab panel can be designated, e.g. inputs and help are two tab panels within the tab set panel of the sidebar panel, and plot, summary and data are three tab panels within the tab set panel of the main panel.

Figure 3.2: The CS_PhenEXPLORER UI. The web page displays two panels: the sidebar panel on the left side of the page (light grey) and the main panel on the right side of the page. The sidebar panel has three tabs (inputs, graph options and help), whereas the main panel has two tabs (plot and data). Inputs such as data selection, variable selection, filter and categories options are provided in the sidebar panel. The Plot tab of the main panel displays a visual of the explored data after setting the Colour By, Column Split and Fill By options provided in the main panel of the web page. Superimposed circled numbers present the workflow and help the user to understand and reproduce the case study output.

Figure 3.3: The CS_PhenEXPLORER case study output. Trends in the explored data can be judged at a glance from this display. The size of the diamond shapes represents project_descriptors. Brassica napus was studied in relatively recent years, mostly in China for the IMSORB project. All three species have been studied in Great Britain. The OREGIN project studied only B. napus in Host response to Peronospora parasitica, Albugo candida and Brevicoryne brassicae were studied quite a long time ago, in 1996, with no subsequent records in the CropStoreDB for Brassica database [URL 1]. In the UK, B. oleracea was the most frequently studied species in recent years for the Mineral Analysis project.

Figure 3.4: The CS_PhenNAVIGATOR UI. The web page shows two panels: the sidebar panel on the left side of the page (light grey) and the main panel on the right side of the page. The sidebar panel has two tabs (data distribution and help), whereas the main panel presents the distribution of all score_values and the trait phenotypic data table. The PhenDATA table is fetched from CropStoreDB [URL 1]; it allows global search and offers a data filter box for each variate and design factor. The user's filter selections promptly update the data table, and the update plot button visualises the updated distribution plot.

Figure 3.5: The CS_PhenNAVIGATOR UI output of the case study. After setting two filters, napus and flowering time, the first 5 entries of the filtered data were displayed in a table. The distribution of the counts of the filtered score_values was shown in the main panel. The Download filtered data button downloads the filtered results as a .csv file with additional design factors not displayed in the UI. Hover over the plot to see zooming, tooltips, panning and download options. Superimposed circled numbers present the workflow and help the user to understand and reproduce the case study output.

Figure 3.6: The CS_DATACOMP UI. The web page shows two panels: the sidebar panel on the left side of the page (light grey) and the main panel on the right side of the page. The sidebar panel has two tabs (inputs and help), whereas the main panel is dedicated to the plot display and data table. By default, score_value and trial_year are selected as variable and group respectively. The user can use the drop-down menus to choose variable, group and plot type. Under the current scenario, score_value was fixed for variable selection. Histograms for flowering time in year 2003 and 2004 were compared. Both years have bimodal data distribution data for flowering time was more spread out and higher scored as compared to Further investigations often reveal the reason for bimodal shapes.

Figure 3.7: The CS_DATACOMP UI output of the case study. The data subset was first filtered for flowering time and erucic acid content. descriptor_name was selected as the group for a boxplot. The show points option was unchecked to get coloured boxes without points. Data for erucic acid content was highly negatively skewed, whereas flowering time was slightly positively skewed. The filtered data subset corresponding to the selected descriptor_names was used for plotting and displayed in a table. Superimposed circled numbers present the workflow and help the user to understand and reproduce the case study output.

Figure 3.8: The CS_DATAVISAN UI. The web page shows the sidebar panel on the left side of the page (light grey) and the main panel on the right side of the page. The sidebar panel has two tabs (inputs and help). The main panel also has two tabs, for the pivot table and the data summary. Variables from the given list can be dragged and dropped onto the vertical and horizontal blue panels of the pivot table. The first view of the application shows only the total number of records in the PhenDATA table of the CropStoreDB database [URL 1]. The pivot table can be downloaded in .csv format by clicking the Download Pivot Table button. The Table option is selected by default and can be replaced from a list of plotting options, such as heat map, bar chart, stacked bar chart, etc. Count has been set as the method of aggregation to populate the pivot table with the score_values. Score_value is the only numeric column in the PhenDATA table. The Data summary tab offers a statistical summary of the pivot table. Superimposed circled numbers present the workflow and help the user to understand and reproduce the case study output.

Figure 3.9: CS_DATAVISAN case study output. Within the pivot table, a bar graph of counts vs species by trial_year indicates that napus is the most studied species particularly in

Figure 3.10: CS_DATAVISAN case study output. Within the pivot table, a stacked bar chart of counts vs species by trial_year indicates that an almost similar amount of work has been done in oleracea each year from 2003 to Overall, napus is the most studied species.

Figure 3.11: Conceptual framework of the data sharing, visualisation and publishing system after deployment of the Shiny apps. The PhenDATA table is prepared from the trait phenotypic data section of the CropStoreDB database [URL 1]. The PhenDATA table is used in all apps. The analysis tool interface will be part of the existing CropStoreDB database [URL 1], where each tab (CS_PhenEXPLORER, CS_PhenNAVIGATOR, CS_DATACOMP and CS_DATAVISAN) is based on a distinct Shiny app.


ACKNOWLEDGEMENTS

First of all, I am grateful to The Almighty Allah (God), for enabling me to complete this thesis. This thesis would not have been possible without funding from Crops For the Future (CFF), and the collaboration between CFF and Southern Cross Plant Science played a pivotal role. I wish to express my sincere thanks to my supervisor, Dr. Abdul Baten, for every bit of his effort, time, support and encouragement. I would also like to thank my co-supervisor, Prof. Graham King, for believing in me and offering me this interesting project without knowing my strengths and weaknesses. Thank you for your tremendous support, guidance and boundless energy throughout the period. It has been an absolute pleasure and a privilege to work with you all. Thanks to all the faculty members of Southern Cross Plant Science for being helpful and supportive whenever asked, especially Dr. Carolyn Raymond, Dr. Bronwyn Barkla and the PhD student Mathew Welling. I consider myself extremely fortunate to have had good support throughout my academic career, both in Pakistan and in Australia. I would like to thank all faculty members of the Statistics Department of Postgraduate College Jhang, Punjab, Pakistan, and the School of Information Technology and Mathematical Sciences, The University of South Australia, for building my statistical concepts and computer skills, which helped me enormously to complete this thesis. Finally, to my wonderful family, I can't thank you enough for always being there for me.


ABSTRACT

The objective of the research project was to develop and refine interactive statistical analysis tools for a generic crop genetics data schema. A wide range of public domain databases have been generated to manage data describing different aspects of plant characteristics, taxonomy, biology, genetics and genomics. These data sets can be used to underpin cultivar development as well as to assist in genetic resource conservation, breeding and pre-breeding. A shortage of precise phenotypic data analysis tools is a key limiting step in rapid crop improvement that makes use of all available germplasm. A parsimonious but comprehensive data structure with rich trait phenotypic content, along with easy and open access to datasets and interactive tools, is desirable to encourage interaction between researchers and breeders. Moreover, informed access to integrated datasets will help develop a deeper understanding of phenotypic characteristics. In this project four tools were developed using the open source software R and Shiny for crop phenotypic data exploration, navigation, comparison and analysis. The CropStoreDB database (Love et al. 2012) was used as an example database to implement those tools. CropStoreDB is a MySQL-based database which manages genetics data and is equipped with raw datasets, relevant metadata, versions and descriptions. The tools are generic and can be implemented using other similar databases and plant species, for example wheat, corn, etc. Open source software was used to develop the interactive workflows for plant trait and phenotypic data, which enhances data navigation and offers real-time analysis tools. The interactive analysis toolkit developed will enable researchers and crop plant breeders to understand how existing varieties compare with available variation in the underlying genepool, and will contribute to efficient breeding of new varieties with improved quality and adaptation to the growing environment. The tools are intended to serve as an adaptable blueprint for future flexible interfaces which will be developed as part of CropStoreDB [URL 1].


SUMMARY

Chapter 1 is a comparative study which focuses on the role and scope of existing databases relevant to crop genetics and pre-breeding R&D. It addresses major challenges in setting up and maintaining crop plant repositories, plant genetic resources and the description of experimental materials, along with access, querying and retrieval of phenotypic trait and trial data. It also discusses offline and online analysis tools for downstream analysis of crop genetics and related data sets. Finally, it identifies the research gaps, sets the aims and objectives, and defines the rationale and experimental approach to fill these gaps.

Chapter 2 describes the generic structure of the data flow framework and crop genetics data repositories implemented with the MySQL relational database management system (RDBMS). The primary goal was to introduce the entities of the CropStoreDB database [URL 1]. Phenotypic data from 15 tables were combined into a single master table (the PhenDATA table) for analysis. This chapter also describes problems encountered during data preparation and recommends solutions. The CropStoreDB database [URL 1] is open source. The CropStoreDB schema and MySQL dump file are available [URL 2]. MySQL queries to generate the PhenDATA table can be found in the supplementary materials (see S1, S1.1, S1.2, S1.3, S1.4, S1.5 and S1.6).

Chapter 3 covers the core objectives of this project. It is therefore extensive and is broken down into four Shiny applications with distinct features and functionalities, i.e. CS_PhenEXPLORER, CS_PhenNAVIGATOR, CS_DATACOMP and CS_DATAVISAN. CS_PhenEXPLORER (3.3.1) offers data exploration and download from the CropStoreDB database. CS_PhenNAVIGATOR (3.3.2) allows data navigation using the PhenDATA table and offers download of the filtered data together with those design factors from CropStoreDB which are not displayed in the web interface. It also allows on-the-fly visualisation of the distribution of the filtered trait phenotypic data from CropStoreDB. CS_DATACOMP (3.3.3) offers comparisons of trait phenotypic data from CropStoreDB (the PhenDATA table) using statistical plots, such as density plots, boxplots, histograms and bar plots. CS_DATAVISAN (3.3.4) prepares trait phenotypic data from CropStoreDB (the PhenDATA table) in the form of a pivot table for analysis. CS_DATAVISAN not only converts long-format data into wide format, but also populates the table with score values. Visualisations and summary statistics within the pivot table are also provided. All Shiny applications are generic and distinct and are presented as standalone tools. They can also be merged into one application if desired.

Chapter 4 discusses the project scope, challenges, limitations and future work. This chapter aims to pool together the results derived from previous chapters and to address the core objectives. Overall, this project provides a good foundation to build on for future multivariate analysis approaches. This project will also facilitate the discovery of trait associations between phenotypic data subsets.
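The two data steps described in this summary (joining normalised tables into one long-format master table, then pivoting it into a wide table populated with counts of score values) can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the MySQL queries or the R/Shiny code used in this project, which are given in the supplementary materials. The table names trial and score, the two-table layout and all values are hypothetical toy data; only the column names species, trial_year, descriptor_name and score_value follow the PhenDATA table.

```python
import sqlite3
from collections import Counter

# Hypothetical, much simplified stand-ins for normalised database tables.
SCHEMA = """
CREATE TABLE trial (trial_id INTEGER PRIMARY KEY, species TEXT, trial_year INTEGER);
CREATE TABLE score (score_id INTEGER PRIMARY KEY,
                    trial_id INTEGER REFERENCES trial(trial_id),
                    descriptor_name TEXT, score_value REAL);
"""


def build_master_rows(conn):
    """Join the tables on their primary key-foreign key pair (long format)."""
    query = """
        SELECT t.species, t.trial_year, s.descriptor_name, s.score_value
        FROM score AS s
        JOIN trial AS t ON t.trial_id = s.trial_id
        ORDER BY s.score_id
    """
    return conn.execute(query).fetchall()


def pivot_counts(rows):
    """Long-to-wide pivot: species as rows, trial_year as columns,
    number of score_value records as cell values."""
    counts = Counter((species, year) for species, year, _name, _value in rows)
    years = sorted({year for _species, year in counts})
    species_names = sorted({species for species, _year in counts})
    return {
        species: {year: counts.get((species, year), 0) for year in years}
        for species in species_names
    }


def demo():
    """Build the toy database in memory and return (master rows, pivot table)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO trial VALUES (?, ?, ?)",
        [(1, "napus", 2003), (2, "napus", 2004), (3, "oleracea", 2003)],
    )
    conn.executemany(
        "INSERT INTO score VALUES (?, ?, ?, ?)",
        [
            (1, 1, "flowering time", 41.0),       # invented values
            (2, 1, "erucic acid content", 0.3),
            (3, 2, "flowering time", 45.5),
            (4, 3, "flowering time", 38.0),
        ],
    )
    rows = build_master_rows(conn)
    conn.close()
    return rows, pivot_counts(rows)


if __name__ == "__main__":
    master_rows, table = demo()
    for species, by_year in table.items():
        print(species, by_year)
```

Counting is used as the aggregation here because score_value is described as the only numeric column of the PhenDATA table; another aggregate, such as a mean per cell, would need only a change inside pivot_counts.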

CHAPTER 1: INTRODUCTION

The Food and Agricultural Organisation of the United Nations (FAO) predicts that a 70% increase in food production is required in order to meet the needs of 9.1 billion people by 2050 (de Sousa et al. 2016). Furthermore, due to the rapidly increasing population, residential areas are expected to triple in developing countries by 2030, which will ultimately translate into a loss of arable land (Liu et al. 2014). Many organisations worldwide are working on collaborative and/or independent projects with the aim of eradicating hunger and poverty. To meet these rapidly increasing demands, plant scientists require accessible, high quality data and analysis tools in order to develop efficient breeding strategies (Billiau et al. 2012; Wolfert et al. 2017).

Plant databases hold genetically diverse datasets in the form of spreadsheet-like tables, interrelated schemas, maps, queries, etc. Typically, genetic databases comprise phenotypic, genotypic and genomic data sets (Table 1.1). In this study, the main focus will be on phenotypic trait data repositories. A phenotypic trait is a distinct variation of a phenotypic characteristic of an organism; it may be either hereditary or due to environmental effects, but characteristically occurs as a blend of both (Lawrence 2005). For example, seed color is a character of an organism, while green, yellow and brown are traits. This study will summarize trait phenotype data repositories and their associated analysis and data mining tools. It also provides an overview of statistical analysis techniques and interactive analysis tools for characterising phenotypic data, and proposes a conceptual blueprint for future analysis.

In general, crop plant repositories face the following challenges:
- Integration and management of high throughput phenotypic data.
- Comprehending the architecture of multifaceted crop phenotypic traits and direct assessment of phenotypic data.
- Navigation from genotype to phenotype.
- Development and implementation of data retrieval tools to make the best use of metadata.
- Development and implementation of analysis tools without forfeiting predictability.
- Collaboration between biologists, bioinformaticians and statisticians.

Glossary

Data entities for genetic information used to encapsulate the scope of crop databases (Table 1.1).

Phenotype: The observable characteristics of an individual, arising from gene and environmental interactions, for example height, weight, color, etc.

Phenome: The complete set of phenotypic characteristics of a species (Soule 1967; Houle et al. 2010).

Visual markers: Markers that are a phenotypic expression and can be observed in a crop plant, for example seed length, drought tolerance, etc.

Gene: The basic hereditary unit; a gene can be pictured as a bead on a DNA string (Pearson 2006).

Genotype: The genetic makeup of an individual or group of individuals with reference to either a single trait (a nucleotide), a set of traits (a large genetic locus) or a whole compound of traits (entire genome).

Genome: The genetic substance of an organism, which comprises DNA and includes both coding and non-coding regions (Brosius 2009).

Genetic markers: Markers can be a gene, a visual characteristic or an expression of the genes. Markers help in the selection of a particular type of character or parameter used to separate different organisms from a group of organisms. Genetic markers are further divided into molecular and biochemical markers.

Molecular marker: Molecular markers are exact areas of the DNA, used to identify autosomal (any chromosome other than a sex chromosome) recessive genetic conditions (Bradley et al. 1998). Genetic markers can be visualised by different molecular techniques, for example SNP, SSR, ISSR, RFLP, AFLP, RAPD, etc.

Biochemical marker: Biochemical markers are used to identify expressions of genes and differences at the gene product level, for instance variations in amino acids and proteins (Mosby 2009). Enzyme and hormone activities are types of biochemical markers.

Genetic map: Genetic maps, also known as linkage maps, are a visual representation of the organisation of chromosomes, used to link a phenotype to a gene or region of the chromosome; they classically depend on markers (Hyten and Lee 2016).

Physical map: Physical maps are a visual representation of collected genetic markers and gene loci using actual physical distances, generally measured in numbers of base pairs.

Quantitative trait loci (QTL): Genomic regions associated with variation in a complex trait. QTL analysis is a statistical technique for quantifying genetic variation in complex traits by linking trait phenotype with gene markers (Falconer et al. 1996; Lynch and Walsh 1998; Miles and Wayne 2008).

Model organisms: Non-human species that are easy to maintain and breed, used for research purposes to understand biological phenomena (Fields and Johnston 2005).

Pedigree: A hereditary history of families (Lussier and Liu 2007).

Germplasm: Living genetic material, for example seed, tissues, etc., collected and preserved for research and breeding purposes.

Accession: An accession is generally regarded as a distinctive sample of germplasm maintained in a collection (Hamilton et al. 2002).

Metadata: Data that describes other data.

1.1 Role of data repositories for plant research and breeding

The primary role of crop genetics data repositories is to provide easily accessible and well curated phenotypic, genotypic and genomic data for plant research and breeding. Such data repositories integrate data mainly from two sources: directly from experiments (primary data) and from published literature (secondary data) (Kattge et al. 2011b). Collaborative organisations for crop plant research are actively working on various projects and collecting genetically diverse data from multiple species and various crops. Examples of such global collaborative organisations include the Consultative Group for International Agricultural Research (CGIAR) [URL 3] for multiple major projects, Crops For the Future (CFF) [URL 4] for underutilised crops, the International Rice Research Institute (IRRI) (McLaren et al. 2005) for rice, and the International Maize and Wheat Improvement Centre (CIMMYT) [URL 5] for wheat and maize.

An additional role of data repositories for crop plants is to provide a platform for the ultimate end use of genetics data. This data can underpin the breeding of new cultivars and may be used by those engaged in genetic resource conservation, pre-breeding, breeding and associated experimental research. This role can be achieved through analysis tools associated with such repositories, with which the productivity, competence and economy of cultivars can be enhanced. A well-structured crop genetics data repository will not only provide phenotypic data but also allow analysis to understand complex biological processes (Cobb et al. 2013).

1.2 Scope of existing databases relevant to crop genetics and pre-breeding R&D

Plant genetic resources and description of experimental materials

To capture the immense size and variety of crop phenotypic data and the complex genotype-to-phenotype relationship, different databases aggregate information according to research settings and goals. Table 1.1 lists a collection of crop genetics databases compared on several factors, e.g. key features of the database, data access policy, availability of stored data types, sources of information and quality of data. The compilation (Table 1.1) also provides the genetic resources and a description of the experimental material of a few selected databases that specifically contain trait phenotypic data along with other types of data, and which may or may not offer analysis tools.

Crop genetics databases (Table 1.1) are mostly designed using open source scripting languages and software, e.g. structured query language (SQL), Java, hypertext preprocessor (PHP) and practical extraction and report language (PERL) (Zimmermann et al. 2004; Lee et al. 2005; Love et al. 2012; Uszynski 2015a). Most database papers explain the data, related information and user functionalities, with only minor information on the design and software application (Grant et al. 2010; Groth et al. 2010; Kattge et al. 2011a). Only a few of them address database development (Lee et al. 2005; Love et al. 2012; Smith et al. 2012).

Generally, crop genetics databases are either single-site repositories, called centralized databases (Fernandez-Ricaud et al. 2005; Lee et al. 2005), or federated repositories, called integration databases, which are based on several interrelated data sites and unify other sources with a centralized database (Grant et al. 2010; Love et al. 2012; Krishnakumar et al. 2015b; Uszynski 2015b). Both centralized and integration databases follow some standards and nomenclature, provide web interfaces, and offer flexible options for data mining, display, download and analysis.

A number of crop genetics databases have been developed for crop species-specific genotype and phenotype data management. The International Crop Information System (ICIS) is implemented for rice as the International Rice Information System (IRIS) to manage genetic resources and crop cultivars (Bruskiewich et al. 2003b).

Table 1.1: Genetic resources and description of experimental materials.

| Database | Plants | Key feature | Data entrant | Data access policies | Demo/Help manual | Web address |
|---|---|---|---|---|---|---|
| Plant databases: | | | | | | |
| KDDart - Knowledge Discovery and Delivery Art | Wheat, Barley, Brassica, etc. | Plant and animal breeding. Genome profiling. | C | CA | AC | yarrays.com/kddart |
| Grin Global DB | Various plants. | Germplasm info about plants, animals, microbes and invertebrates. | C | OS | A | rs-grin.gov/ |
| InterStoreDB | Brassica, Underutilized crops | Navigation between genotype & phenotype with metadata. Underutilized crops data in the pipeline. | C | OS | A | re.org |
| TRY DB | Various plants. | Web-archive of plant biodiversity at the global level. Promotes trait-based methods. | C | OS | S | try-db.org |
| Brassica Information Portal | Brassica | Population and trait scoring info about the Brassica breeding community. | C,U | OS | S | ac.ac.uk/ |
| Germinate DB | Wheat, Barley, Maize, Pea, Potato, Grasses, Arabidopsis | Depository of substantial germplasm with more advanced data types. | C | RA | AC | on.ac.uk/germinate/ |
| SGN - Sol Genomic Network | Potato, Pepper, Tomato, etc. | Allows linking of the phenome to the genome. | C,U | OS | A | nomics.net/ |
| Gramene | Rice, Maize, model plant species | Facilitates the study of cross-species comparisons. | C | OS | A | ramene.org/ |
| ARAPORT - Arabidopsis Information Portal | Arabidopsis | A one-stop-shop for Arabidopsis thaliana genomics. | U | OS | A | arabport.org/ |
| TAIR - The Arabidopsis Information Resource | Arabidopsis | Provides genetic and molecular biology data for Arabidopsis thaliana. | C,U | OA | A | rabidopsis.org/ |
| SoyBase DB | Soybean | Provides interlinked data on genomic, traits, phenotypes and related resources. | C | OS | A | e.org |
| Panzea DB | Maize | Examines relations between phenotype and genotype of complex traits. | | OS | NA | anzea.org/ |
| The Triticeae Toolbox | Wheat, Barley | Provider of germplasm line, pedigree, genotype and phenotypic data and core germplasm collections. | U,C | OS | NA | aetoolbox.org/ |
| IRIS - International Rice Information System | Rice | Global genetic and germplasm rice info. Links genotype to phenotype. | | CA | NA | org |
| CerealDB | Wheat | Offers various facilities for the study of the wheat genome. | U,C | OS | A | erealsdb.uk.net/cerealgenomics/cerealsDB/indexNEW.php |
| Ensembl Plants | Various plant species. | A genome-centric portal for plant species. | U,C | OS | A | ensembl.org |
| Human and animal databases: | | | | | | |
| OMIM - Online Mendelian Inheritance in Man | Humans | An online directory of human genes and genetic disorders. | C | OS | A | mim.org/ |
| Prophecy DB | Yeast | Evaluates phenotype of deletion strain based on growth behavior. Designed to mine, filter and visualise genome-wide data. | | OS | S | cy.lundberg.gu.se |
| PhenomicDB | Model organisms and Homo sapiens. | Integrated metadata search engine for the phenotype. Compares known phenotype for a given set of genes. | U,C | OS | A | henomicdb.de/ |

Selected databases specifically contain trait phenotypic data along with other types of data. Only the major crops of each database are mentioned. Animal/human databases are included because of either better functionality or interactive tools. Legend: trait phenotype; environment grown; genetic maps; QTL; genome; phenotypic data analysis tools available. Data entrant: C, curator; U, users. Demo/help manual: A, available; AC, available to clients; S, self-explanatory; NA, not available. Data access plan: OS, open source; OA, open to academics only; CA, controlled access; RA, restricted access (GNU General Public License required).

It offers researchers and breeders a platform for meta-analysis of rice crop data (McLaren et al. 2005). However, the use of ICIS is not feasible due to lack of maintenance. Germinate is a generic repository that focuses on data management and assimilates published, plant species-specific information on genotype, phenotype, pedigree, accession and marker metadata sets (Lee et al. 2005), but it is not linked to QTL or genome data. The Panzea database houses and shares phenotypic, SNP, sequencing, isozyme and molecular data via the Genomic Diversity and Phenotype Data Model (GDPDM) and the Genetic Diversity and Phenotype Connection (GDPC), respectively (Canaran et al. 2008). Although the above-mentioned databases are appropriate for data collation, the mapping of genetic loci to genomic data through metadata has not been fully adopted (Love et al. 2012). The SoyBase database (Grant et al. 2010) is interlinked with PLEXdb (Wise et al. 2007) for data distribution and hosting, which allows data exploration, experimental design, and analysis of experimental results in genetic and genomic settings. The Try database (Try DB) (Kattge et al. 2011a) is a platform for vegetation researchers. It offers a relational database containing quality-checked, standardized trait data, both published and unpublished, for various plant species, together with free-text phenotype descriptions and unrestricted access to metadata, but it is not linked to QTL or genome data. It is not equipped with any analysis tools, and only a subset of the data is directly downloadable.

Some crop genetics databases have user-friendly interfaces and often explicitly serve specific research communities. The Triticeae Toolbox (Carollo Blake 2013; Blake et al. 2016) uses trait ontology terms, specific trait names, and descriptor methods. It offers both raw and processed germplasm, SNP, phenotypic, and pedigree information from breeding programs, along with limited tools for phenotypic data analysis, but it is not linked to QTL or genome data. The Brassica Information Portal (BIP) [URL 6] is another repository, dedicated to the Brassica breeding community. It provides population, trait scoring and QTL information and links phenotype with genotype stored in external databases (Eckes et al. 2017), but it is not linked to the genome and does not provide any statistical analysis tools.

There are relatively few databases for phenomics data. PhenomicDB is a repository compiled from diverse organism-specific databanks, which allows browsing and comparison of known phenotypes for a given set of genes from different model organisms (Kahraman et al. 2005). There are also a number of crop genetics databases that contain germplasm collections. The global Germplasm Resource Information Network (GRIN) (Cyr et al. 2009) is a management system that provides researchers and breeding communities with global crop genebank and plant genetic resource data; however, it is not linked to QTL. The Panzea database (Canaran et al. 2008), on the other hand, aims to make predictions and to help understand regulation and heterosis in the context of the genome by focusing on a wide range of germplasm. To drill down into relationships and similarities, the SolGenes database (Paul et al. 1994) brings together, in a single data repository, genetic information logically related to several categories, such as germplasm collections and pathologies, DNA sequences, and genetic and physical maps, along with other types of information that allow the phenome to be linked to the genome.

Sometimes, relatively old database management systems are unable to accommodate voluminous and rapidly increasing phenotypic data efficiently and ultimately become redundant.
For example, The Arabidopsis Information Resource (TAIR) has been overtaken by the Arabidopsis Information Portal (AIP, abbreviated as ARAPORT) in order to manage the rapidly growing information on Arabidopsis thaliana. It provides the Arabidopsis research community with flexible data mining and retrieval options for bulk download. At present ARAPORT offers only community-curated genomic data and related analysis tools, whereas the assimilation of phenotypic data is in progress (Krishnakumar et al. 2015b). Diversity Arrays Technology (DArT) recently offered a new IT infrastructure, Knowledge Discovery Diversity Arrays Technology (KDDart) (Uszynski 2015b), in order to handle high-volume data. KDDart, a commercial repository aimed mainly at breeders, is a modular platform capable of integrating high-throughput data originating from several sources. Its core module stores any type of raw and processed phenotypic data from breeding or natural populations (Uszynski 2015b). Access for end users is controlled and all services are paid.

There are few databases that facilitate comparative analysis of genome resources. The Gramene database (Yamazaki and Jaiswal 2005) is a grass genome resource that contains extensive genetic and genomic data aggregated from both published and unpublished sources. In Gramene, curated and automated relationships and displays can be queried using controlled vocabularies; however, queries need some prior knowledge or guesswork, and QTL are not easily visible in the CMap Viewer. CerealDB (Wilkinson et al. 2012) is an integrated online genomic resource for scientists and plant breeders. It contains a range of genomic datasets and offers an online, searchable database of SNP markers (Wilkinson et al. 2012); however, linking of SNP data with phenotypic data is not yet provided. Ensembl Plants (Bolser et al. 2016) is part of the Ensembl genome browser and annotation system, a centralized resource offering genome-scale information for an increasing number of sequenced plant species. Ensembl Plants offers access, retrieval, analysis and visualization of plant genome-scale data for multiple species (Bolser et al. 2016). However, it does not offer any phenotypic data analysis tools.
In order to maintain confidence in the evaluation and reproduction of associations between trait phenotypic data, CropStoreDB [URL 1], one of the core databases of InterStoreDB, provides end users with raw datasets, relevant metadata, versions, descriptions, parameters and applied algorithms (Love et al. 2012). A substantial overlap of data sets can be seen between CropStoreDB [URL 1] and BIP [URL 6]. Although CropStoreDB [URL 1] covers a wide variety of data types and has an excellent data query setup, it lacks phenotypic data analysis tools.

Phenotypic trait and trial data

Trial data integration is a time-consuming and expensive task, but it is a vital step in crop breeding that supports decisions ranging from cultivar improvement to end use (Yan 2014). Crop genetics and breeding studies produce the bulk of phenotypic data for each genotype from diverse locations and environmental conditions, but in practice the raw data is not publicly accessible (Zamir 2013). Only a few crop databases provide raw trial data (Love et al. 2012; Uszynski 2015b; Blake et al. 2016). Phenotypic traits are essential for understanding and predicting vegetation responses to global change; consequently, they demand well-organised and effective database tools for rapidly aggregating trait data (Kattge et al. 2011b), which are missing in most crop databases (Table 1.1). Plant trait information is also important for parameterising ecological models of vegetation features (White et al. 2000; Kutsch et al. 2009). Crop databases such as InterStoreDB, Germinate DB (Lee et al. 2005), IRIS DB (Bruskiewich et al. 2003b), GRIN GLOBAL (Cyr et al. 2009) and KDDart DB (Uszynski 2015b) provide phenotypic trait and trial data sets along with other relevant information (Table 1.1).

From the above discussion, there is a clear research gap and an obvious need for trait phenotypic data analysis tools in crop genetics databases. Less than satisfactory progress in crop improvement indicates that numerous technologically advanced genomic resources cannot be fully utilized for genetic improvement, particularly for multifaceted quantitative traits, where the shortage of precise, high-throughput phenotyping tools is one of the key causes (Mir et al. 2015).

1.3 Data quality, meta-data and consistency

Reproducible research outcomes require high data quality in crop genetics databases. Data quality can be achieved by implementing transparency and setting robust guidelines within the database, and it can be maintained by consistent and unambiguous curation (Love et al. 2012). Most crop databases are maintained by curators (see Table 1.1). Data consistency ensures that database constraints are not violated, particularly once a transaction occurs (Haerder and Reuter 1983). It follows that data transaction operations are carried out precisely, appropriately, and rationally with respect to the database interpretation (Michiels 1998). As there is no direct system to transfer published findings into a database, the curators' effort makes the data compatible with a structured data repository (Brookes and Robinson 2015). Moreover, the quality and utilization of data can be enhanced by standardization of traits, which facilitates comparison between diverse varieties (Kattge et al. 2011b). Not all research questions can be answered through standardized phenotyping structures, but specific traits can be examined in high-throughput settings with well-defined objectives and detailed attention (Cobb et al. 2013).

Phenotypic experiments are performed under certain conditions to address specific questions, but only the most relevant information is published, whereas plenty of unused information remains in databases and needs to be assimilated through meta-analysis (Granier and Vile 2014). A meta-analysis pools multiple studies to detect common effects when effects are consistent across studies and to explain the variation when effects vary (Rothman et al. 2008). The importance of metadata has increased immensely with rapidly accumulating data, as it can extend long-term usage and guarantee the persistence of data (Shrestha et al. 2010).
Phenotypic plasticity, the ability of one genotype to exhibit different phenotypes under different environmental conditions (Whitman and Agrawal 2009), is a complicated quantitative trait (Massonnet et al. 2010; Pérez-Harguindeguy et al. 2013) that can be addressed with fewer obstacles in the presence of metadata. The concept of genotype by environment interaction depends strongly on plant phenotypes (Des Marais et al. 2013), and the detailed characterisation of traits likewise calls for metadata recording (Fabre et al. 2011). A recent example of a system that encapsulates metadata is InterStoreDB (Love et al. 2012), which is unique in providing researchers with metadata to drill down into information when needed.

Trait and trial data are scattered across various online repositories and described with different vocabularies and keywords. These conflicting descriptions hinder researchers in making comparative analyses. ISA-Tab (Rocca-Serra et al. 2010) was the first software developed to report various types of experimental descriptions. The ISA-Tab format is flexible and, as an experimental metadata description standard, is considered a generic solution for separating metadata from the data itself. Its main limitation is the lack of the specified standards and reference ontologies that are required for data curation. Annotation of trait data using common reference vocabularies, also known as ontologies, is critical for data interoperability. A crop trait ontology hosts trait information that describes phenotypic traits in plants in order to standardize experiments. There is currently a determined effort in the phenomics community to address this issue. The Planteome (Cooper et al. 2016) is a platform of plant reference ontologies that logically integrates diverse datasets and is developing tools for data analysis and annotation. MIAPPE (Minimum Information About a Plant Phenotyping Experiment) (Ćwiek-Kupczyńska et al. 2016) is a semantic standard and checklist for the description of plant phenotypic data with the aim of better data interoperability, and it uses the ISA-Tab format (Rocca-Serra et al. 2010) for data collection and interoperability. The MIAPPE document serves as a checklist for a curator to confirm the presence of all the important data characteristics required for interpretation and replication of an experiment. The reference ontologies and lengthy vocabularies developed by the Planteome project (Cooper et al. 2016) and its collaborators are difficult to display in a web browser. To overcome this issue, the project developed a script to extract relevant terms. Likewise, the MIAPPE document is lengthy and can be confusing; it is still immature, and its authors are collaborating with developers on future developments. The breeding API (BrAPI) specifies an open and shared standard application programming interface (API) for serving data from plant databases to crop breeding applications. It is also in the development phase. Standardized vocabularies and ontologies improve data interoperability, consequently enabling the reuse of data by others (Cooper et al. 2016).

1.4 Data access, querying, and retrieval

Collaboration between biologists, bioinformaticians and statisticians is crucial for data access, query, and retrieval. Data access options, querying facilities and retrieval formats vary from one database to another. Well-equipped laboratories and refined computational resources will need to be employed to handle ever-growing phenotypic data (Cobb et al. 2013). Some prior knowledge of globally unique identifiers or synonyms, other accepted scientific names and vernacular names of crop species is important for effective data search in a crop database (Ningthoujam et al. 2012). For example, in the Gramene database (Table 1.1) curated and automated relationships and displays can be queried using controlled vocabularies (Yamazaki and Jaiswal 2005); however, the user needs some prior knowledge or guesswork.

Some crop genetics databases are not stand-alone applications and depend on other software. For example, Germinate requires MySQL, Apache Tomcat, and GWT to run, and the T3 (The Triticeae Toolbox) software requires UNIX, Apache, MySQL, and PHP. Both Germinate and T3 are available under the GNU General Public License (Lee et al. 2005). To expose curated data from other sources, ARAPORT uses software from the JBrowse project and InterMine (Krishnakumar et al. 2015a). The SolGenes and IRIS databases are also software dependent and provide user-friendly interfaces to access, mine and retrieve information (Paul et al. 1994; Bruskiewich et al. 2003a), but they require some prior knowledge of the existing datasets.
CropStoreDB [URL 1] offers a web service for information access through a population, trait, genetic map or QTL interface, and enables users to navigate between the core InterStoreDB databases through shared identifiers or cross references (Love et al. 2012). The ease of navigation in the CropStoreDB [URL 1] web service reflects the effort invested in an efficient setup. As discussed in chapter 2, complex queries are set up in the background in almost every crop database, and users normally retrieve datasets by downloading a query-generated or predefined flat file (Canaran et al. 2008), an MS-Excel file (Paul et al. 1994; Cyr et al. 2009; Love et al. 2012; Eckes et al. 2017) or a compressed data file (Canaran et al. 2008; Grant et al. 2010). KDDart allows data retrieval in JavaScript Object Notation (JSON), Comma Separated Values (CSV) and Extensible Markup Language (XML) formats, and geolocation data in GeoJSON format (Uszynski 2015a). Data in the Germinate database can be viewed and retrieved with the GDPC browser, a multifunctional Java-based interface that serves data as web services and saves it in XML format (Lee et al. 2005). Data retrieved from various crop databases differ not only in format but also in data types. For example, Excel files retrieved from CropStoreDB (Love et al. 2012) consist of multiple worksheets containing information about data provenance, data description, summary and map information, whereas a flat file retrieved from the Panzea database (Canaran et al. 2008) consists of a single sheet of trait phenotypic data only.

1.5 Downstream analysis of crop genetics and related data sets

1.5.1 Offline analysis tools

A variety of statistical and data mining approaches have been developed to handle high-throughput and multidimensional data. A few methodologies are discussed here in the context of crop phenotypic data, in order to present trends in research and to show how data analysis helps researchers and the breeding community.

Correlation analysis provides a scaled measure that quantifies the linear association between two continuous variables in terms of direction and strength. The coefficient ranges from -1 to +1 (Table 1.2), where the sign indicates the direction and the magnitude indicates the strength of the relationship (Cohen et al. 2013). Correlation analysis has a long history of usage in plant biology. For example, seed production in crested wheatgrass has been studied using correlation analysis (Dewey and Lu 1959).
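As a minimal sketch of the correlation coefficient described above, the following plain-Python function computes Pearson's r and reports the direction of the association. The function name `pearson_r` and the seed weight and grain yield values are hypothetical and chosen only for illustration; Table 1.2 lists the R functions that would normally be used for this analysis.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dx, dy))  # sum of cross-products
    ss_x = sum(a * a for a in dx)               # sum of squares of x
    ss_y = sum(b * b for b in dy)               # sum of squares of y
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: seed weight (g) and grain yield (t/ha) for five plots.
seed_weight = [30.1, 32.4, 35.0, 36.2, 39.8]
grain_yield = [2.1, 2.4, 2.6, 2.9, 3.3]

r = pearson_r(seed_weight, grain_yield)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction} association)")
```

The sign of `r` gives the direction and its absolute value the strength, so the result can be read directly against the strength bands given in Table 1.2.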
Correlation analysis has been used to understand the association and contribution of phenotypes towards grain yield in order to study yield improvement in a flower (Yasin and Singh 2010). Two populations of soybean have been studied using correlation analysis of genetic and phenotypic quantitative traits in order to study pleiotropic effects (Recker et al. 2014). Sometimes correlation can be misleading in the presence of confounding effects. In this situation regression analysis, a supervised learning technique, is used to evaluate the relation between one outcome variable (the dependent or response variable) and one or more explanatory variables (independent or predictor variables), which may include confounders (Cohen et al. 2013). The two measures are related in the sense that both deal with associations between variables. For example, if $y$ is the yield of a crop (dependent variable) and $x_1, \dots, x_k$ are phenotypic variables (independent variables), such as seed weight, plant length, soil fertility, etc., the regression model can be defined as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k,$$

where $\beta_0, \beta_1, \dots, \beta_k$ are the model parameters. Regression analysis has been used in plant sciences for many years to study the relationship between variables, to identify confounding effects and for prediction. Stability parameters have been defined using regression analysis in order to detect genetic differences between varieties over varied environments (Eberhart and Russell 1966). The effects of sowing trend and crop phenotype on radiation interception, use efficiency, growth, and production of Brassica juncea L. have been analysed using regression analysis (Jha et al. 2012). Regression analysis can be broadly divided into least square regression, nonlinear regression, factor regression, generalized least square regression (GLS), etc.

Some of the most important multivariate techniques are discussed below. Principal Component Analysis (PCA), an exploratory analysis technique, is one way to reveal the underlying sources of variation in complex data. PCA is applicable to multivariate analysis (MVA) for data reduction, where all phenotypes of a continuous nature follow a Gaussian distribution (Yang and Wang 2012). PCA also reveals the internal structure of the data in a lower-dimensional picture and extracts dominant patterns in terms of principal component loading plots (also called bi-plots) (Wold et al. 1987). PCA provides the basis for further analysis, e.g. factor analysis (FA).
PCA and population structure analysis of barley germplasm were used to identify 5 major subpopulations within the United States Department of Agriculture (USDA) National Small Grains Collection of 33,176 accessions, mainly distinguished by geographical origin and inflorescence trait architecture, in genome-wide association studies (GWAS) (Munoz-Amatriain et al. 2014). PCA is a linear transformation of correlated variables into linearly uncorrelated variables ($p < q$) that attains optimal variance in a data set, where $p$ denotes the principal components and $q$ the original phenotypic variables. For example, the first principal component is

$$p_1 = a_1 q_1 + a_2 q_2 + \dots + a_k q_k,$$

where $a_i$, $i = 1, 2, \dots, k$, are coefficients chosen to maximize the variance. PCA looks for the largest variance: a normalized direction in n-dimensional space is selected for the first principal component $p_1$ so as to capture the maximum variation in the phenotypic variables $q$. The next direction is then sought with maximum variance, but the search is restricted to directions perpendicular to all prior directions because of the assumption of orthogonality. Selection continues until n directions have been found. The resulting ordered set of $p$'s are the principal components (Shlens 2014).

The canonical correlation analysis (CCA) technique is recommended only for high-throughput data, to explore the correlation between two sets of multivariate vectors, all measured on the same crop. CCA has been used in the analysis of a winter squash population to determine the relationship between plant characters and yield constituents (Balkaya et al. 2011). Another study used CCA on a native melon population to find the correlation between phenological traits and yield (Naroui Rad et al. 2014). Consider two sets of phenotypic multivariate vectors, $U = (U_1, U_2, \dots, U_x)$ and $V = (V_1, V_2, \dots, V_y)$, which are two correlated random vectors. CCA then seeks linear combinations of the $U_i$ ($i = 1, 2, \dots, x$) and of the $V_j$ ($j = 1, 2, \dots, y$) for which the squared correlation between the two combinations is maximized (Härdle and Simar 2015).

In order to test the statistical significance of correlation in multivariate analysis, the multivariate analysis of variance (MANOVA) test is recommended (Muller and Peterson 1984). MANOVA is used to answer many research questions, for example the importance of the dependent variables, the effects and use of covariates, main effects and interactions between the independent variables, and the strength of the relationship between predictors. MANOVA is highly sensitive to outliers.
Outliers are values that lie far from other values and can easily be detected in a spread plot (or boxplot, scatter plot, or scatter plot matrix). Basic assumptions, such as normality, linearity and homoscedasticity, should be met (French et al. 2008). PCA and CCA are often viewed in parallel as both describe optimal variation: PCA explains optimal variance in one data set, whereas CCA explains cross variance between two data sets (Barnett and Preisendorfer 1987; Hsu et al. 2008).

Cluster analysis is an unsupervised learning technique used to identify an intrinsic grouping of similar objects. It aims to minimize within-group variation and maximize between-group variation (Everitt and Hothorn 2011). Frequently used clustering methods are hierarchical clustering, the k-means method, and partitioning around medoids (PAM). After years of effort, biologists created taxonomy, i.e. a hierarchical grouping of living things. Cluster analysis helped to create the discipline of mathematical taxonomy, which in turn has helped to discover hierarchical classification groupings automatically. The concept of numerical taxonomy for classification via natural grouping of species and genera was presented almost 60 years ago (Sokal 1966). For example, a hierarchical clustering approach (often represented as a dendrogram) has been used to classify Chinese mustard on the basis of phenotypic differences in order to reveal genetic diversity (Fu et al. 2006). Cluster analysis has also been used for understanding soybean phenotypes (Bi et al. 2014). PCA and clustering reduce the dimensions of phenotypic data while conserving the main axes of significant variation (Westoby et al. 2002; Laughlin 2014). These techniques are capable of mining dominant traits for refined groups, which can be further examined using standard statistical approaches (Topp et al. 2013).

Linear discriminant analysis (LDA) is a supervised learning and data reduction technique used for classification (Table 1.2). It is comparable with regression analysis, which tries to sort out the variables with more impact (Fisher 1936; McLachlan 2004).
LDA is also regarded as parallel to PCA and factor analysis (FA), as they all look for a linear combination of variables that best describes the data. LDA looks for differences among groups and is concerned with predicting group membership. It can also be used for crop disease diagnosis. LDA of phenotypic data has been used for parent selection in kiwifruit breeding in order to identify the most dominant characters between populations (Daoyu and Lawes 2000). Consider a training set that consists of 50% or 75% of the values of phenotypic attributes (e.g. seed weight, seed length, etc.) for each sample of an entity with an identified class (e.g. flower colour). The aim of classification is then to find a good predictor of the class of a future sample from the same distribution, given only its observed attributes (Venables and Ripley 2003). Data mining offers useful techniques to overcome the non-linearity problem, such as the naive Bayes method, decision trees, etc. Some of the most important and frequently used statistical methods are presented in Table 1.2.
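The classification setup described above for LDA (a training subset with known classes, used to predict the class of held-out samples) can be sketched as follows. This is an illustrative, single-feature version of LDA with a pooled variance, written in plain Python and not taken from the source. The function names, the seed weights, the flower colour labels and the 75% split with a fixed random seed are all hypothetical assumptions.

```python
import math
import random


def fit_lda_1d(values, labels):
    """Estimate class means, class priors and the pooled variance for one feature."""
    classes = sorted(set(labels))
    n = len(values)
    means, priors, pooled_ss = {}, {}, 0.0
    for c in classes:
        xs = [v for v, lab in zip(values, labels) if lab == c]
        means[c] = sum(xs) / len(xs)
        priors[c] = len(xs) / n
        pooled_ss += sum((v - means[c]) ** 2 for v in xs)
    variance = pooled_ss / (n - len(classes))  # shared (homogeneous) variance
    return means, priors, variance


def predict_lda_1d(x, means, priors, variance):
    """Assign x to the class with the largest linear discriminant score."""
    def score(c):
        return (x * means[c] / variance
                - means[c] ** 2 / (2 * variance)
                + math.log(priors[c]))
    return max(means, key=score)


# Hypothetical seed weights (g) for plants with a known flower colour.
white = [28.0, 28.9, 29.5, 30.1, 30.8, 31.2, 31.9, 32.4]
purple = [37.6, 38.3, 39.0, 39.7, 40.2, 40.9, 41.5, 42.3]
samples = [(w, "white") for w in white] + [(p, "purple") for p in purple]

# Use 75% of the samples for training and hold out the rest for testing.
rng = random.Random(42)
rng.shuffle(samples)
cut = int(0.75 * len(samples))
train, test = samples[:cut], samples[cut:]

means, priors, variance = fit_lda_1d([v for v, _ in train], [lab for _, lab in train])
correct = sum(predict_lda_1d(v, means, priors, variance) == lab for v, lab in test)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With more than one phenotypic attribute the same idea uses a pooled variance-covariance matrix. A confusion matrix on the held-out samples, as listed for LDA in Table 1.2, would then replace the single accuracy figure.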

Table 1.2: Summary of a few of the current capabilities of statistical analysis methods for crop phenotypic data.

Correlation
Features/Aims: The linear relationship between paired data.
Methodology: Ranges between -1 and +1. The sign indicates the direction of the relation. r up to 0.2, 'very weak'; up to 0.4, 'weak'; up to 0.6, 'moderate'; up to 0.8, 'strong'; up to 1.0, 'very strong'.
Type of data: Numeric continuous.
Conditions/Assumptions: Normality; linearity; homoscedasticity.
R packages/Functions: library (corrplot) cor(); (Hmisc) rcorr().
Graphics/Displays: Correlation plot; scatter plot; scatter plot matrix; heat map.
Issues: Affected by outliers.

Dot Plot Matrix
Features/Aims: Grouped associations.
Methodology: Same as above.
Type of data: Numeric continuous.
Conditions/Assumptions: Same as above.
R packages/Functions: library (Hmisc) dotplot2(); (lattice) panel.dotplot().
Graphics/Displays: Dot plot; scatter plot; scatter plot matrix.
Issues: Affected by outliers.

Least Square Regression (LSR) Analysis (or Ordinary Least Square Regression Analysis)
Features/Aims: Estimate relationships between a dependent variable and one or more independent variables.
Methodology: Fit the least square regression equation. Compute and analyse residuals. Compute the summary and verify statistical significance using the F-test and model fit using R-square. Minimizes the sum of the squared residuals.
Type of data: Continuous, categorical (as predictors).
Conditions/Assumptions: Same as above.
R packages/Functions: library (MASS) lm(), acf().
Graphics/Displays: Scatter plot with a regression line.
Issues: Affected by outliers. Check error terms for autocorrelation.

Robust Least Square Regression (RLSR) Analysis
Features/Aims: An alternative to LSR when the data have outliers. Detect influential observations.
Methodology: First fit the LSR equation. Use Cook's distance (or Cook's D) to combine information on the leverage (an observation with an extreme value) and the residual of the observation. If outliers are detected, apply RLSR. The last two steps are the same as above in LSR.
Type of data: Continuous, categorical (as predictors).
Conditions/Assumptions: Same as above.
R packages/Functions: library (MASS) rlm().
Graphics/Displays: Scatter plot.
Issues: RLSR does not address heterogeneity of variance. (It can be handled by the sandwich package after the lm function.)

Generalised Least Square Regression (GLSR) Analysis
Features/Aims: An alternative to LSR when residuals are not random and show a pattern (autocorrelation).
Methodology: First fit the LSR equation. If serial correlation is detected, apply GLSR. The R nlme package calculates autocorrelation by default. Model fitting will give Restricted Maximum Likelihood estimates. Add an autoregressive process of order 1, AR(1), and check the error term for autocorrelation. Stop if there is no pattern, otherwise add AR(2). Continue until the error terms are randomly distributed.
Type of data: Continuous, categorical (as predictors).
Conditions/Assumptions: Same as above.
R packages/Functions: library (nlme) gls().
Graphics/Displays: Scatter plot.
Issues: REML-GLSR can underestimate standard errors.

Mixed Effect Model
Features/Aims: Model and explore group-level variation.
Methodology: Fit a generalised linear model (fixed effects model) and pick variables with non-significant p-values, say x and y. Introduce random effect models: 1) with y and x as random effects; 2) with only y as a random effect; 3) with only x as a random effect. Fit the mixed effect models and choose the one with the lowest AIC and BIC values and the least residual variance.
Type of data: Data nested within groups; continuous, categorical (as predictors).
Conditions/Assumptions: Same as above.
R packages/Functions: library (lme4) lmer(); (arm).
Graphics/Displays: Scatter plot.
Issues: Dangerous to use when the number of blocks is very small.

Similarity Measure
Features/Aims: Identify the similarity between two objects. Used in clustering.
Methodology: Commonly used measures are Euclidean, Manhattan, etc.
Type of data: Numeric.
Conditions/Assumptions: Normality (otherwise scale); linearity.
R packages/Functions: library SimilarityMeasures; apcluster.
Graphics/Displays: Heat map.
Issues: Affected by outliers.

Multivariate Statistical Analysis: Data reduction

Principal Component Analysis (PCA)
Features/Aims: Extraction of dominant patterns. Basis for further analysis.
Methodology: To find optimal variance in one data set. Methodology already explained (section Offline analysis tools).
Type of data: Numeric continuous variables.
Conditions/Assumptions: Normality; orthogonality.
R packages/Functions: library (stats) princomp(); (psych) principal().
Graphics/Displays: Bi-plot; scree plot.
Issues: Not scale invariant.

Factor Analysis (FA)
Features/Aims: After data reduction, FA identifies unobservable events. Various approaches for the orientation of factorial axes, such as varimax rotation, factor rotation, etc.
Methodology: Select the number of factors using PCA. Apply factor analysis and measure factor scores. Plot factor scores to identify hidden phenomena. Factor scores are used for further analysis, such as regression.
Type of data: Both objective and subjective attributes.
Conditions/Assumptions: Linearity of common factors. Variance-covariance matrix is symmetric.
R packages/Functions: library (stats) factanal(); (FactoMineR); (Factoshiny).
Graphics/Displays: Scree plot; factor plot.
Issues: No direct analysis.

Cluster Analysis
Features/Aims: Identify natural grouping. Agglomerative and divisive clustering techniques.
Methodology: Select an initial item; it becomes the center of the initial cluster. Search for the next item until the least distant one from the previous cluster centroid is found; this is the center of the next cluster. Repeat the second step until k clusters are obtained and no item is left for assignment. Initial clusters are obtained by nearest cluster center assignment.
Type of data: Data from multivariate multinomial distributions.
R packages/Functions: library (PLINK); (candisc).
Graphics/Displays: Dendrogram; cluster plot; heat map; silhouette plot.
Issues: Imposes a hierarchy even if none exists.

Classification

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)
Features/Aims (same for both): Dimension reduction techniques. Find differences between groups. Predict group membership. Help in disease diagnosis.
Methodology (same for both): Seek a linear combination with the largest separation between groups. Choose LDA or QDA (Bartlett's test). Select prior probabilities. Use posterior probabilities to assess the uncertainty of classification. Assess efficiency through a confusion matrix.
Type of data (same for both): Numeric with one categorical (class) variable.
Conditions/Assumptions: LDA: multivariate normality; homogeneous variance-covariance matrices. QDA: heterogeneous variance-covariance matrices.
R packages/Functions (same for both): library (MASS) lda(), qda(); (klar).
Graphics/Displays (same for both): Scatter plot; scatter plot matrix; histogram with overlaid density curve.
Issues (same for both): Misclassification risk. Overfitting problem.

Features, methodologies, the type of data a particular method deals with, assumptions that need to be fulfilled prior to application of the method, some of the available R libraries with functions for the corresponding analysis, graphical displays associated with the specific method, and likely issues of the statistical methods are presented.
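The procedure summarised in Table 1.2 for detecting influential observations (fit the least squares line first, then use Cook's distance to combine leverage and residual) can be sketched for a single predictor as below. The table lists R functions; this sketch uses plain Python purely for illustration. The function names, the plant height and yield values, and the 4/n cut-off (a common rule of thumb that is not stated in the source) are assumptions.

```python
def fit_simple_ols(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def cooks_distance(x, y):
    """Cook's distance of every observation in a simple linear regression."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    intercept, slope = fit_simple_ols(x, y)
    mean_x = sum(x) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for a, e in zip(x, residuals):
        leverage = 1 / n + (a - mean_x) ** 2 / sxx
        distances.append((e * e / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data: plant height score (x) and yield (y); the last plot is suspect.
height = [1, 2, 3, 4, 5, 6, 7, 8]
yield_ = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 30.0]

distances = cooks_distance(height, yield_)
cutoff = 4 / len(height)  # assumed rule-of-thumb threshold
flagged = [i for i, d in enumerate(distances) if d > cutoff]
print("influential observations (0-based index):", flagged)
```

If any observation is flagged, the table's next step is to refit with a robust method rather than simply deleting the point.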

1.5.2 Online analysis tools

Only a few of the available crop databases provide online tools for data mining and statistical analysis of phenotypic data. KDDart is one of the modern databases that offers multiple applications to its clients: KDSmart for data collection, KDMan for data management, and KDCompute for data mining and statistical analysis (Uszynski 2015a), but these are not publicly available. The Triticeae Toolbox (Blake et al. 2016) (Table 1.1) is the most straightforward and user-friendly database. It offers basic data analysis tools (histogram, boxplot, trait vs trial comparison, etc.) for crop phenotypic data, along with some clustering techniques, in addition to genotype and genomic data analysis tools. datpav is a web platform and interactive exploratory tool for basic data analysis with ready-to-publish visuals (scatter plots, box plots, heat maps, pie charts, Venn diagrams, etc.). It allows users to load data and perform any of the three available data analyses: metabolomics, environmental and hydrodynamic (Biswas et al. 2011). Several crop databases list visualisation and analysis tools as part of the database among their future goals (Bruskiewich et al. 2003a; Lee et al. 2005).

Apart from crop databases, a few other databases also offer data mining and statistical analysis tools. Prophecy (Table 1.1) facilitates profiling of phenotypic characteristics in yeast over various environmental conditions, and has specific data analysis tools to evaluate the robustness of protein complexes and gene dispensability in regions of chromosomes (Fernandez-Ricaud et al. 2005). A limited number of genotypic and genomic databases also offer data mining and statistical analysis tools. GENEVESTIGATOR (Zimmermann et al. 2004) is an extensive microarray gene expression repository that offers multiple analysis tools for ribonucleic acid (RNA) profiling experiments. These include Gene Correlator, which uses Pearson's correlation to compare the intensity values of two genes, with a fully annotated and interactive graphic for identification of GeneChips, and Meta-Analyzer, which studies a number of gene expression profiles at once with respect to anatomy, growth phase, and environmental features and outputs normalized signal values as a heat map. GeneSetDB offers simultaneous comparison of the gene sets available in the database through a clustered heat map visualisation (Araki et al. 2012).

1.6 Summary and conclusion

Comprehensive phenotyping is an important complement to genome sequencing and is considered a major advance in biology over the last two decades (Schork 1997; Schilling et al. 1999; Bilder et al. 2009; Houle et al. 2010). However, phenotypic data collection and management remain challenging. There is also a considerable lack of public domain databases and of analysis tools to explore them. The few databases available in the public domain are limited in scope, as most of the data are stored in tables that accommodate only a few dimensions of trait data, which forces the remaining dimensions to be ignored (Madin et al. 2007). This restriction leads researchers to record only the most important information, while the rest remains in the experimenter's mind or notes (Michener et al. 1997). Another significant limitation is the lack of proper accompanying metadata, which results in inconsistency in the characterisation of a genotype by a unique phenotype and limits comparison between datasets (Granier and Vile 2014). Although technology has advanced immensely, human effort is still needed for phenotypic data compilation, coding and interpretation, which is a rate-limiting and time-consuming procedure (Lussier and Liu 2007). Moreover, data features and quality vary from one database to another, and at present there are insufficient methods for data retrieval and analysis. There is therefore a crucial need for phenotypic data management along with the development of novel tools (Lussier and Liu 2007).

1.7 Research gaps

Plant scientists are greatly concerned with the improvement of phenotypic characteristics in crop plants, such as yield, drought and pest resistance, and seed length. Phenotypic diversity is mainly a consequence of complex interactions between genotype and environment.
Development of statistical modules and applications will enable knowledge integration and sharing of crop phenotypic data in relation to genotype by environment interactions (Boer et al. 2007). Based on the review of current literature, various plant databases provide researchers with high quality data from multiple biological domains, e.g. phenotypic, trait and omics data (Table 1). A number of plant databases provide data management tools. However, very few databases are equipped with data analysis tools (Carollo Blake 2013; Uszynski 2015b), and most of these tools are available offline. Plant database analysis tools also provide limited resources to assist users. Moreover, the available analysis tools are inadequate to answer a number of research questions, such as comparisons between members of a species or variation between trait scores of plant populations grown in different countries. A parsimonious but comprehensive data structure with rich trait phenotypic content, along with easy and free access to datasets and interactive tools, is desirable for researchers and breeders, and would help them to understand the phenomena underlying phenotypic characteristics.

1.8 Aims

The main aim of this project is to refine online statistical tools and to develop an interactive workflow for plant trait phenotypic data using open source software. These online tools will help researchers to navigate data and will enable users to perform online, real-time analysis of trait measurements and comparative trait analysis using a range of statistical techniques. The refined tools will be part of the existing, well-established interface of CropStoreDB [URL 1] (Chapter 3). CropStoreDB [URL 1] is one of the core databases of InterStoreDB (Chapter 2) for managing and storing trait measurement data along with genetic information (Love et al. 2012).

1.9 Objectives

1. Preparation of data from the CropStoreDB database [URL 1].
2. Data exploration in CropStoreDB [URL 1].
3. Data navigation and download.
4. Visualise the distribution of the navigated results.
5. Data comparison using statistical graphs.
6. Prepare the dataset for analysis in the form of a pivot table.
7. Apply statistical analysis techniques to the pivot table.

1.10 Rationale

Genericity, simplicity, navigability and flexibility were favoured for this research. Evidence from the Triticeae Toolbox (T3) (Blake et al. 2016) (Chapter 1), which offers data download and limited trait phenotypic data analysis capability, suggested that data from databases can be fetched, visualised and analysed online. The refined tools fetch CropStoreDB [URL 1] (Chapter 2) data online from the MySQL relational database management system and provide flexible navigation, visualisation and analysis. The tools are generic and can be adopted by databases for other plant species, for example wheat or corn. The analysis tools are designed to help researchers understand, navigate and make decisions about crop phenotypic data. Comparison of composite phenotypic trait datasets from different trialling regimes enables the evaluation and comparison of distributions. The analysis tools are intended to help geneticists and plant breeders to assess data, frame questions and make informed decisions related to genetic variation. This is achieved by computing and analyzing complex phenotypes both within and between plant varieties.

1.11 Experimental approach

CropStoreDB [URL 1] was selected as the most appropriate database for addressing the project objectives. It is a relational database consisting of phenotypic data that are accessible from multiple interrelated tables (MySQL relational database management system). A master table was prepared by joining the interrelated tables of trait phenotypic data (Chapter 2). Shiny, a package for the open source statistical computing language R [URL 7], was used to turn statistical analysis tools into interactive web applications without requiring knowledge of CSS, HTML or JavaScript (Chapter 3). Options for data retrieval and download are provided. The tools were implemented and tested for each module, along with the workflow, by means of a case study, and were deployed into the readily available CropStoreDB [URL 1] following robust computer engineering practices. Accuracy and computational speed were considerations in the development of the data navigation, exploration and analysis tools.

CHAPTER 2: DATA PREPARATION

2.1 Introduction

Despite valuable research on crop cultivation, the majority of the crop and pasture improvement R&D, selection and breeding community faces a multitude of problems in maximizing crop yield for food security, poverty mitigation and overall sustainable progress. One of the reasons is poor access to the data generated by previous studies and experiments (Janssen et al. 2012). Data that currently reside in crop genetics data repositories are mainly accumulated from two sources: field trials, in the form of raw data, and scientific publications, as secondary data (Kattge et al. 2011b). Raw data may be generated in experiments prior to statistical analysis. In contrast, secondary data have previously been analysed and may be available for re-analysis, either to address new problems with old data or to address the original research problems with an improved statistical approach (Glass 1976). Accumulation of crop-specific data may provide new opportunities for research and breeding communities, although there can be major challenges in making the data accessible to users (Janssen et al. 2012). Data accumulation and retrieval is a complicated process; the most common factors contributing to the complexity of data aggregation and recovery include:

1. Prior to integration, raw and secondary data must pass through a data processing pipeline to make them compatible. This is typically carried out by curators according to the standards of the particular database, and may require data cleaning, scaling or normalization.
2. Data are entered into the database manually using various forms and interfaces. Manual data entry may result in common errors, such as mistyping, entry in the incorrect box, or selection of incorrect data from a list.
3. Data accumulated from diverse sources may collectively become big data, which raises a series of issues, such as heterogeneity, noise and spurious correlation (Fan et al. 2014).

4. A solution to these issues lies in the adoption of the FAIR data principles, which are becoming a key conduit for knowledge discovery and innovation by boosting the reusability of already published data. FAIR data rest on four principles: Findable, Accessible, Interoperable and Reusable (Dillo et al. 2016). Data are exchanged between databases (interoperability). Entry in one database immediately propagates data to other downstream databases. Such on-the-fly data exchange reduces the ability of users to verify that the initial data entry was correct. Speed and data quality cannot always go hand in hand.

A generic structure for accessing raw or secondary crop genetics and breeding R&D data is presented in Figure 2.1. It serves as a guide in which the complicated process of data accumulation and retrieval is indicated by arrows. The simplest way to collect raw data is directly and contemporaneously from field or other experiments. However, the management of field experiments is often a time-consuming and error-prone process. One approach to storing crop field and other experimental data is a specialised schema implemented in a generic Relational Database Management System (RDBMS). An RDBMS presents a simple and adaptable data management model, which can comprise a user interface, a set of tables and an SQL engine [URL 8]. Relationships between entities (such as plant_line or trait_descriptor) are generally presented in a relational schema. Various database management and data exchange systems have been used for crop plant R&D databases (Table 2.1). Crop genetics databases are often poorly described in the literature and documentation, and very few sources explain the database schema (i.e. the relationships between entities) or the methods used for data aggregation. Only a few papers are dedicated to explaining a specific database schema (Lee et al. 2005; Love et al. 2012), and there are perhaps none that describe data aggregation methods within crop genetics data repositories.

Table 2.1: Examples of crop genetics plant database resources, management and data exchange systems.

Database | Data types | Management/exchange system | Reference
CropStoreDB | Genetics | MySQL | (Love et al. 2012)
InterStoreDB | Genetics & Genomics | MySQL | (Love et al. 2012)
Germinate | Genetics | PostgreSQL | (Lee et al. 2005)
Triticeae Toolbox | Genetics | MySQL | (Blake et al. 2016)
Ensembl Genomes | Genomic | MySQL | (Kersey et al. 2014)
Gramene | Genetics & Genomics | MySQL | (Ni et al. 2009)
KDDart | Genetics & Genomics | Data Access Layer | (Uszynski 2015b)
BIP | Genetics & Genomics | JSON | (Eckes et al. 2017)
Cereal DB | Genomic | MySQL | (Wilkinson et al. 2012)

A generic data flow framework, with data aggregation at the local level, has been developed to describe repositories available at the national/multinational level (Figure 2.1). Data can be retrieved from any level according to the demands of the underlying studies and can be shared through globally unique identifiers, such as accepted scientific names or vernacular names, using metadata associated with primary or secondary data sources (Ningthoujam et al. 2012). The overall objective of this project was to refine interactive statistical tools for the exploration of diverse sets of crop phenotyping data in the research field of crop and pasture improvement, genetics, selection and breeding. To achieve this and to demonstrate the application of the interactive tools, it was decided to use existing datasets managed within the CropStoreDB database [URL 1], which was originally one of the core databases of InterStoreDB (Love et al. 2012).

2.1.1 InterStoreDB

Numerous studies have demonstrated that crop phenotypes are the consequence of genes interacting with the environment (Faccioli et al. 2009). However, the ease with which plant and crop scientists can navigate between genotype and phenotypic datasets remains a major challenge. The InterStoreDB database (Love et al.
2012) was a unique demonstration step towards the goal of navigating between genotype and phenotype data, and had distinctive features, such as metadata delivery to users for cross-database analyses, including the capability to drill down to data provenance. InterStoreDB represented a generic suite of integrated databases that was first demonstrated with Brassica-specific datasets spanning phenotype to genotype [URL 9]. However, the approach could readily be adopted for other plant species, such as wheat or corn. InterStoreDB was a collection of databases with three core databases, CropStoreDB [URL 1], SeqStoreDB and AlignStoreDB, which store genetics data related to plant experiments, sequence-related data and sequence alignment information, respectively (Love et al. 2012). The entity relationships of the CropStoreDB database [URL 1] and their links with SeqStoreDB and AlignStoreDB are shown in Figure 2.4. The InterStoreDB [URL 9] framework is a single portal that enables navigation of the manifold biological data involved in genetics and genomics research. Moreover, the metadata of each core database provide users with information for assessing comparative associations. The CMAP tool (Youens-Clark et al. 2009) was used to view and compare genetic maps, and functional annotation of sequenced genomes was provided via an Ensembl-based genome browser, as shown in Figure 2.2 and Figure 2.3. A schematic interlinking of the databases within InterStoreDB [URL 9], with details of captured and associated metadata, is presented in Figure 2.3.

Figure 2.1: Generic data flow framework for developing a crop genetics database at the local level and sharing the data at the global level. Redrawn from Ningthoujam et al. (2012). Acronyms used for global repositories associated with plant DBs: EBI, European Bioinformatics Institute; NCBI, National Center for Biotechnology Information; TAIR, The Arabidopsis Information Resource; GBIF, Global Biodiversity Information Facility; NBII, National Biological Information Infrastructure.

Figure 2.2: InterStoreDB databases schema. Source: [URL 10]

Figure 2.3: System architecture of InterStoreDB with three core databases, showing the division of captured and associated meta-data along with additional interfaces providing navigable links between databases. Source: (Love et al. 2012).

Figure 2.4: The entity relationship diagram for the CropStoreDB database. Source: [URL 10]

2.1.2 CropStoreDB

CropStoreDB [URL 1] comprises a relational database schema consisting of multiple interrelated tables. It is designed to manage genetics data and holds raw datasets, relevant metadata, versions, descriptions, etc. (Love et al. 2012). Each entity is linked to one or two other entities, describing the relationships within CropStoreDB and how CropStoreDB is linked to other databases. Six data sections superimposed on the entity relationship diagram for the CropStoreDB database [URL 1] are displayed in Figure 2.5. Each section of the entity relationship diagram covers a specific data type.
Population data provide access to curated datasets related to plant populations. Trait data provide access to trait measurement data. QTL data provide links to trait phenotypic data and linkage maps in order to identify quantitative trait loci, and make it possible to view traits that have been scored on diverse populations. Linkage maps provide information on the genetic maps that correspond to the population on which the trait was scored. Genotype data link population genotypic data to population data, genetic linkage maps and marker assays, giving access to annotated genome sequence and genetic markers and helping navigation between genotype and phenotype data. Marker assay data store sequence-related data, using internal links to trait data, alignment data and genotype data, and external links to sequence data repositories, e.g. GenBank, BrassicaEnsembl, etc. (Love et al. 2012).

The CropStoreDB database [URL 1] offers a web service for accessing information by choosing the population, trait, genetic map or QTL interface of interest, and enables users to navigate between the core InterStoreDB databases through common identifiers or cross references (Love et al. 2012), as shown in Figure 2.6. The ease of navigation in the CropStoreDB database [URL 1] web service reflects the effort invested in its setup. From the online implementation of the CropStoreDB database [URL 1], data can be downloaded as an Excel file, as shown in Figure 2.6. The retrieved Excel file consists of multiple worksheets that contain information about data provenance, data description, summary and map information (Love et al. 2012). Although the CropStoreDB database [URL 1] covers a wide variety of data types and has an excellent data query setup, it lacks phenotypic data analysis tools. Phenotypic data navigation and analysis is a challenging task, and the objective of this research project is to address this challenge.
2.2 Materials and methods

The study focuses on trait phenotyping data and is therefore limited to a subset of CropStoreDB database [URL 1] entities within two groups: Population Data and Trait Data (Figure 2.7). In the CropStoreDB schema [URL 1], each entity in the entity-relationship diagram (Figure 2.4) is implemented as a MySQL table, and 15 entities (tables) are relevant to the management of phenotypic trait data.


Figure 2.5: Different sections superimposed on the entity relationship diagram for the CropStoreDB database. Source: [URL 10]

Figure 2.6: CropStore interface of CropStoreDB within the Brassica implementation. Source: [URL 11]

Figure 2.7: The trait phenotypic data section consists of population and trait data. Source: [URL 10]

Table 2.2: Description of population data entities from CropStoreDB. Source: [URL 1]

Table name | Primary key(s) | Index(es) | Unique column(s)
plant_varieties | plant_variety_name | plant_variety_name | plant_variety_name
plant_variety_detail | plant_variety_name, data_attribution, data_provenance | plant_variety_name, data_attribution | plant_variety_name, data_attribution
plant_lines | plant_line_name | plant_line_name, plant_variety_name | plant_line_name
plant_accessions | plant_accession | plant_accession, plant_line_name | plant_accession
population_type_lookup | population_type | population_type | population_type
plant_populations | plant_population_id | plant_population_id | plant_population_id,
population_lists | plant_population_id, plant_line_name | plant_population_id, plant_line_name | plant_population_id

Table 2.3: Description of trait data entities from CropStoreDB. Source: [URL 1]

Table name | Primary key(s) | Index(es) | Unique column(s)
plant_trials | plant_trial_id | plant_trial_id | plant_trial_id
design_factors | design_factor_id | design_factor_id, Institute_id | design_factor_id
plant_parts | plant_part | plant_part | plant_part
plant_scoring_unit | scoring_unit_id | scoring_unit_id, plant_accession, scored_plant_part | scoring_unit_id
trait_descriptors | trait_descriptor_id | trait_descriptor_id, category, descriptor_name | trait_descriptor_id
trait_grades | trait_descriptor_id, trait_grade | trait_descriptor_id, trait_grade | trait_descriptor_id, trait_grade
scoring_occasions | scoring_occasion_id | scoring_occasion_id | scoring_occasion_id
trait_scores | scoring_unit_id, scoring_occasion_id, trait_descriptor_id, replicate_score_reading | scoring_unit_id, scoring_occasion_id, trait_descriptor_id, replicate_score_reading | scoring_unit_id, scoring_occasion_id, trait_descriptor_id, replicate_score_reading

Descriptions of the population data and trait data tables in the MySQL CropStoreDB database [URL 1] are given in Table 2.2 and Table 2.3, respectively. Each table has a primary key (PK) constraint, which uniquely identifies each record in the table. Foreign keys (FK) are not implemented. However, complete information is provided in a file that contains a referential dependency column describing the PK-FK relationships [URL 1]. The CropStoreDB database [URL 1] was initially implemented in MyISAM, which was the default storage engine in MySQL versions prior to 5.5. MyISAM lacks transactions, which is one reason why InnoDB is now the default storage engine of MySQL. The current CropStoreDB database [URL 1] structure is defined as a set of DDL (Data Definition Language) commands [URL 12]. It could be implemented in InnoDB or PostgreSQL. InnoDB is faster and supports transactions, but MySQL's handling of subqueries is a major weakness and, in particular, it does not support full outer joins [URL 13]. PostgreSQL is a more advanced and powerful open source database system, which outweighs MySQL owing to its flexibility, its support for storing JSON (JavaScript Object Notation) documents in the database, its full support for two or more levels of nested subqueries, and its simpler licensing. PostgreSQL is also ahead of MySQL in other respects, such as custom data types, flexible table inheritance, a better rule system, and database events [URL 13]. The latest version of the CropStoreDB [URL 1] schema includes a number of geometry columns. PostgreSQL directly offers some rudimentary geometric types that are well suited to storing geometry columns; however, geometry is not a standard SQL data type [URL 14]. Retrieval of data directly from the CropStoreDB database [URL 1] and analysis using the web interface is currently slow and time-consuming as a consequence of the MyISAM storage engine.
To address this limitation, a master table of trait phenotypic data was constructed from the CropStoreDB database [URL 1] and stored as an InnoDB table. A set of queries written in SQL (MySQL) [URL 15] joined 15 interrelated tables of trait phenotypic data to obtain a master table with 50 fields. No literature references could be identified regarding existing data aggregation methods for dynamic analysis from crop genetics data repositories. Only a fairly small number of crop genetics databases offer a platform incorporating data analysis capabilities (Bruskiewich et al. 2003b; Carollo Blake 2013; Uszynski 2015b). Moreover, no description could be identified of how data from different sections of a database are aggregated before being passed to the web interface for analysis. The material referred to discusses only which capabilities are offered and how they can be used (Janssen et al. 2015); the details of background data processing are poorly described.

In the current project, MySQL 5.7 [URL 15] SQL statements were used to construct the master table, queried from the population and trait data sections of the CropStoreDB database. A materialized view (Gupta and Mumick 1995) was considered the most suitable way to aggregate trait phenotypic data in the form of a master table and to optimise the efficiency of subsequent queries. However, since materialized views are not supported in MySQL 5.7, the master table was created manually as a database entity to act as a materialized view. To retain all records, a full join was considered appropriate. However, the full join is also not supported in MySQL 5.7. UNION and UNION ALL are supported in MySQL 5.7 and were used to construct the fully joined master table. To avoid mistakes in writing one long SQL query to create the master table, the 15 interrelated tables of trait phenotypic data were joined together hierarchically. Figure 2.8 and Figure 2.9 show how the population and trait data sections of the CropStoreDB database [URL 1] were joined to obtain complete sets of trait phenotypic data and associated metadata in a single master table, using the following three-stage algorithm:

i. Trait data consist of eight interrelated tables. The trait data are joined in Table1 and Table2 (Figure 2.8). Table1 joins plant_trials and plant_parts with plant_scoring_unit.
plant_scoring_unit and plant_trials are fully joined into one table on plant_trial_id, which is the primary key (PK) in plant_trials and a foreign key (FK) in plant_scoring_unit. This combined table is further joined with plant_parts on scored_plant_part, and the combination of all three tables is then joined with design_factors on design_factor_id to generate Table1. Similarly, Table2 joins trait_descriptors, trait_grades and trait_scores with scoring_occasions. trait_descriptors and trait_scores are fully joined into one table on trait_descriptor_id, which is the primary key (PK) in both tables. This combined table is further joined with trait_grades on trait_descriptor_id, and the combination of all three tables is then joined with scoring_occasions on scoring_occasion_id to generate Table2. Table1 and Table2 were joined using the common column scoring_unit_id to obtain Table12, which is the complete trait data section of the CropStoreDB database [URL 1].

ii. Population data consist of seven interrelated tables. The population data are joined in Table3 and Table4 (Figure 2.8). Table3 joins plant_variety_detail, plant_varieties and plant_lines with plant_accessions. plant_variety_detail and plant_varieties are fully joined into one table on plant_variety_name, which is the primary key (PK) in both tables. This combined table is further joined with plant_lines on plant_variety_name, and the combination of all three tables is then joined with plant_accessions on plant_line_name to generate Table3. Table4 joins population_type_lookup and plant_populations with population_lists. population_type_lookup and plant_populations are fully joined into one table on population_type, which is the primary key (PK) in population_type_lookup and a foreign key (FK) in plant_populations. This combined table is further joined with population_lists on plant_population_id, which is the primary key (PK) in population_lists, to generate Table4. As with Table12, Table3 and Table4 were joined using the common column plant_line_name to obtain Table34, which is the complete population data section of the CropStoreDB database [URL 1].

iii. Table12 and Table34 are joined using the common column plant_accession to create the master table (the PhenDATA table), as shown in Figure 2.8. The hierarchical structure used to build the master table is displayed in Figure 2.9.

The master table is a flat file with multiple NAs (missing values). Data in the master table are in a long format.
In long format data, each row is one time point per variate or design factor; each variate or design factor therefore has data in multiple rows, and any variate or design factor that does not change across time shows the same value in all rows [URL 16].

Figure 2.8: Use case diagram to construct the master table (the PhenDATA table).
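To make the long format concrete, the following minimal Python sketch uses a handful of invented rows. The line names, trait names and scores are hypothetical and are not taken from CropStoreDB. It shows one row per plant line and trait, and a pivot into a wide layout where an unscored trait simply has no entry (the gap that appears as NA or NULL in a flat table).

```python
from collections import defaultdict

# Hypothetical long-format records: one row per (plant line, trait) score.
long_rows = [
    {"plant_line_name": "line_A", "trait": "seed_length", "score_value": 4.1},
    {"plant_line_name": "line_A", "trait": "yield", "score_value": 2.7},
    {"plant_line_name": "line_B", "trait": "seed_length", "score_value": 3.8},
]


def to_wide(rows):
    """Pivot long-format rows into one dict of trait scores per plant line."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["plant_line_name"]][row["trait"]] = row["score_value"]
    return dict(wide)


wide = to_wide(long_rows)
# line_B was never scored for "yield", so that cell is absent in the wide view.
print(wide)
```

A pivot of this kind is the sort of reshaping implied by the pivot table step in the project objectives, although the thesis tools perform it in R rather than Python.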


Figure 2.9: Table names are represented in shaded regions, whereas primary key-foreign key relationships are used to join Table1, Table2, Table3 and Table4. Table12, Table34 and Table1234 are joined using common columns. From left to right, follow the sequence to construct the tables (Table1, Table2, Table3, Table4, Table12 and Table34). Table1234 is the master table (the PhenDATA table).
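The staged construction described above (build a small intermediate table, index its join column, then join it to the next table) can be sketched as follows. This is an illustration under stated assumptions rather than the thesis queries: SQLite stands in for MySQL, only three of the 15 tables are mimicked with reduced columns, the values and the trial_year column are invented, and LEFT JOIN is used in place of the emulated full join to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy stand-ins for three CropStoreDB tables (columns reduced, values invented).
cur.executescript("""
CREATE TABLE plant_trials (plant_trial_id TEXT PRIMARY KEY, trial_year INTEGER);
CREATE TABLE plant_scoring_unit (scoring_unit_id TEXT PRIMARY KEY, plant_trial_id TEXT);
CREATE TABLE trait_scores (scoring_unit_id TEXT, trait_descriptor_id TEXT, score_value TEXT);

INSERT INTO plant_trials VALUES ('T1', 2010), ('T2', 2011);
INSERT INTO plant_scoring_unit VALUES ('U1', 'T1'), ('U2', 'T1'), ('U3', 'T2');
INSERT INTO trait_scores VALUES ('U1', 'D1', '4.1'), ('U2', 'D1', '3.9'), ('U3', 'D2', '7');

-- Stage 1: store a small intermediate table, then index its join column.
CREATE TABLE table1 AS
    SELECT u.scoring_unit_id, u.plant_trial_id, t.trial_year
    FROM plant_scoring_unit AS u
    LEFT JOIN plant_trials AS t ON u.plant_trial_id = t.plant_trial_id;
CREATE INDEX idx_table1_unit ON table1 (scoring_unit_id);

-- Stage 2: join the indexed intermediate table to the next table.
CREATE TABLE table12 AS
    SELECT t1.*, s.trait_descriptor_id, s.score_value
    FROM table1 AS t1
    LEFT JOIN trait_scores AS s ON t1.scoring_unit_id = s.scoring_unit_id;
""")

rows = cur.execute("SELECT * FROM table12 ORDER BY scoring_unit_id").fetchall()
for row in rows:
    print(row)
```

Keeping each stage as a stored, indexed table means a mistake in one join can be inspected and rerun without repeating the whole chain.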

2.2.1 Data processing issues and solutions

Issue 1: When joining two tables in CropStoreDB [URL 1], a full outer join was needed to include all records from both tables, which has the potential to return very large datasets. However, a full join is not supported in MySQL. To overcome this problem, each pair of tables was first left joined, then right joined, and the results of the left and right joins were finally combined using the UNION ALL option.

Syntax

SELECT * FROM table_name1
LEFT JOIN table_name2 ON column_name1 = column_name2
UNION ALL
SELECT * FROM table_name1
RIGHT JOIN table_name2 ON column_name1 = column_name2

Issue 2: In the new master table of the CropStoreDB database [URL 1], missing score_values were sometimes entered as strings such as "missing", "no data" or "unspecified". This is a manual data entry issue, which arises particularly when the vocabulary is not properly defined and controlled; as a result, synonyms have been used to describe missing values. To solve this issue, a set of MySQL queries (S1.7) was used to convert any such missing-value string into NULL.

Issue 3: score_values are of the varchar data type. For analysis purposes, score_values should be numeric rather than varchar. A MySQL query (S1.8) was used to handle this issue. All non-numeric entries were also replaced with NULL.

Issue 4: The error message "ERROR 1114: The table X is full" was received while generating the master table. This required a modification of the hosted data limit set in my.cnf for all InnoDB tables combined. To address this issue, we selected Administration from the MySQL Workbench main menu, then Options File and InnoDB, checked the box innodb_autoextend_increment, entered the limit 512M and clicked Apply. A maximum of 512 MB of data can then be hosted in all InnoDB tables combined [URL 17]. MySQL was restarted after making the configuration change.

2.3 Results

2.3.1 Overview

Data were prepared by fully joining multiple MySQL tables into one entity for downstream analysis. The existing CropStoreDB database [URL 1] implementation for Brassica was used as the use case.

2.3.2 Object entity

A master table of phenotypic data, named the PhenDATA table, was derived from the trait and population data section entities of the CropStoreDB database. To generate the PhenDATA table, data were prepared by carrying out MySQL full joins (see S1, S1.1, S1.2, S1.3, S1.4, S1.5 and S1.6 in the Supplementary Materials). The 15 trait phenotypic data tables were joined in small sections, which were then joined together sequentially (Figure 2.9). The PhenDATA table contains 50 design factors and variates. With the use-case Brassica data, it comprises 434,934 entries, which are readily available in the CropStoreDB database [URL 1] for web access and Internet-based applications, as required by the project. The PhenDATA table contains the complete phenotypic data of the CropStoreDB database [URL 1]. The hierarchical construction of the PhenDATA table is displayed in Figure 2.9.

2.3.3 Technical implementation

The PhenDATA table is stored in MySQL as an InnoDB table. It is the result of multiple sets of queries and is a database object equivalent to a materialized view. Materialized views (Gupta and Mumick 1995) are not supported in MySQL, but an equivalent was created manually using built-in SQL features [URL 18]. In the current implementation the refresh process is manual.
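A manually maintained stand-in for a materialized view, as described above, amounts to storing the result of a query as an ordinary table and rebuilding it on demand. The sketch below illustrates that idea only. SQLite is used in place of MySQL, the table name phen_data and the helper refresh_master are hypothetical, and the two source tables are reduced to a couple of invented rows.

```python
import sqlite3


def refresh_master(conn):
    """Rebuild the stored master table from its source tables (manual refresh)."""
    conn.executescript("""
    DROP TABLE IF EXISTS phen_data;
    CREATE TABLE phen_data AS
        SELECT l.plant_line_name, l.plant_variety_name, a.plant_accession
        FROM plant_lines AS l
        LEFT JOIN plant_accessions AS a ON l.plant_line_name = a.plant_line_name;
    """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plant_lines (plant_line_name TEXT PRIMARY KEY, plant_variety_name TEXT);
CREATE TABLE plant_accessions (plant_accession TEXT PRIMARY KEY, plant_line_name TEXT);
INSERT INTO plant_lines VALUES ('L1', 'V1');
INSERT INTO plant_accessions VALUES ('A1', 'L1');
""")

refresh_master(conn)
before = conn.execute("SELECT COUNT(*) FROM phen_data").fetchone()[0]

# New source data stays invisible in the stored table until it is refreshed.
conn.execute("INSERT INTO plant_accessions VALUES ('A2', 'L1')")
stale = conn.execute("SELECT COUNT(*) FROM phen_data").fetchone()[0]

refresh_master(conn)
after = conn.execute("SELECT COUNT(*) FROM phen_data").fetchone()[0]
print(before, stale, after)
```

The trade-off is the usual one for precomputed tables: reads are fast because no joins run at query time, but the stored copy is only as current as its last refresh.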

2.3.4 Success measurement

It is recommended that the PhenDATA table be constructed in parts in order to obtain the complete phenotypic data from CropStoreDB [URL 1]. The execution times for creating tables and indexes, along with the number of retrieved rows, are displayed in Table 2.4.

Table 2.4: Number of retrieved rows, and table and index creation times, for building the PhenDATA table.
Table name Retrieved rows Table creation time Indexing time Table seconds 38 seconds Table seconds 33 seconds Table seconds 39 seconds Table seconds 27 seconds Table seconds 72 seconds Table seconds 80 seconds Table1234 (PhenDATA table) seconds -

To validate the PhenDATA table, the number of distinct score_values was computed before and after joining all 15 tables. trait_scores is the only table that contains a column named score_value, and it comprises 27,616 distinct entries. The PhenDATA table contains 27,617 distinct entries in the column score_value. The one extra distinct entry is NULL, as the PhenDATA table is a flat table with many NULL values (Section 2.2, step iii). The MySQL queries used for the PhenDATA table validation are given as supporting information in the Supplementary Materials (S1.7). Other data preparation tasks were performed after validation. The number of rows affected by the queries that perform these tasks, and their execution times, are given in Table 2.5.

Table 2.5: Other data preparation tasks, affected rows and processing time.
Task Affected rows Time Updated table1234 for score_values Text to NULL seconds Deleted rows where score_value=null seconds Added auto increment id column seconds Converted score_values from varchar to decimal seconds
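The validation and clean-up steps described above can be sketched as follows. This is a hedged illustration, not the S1.7 and S1.8 queries (which are not reproduced in this chapter): SQLite stands in for MySQL, the rows are invented, and the helper to_number and the column score_numeric are hypothetical names. The sketch compares distinct non-NULL score_value counts before and after the join, then maps missing-value strings and other non-numeric text to NULL while converting the rest to numbers.

```python
import sqlite3


def to_number(text):
    """Return text as a float, or None when it is missing or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


conn = sqlite3.connect(":memory:")
conn.create_function("to_number", 1, to_number)
conn.executescript("""
CREATE TABLE trait_scores (scoring_unit_id TEXT, score_value TEXT);
CREATE TABLE plant_scoring_unit (scoring_unit_id TEXT, plant_accession TEXT);
INSERT INTO trait_scores VALUES ('U1', '4.1'), ('U2', 'missing'), ('U3', '7');
INSERT INTO plant_scoring_unit VALUES ('U1', 'A1'), ('U2', 'A2'), ('U3', 'A3'), ('U4', 'A4');

CREATE TABLE master AS
    SELECT u.scoring_unit_id, u.plant_accession, s.score_value
    FROM plant_scoring_unit AS u
    LEFT JOIN trait_scores AS s ON u.scoring_unit_id = s.scoring_unit_id;
""")


def distinct_scores(table):
    """Count distinct non-NULL score_value entries (COUNT DISTINCT skips NULL)."""
    query = f"SELECT COUNT(DISTINCT score_value) FROM {table}"
    return conn.execute(query).fetchone()[0]


# Validation: the join should neither lose nor invent score values.
source_count = distinct_scores("trait_scores")
master_count = distinct_scores("master")

# Clean-up: add a numeric column; text such as 'missing' becomes NULL.
conn.execute("ALTER TABLE master ADD COLUMN score_numeric REAL")
conn.execute("UPDATE master SET score_numeric = to_number(score_value)")
cleaned = [
    row[0]
    for row in conn.execute(
        "SELECT score_numeric FROM master ORDER BY scoring_unit_id"
    )
]
print(source_count, master_count, cleaned)
```

Counting distinct values confirms that no score value was lost or introduced, but it does not detect duplicated rows, so a row count check per join key is a useful complement.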

The MySQL queries used to perform the data preparation tasks are given as supporting information in the Supplementary Materials (S1.8). Construction of the PhenDATA table took less than 10 minutes, and the table was stored in advance for downstream analysis.

2.4 Discussion

The work outlined in this chapter aimed to prepare trait phenotypic data for downstream analysis. This was achieved by joining multiple tables in a hierarchical manner. This approach was adopted to prepare the PhenDATA table, which is later used for all Shiny applications (Chapter 3). It was not feasible to run a set of queries to prepare data for each Shiny application in an interactive manner: aggregating data from multiple MySQL tables would place an unnecessary burden on the CPU and would take extended periods of time to execute. The Shiny server may also become disconnected during prolonged processing. The approach appeared successful, with as little as 10 minutes required to execute all the queries and to create and store the PhenDATA table as a database entity. The output of this task is measurable, accessible, attainable and readily available for use, and the refresh process can be performed significantly faster. Creation of the PhenDATA table is faster because the tables are indexed prior to joining (Figure 2.9). A full join of all 15 tables at once took more than one hour, so the approach adopted gave at least a six-fold increase in efficiency. A database schema similar to the CropStoreDB database [URL 1] can use this approach to achieve a similar level of efficiency. Alternatively, a set of independently created relational database tables can be joined together into a single master table using a similar approach. During the PhenDATA table construction, a few issues arose in the MySQL data processing.
The issues were handled through functionality available and supported within MySQL: UNION ALL was used to emulate the full join, which is not supported in MySQL; data were converted from varchar to numeric to get them ready for statistical analysis in R; and the limit cap of InnoDB tables was modified to avoid error messages regarding schema space.
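A full join can be emulated with UNION ALL as in the following minimal sketch. It is not the query set of the Supplementary Materials: SQLite (via Python's sqlite3 module) stands in for MySQL, and the two toy tables leaf_score and seed_score are hypothetical. The first SELECT keeps every row of the left table, and the second adds only those rows of the right table that found no partner.

```python
import sqlite3

# SQLite stands in for MySQL; both tables and their contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leaf_score (plant_id INTEGER, leaf_value TEXT);
    CREATE TABLE seed_score (plant_id INTEGER, seed_value TEXT);
    INSERT INTO leaf_score VALUES (1, 'leaf_1'), (2, 'leaf_2');
    INSERT INTO seed_score VALUES (2, 'seed_2'), (3, 'seed_3');
""")

full_join = conn.execute("""
    -- All rows of the left table, with matches from the right where present.
    SELECT l.plant_id AS plant_id, l.leaf_value, s.seed_value
    FROM leaf_score AS l
    LEFT JOIN seed_score AS s ON s.plant_id = l.plant_id
    UNION ALL
    -- Rows of the right table that have no match in the left table.
    SELECT s.plant_id AS plant_id, l.leaf_value, s.seed_value
    FROM seed_score AS s
    LEFT JOIN leaf_score AS l ON l.plant_id = s.plant_id
    WHERE l.plant_id IS NULL
    ORDER BY plant_id
""").fetchall()

for row in full_join:
    print(row)
```

Because the second branch is restricted to unmatched rows, UNION ALL does not produce duplicates of the matched rows, and the unmatched sides are padded with NULL, which is why a flat table built this way contains many null values.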

Crop field and other experimental databases help users to access only phenotypic data from the diverse data domains available; however, this aspect is not properly acknowledged in the scientific literature. Phenotypic data are typically managed and accessed differently in different crop plant databases. KDDarT (Uszynski 2015b) is a commercial platform used to organise phenotypic data into experiments. In KDDarT, sensors and other devices are used to improve the capture of phenotypic and environmental data and to assist in inventory management. Phenotypic data in KDDarT are stored in a database layer, and a database connector is used to connect the web services layer, DAL (Data Access Layer), with phenotypic data from both breeding and natural plant and animal populations. KDDarT also offers interactive data analysis tools to its clients, but no literature references were identified that explain how the data are gathered within the databases for further analysis.

The data generated by the Triticeae Coordinated Agricultural Project (TCAP) are gathered in the Triticeae Toolbox (T3), a database schema consisting of wheat, barley and oat databases (Carollo Blake 2013; Muñoz-Amatriaín et al. 2014; Blake et al. 2016). T3 Barley and T3 Wheat presently hold data on over 230K and 140K phenotypes respectively (Carollo Blake 2013). The T3 schema allows plant breeders and researchers to integrate, visualise and interrogate phenotype and genotype data (Blake et al. 2016). Like KDDarT, the Triticeae Toolbox (T3) provides users with a few phenotypic data analysis tools but does not describe its background data processing methods. Germinate (Lee et al. 2005) is a generic framework which organises both standard collection information and passport data along with phenotypic, genotypic and field trial data in MySQL. Users can query and export data from Germinate in logical groupings; however, it does not offer any phenotypic data analysis tools.
The aim of this project is to provide web access to the trait phenotypic data of crop genetics data repositories which have a CropStoreDB-compatible [URL 1] schema, and to provide tools to navigate and filter data subsets as well as to make comparisons between and within varieties. Trait phenotypic data in the CropStoreDB database [URL 1] are stored in 15 interrelated tables in the MySQL relational database. To integrate these tables into one web interface, several queries need to be run in the background to retrieve all phenotypic data, which has the potential to delay data retrieval. To mitigate this, the PhenDATA table was prepared to precompute and store the complete trait phenotypic data within the CropStoreDB database [URL 1].

The PhenDATA table is a manually created static table that resembles a materialized view (MV), a feature which the current version of MySQL (5.7) does not support. The PhenDATA table was found to behave similarly to an MV in terms of processing time. In an MV, query execution time decreases at the cost of space (Jogekar and Mohd 2013; Patel and Patel 2015). In the CropStoreDB database, storage is not limited (Love et al. 2012). The PhenDATA table is also an adaptive view (Patel and Patel 2015) and, unlike the original MV, is written according to the demand of the project. The PhenDATA table also requires view maintenance, that is, incremental updating to incorporate changes in the database; at present the refresh process is manual.

This data-rich research fulfils the FAIR data principles, i.e. Findable, Accessible, Interoperable and Reusable. Data preparation is both a human-driven and a machine-driven exercise. The PhenDATA table is one of the entities in the CropStoreDB database [URL 1] which fulfils all the principles. There are no barriers to finding explicit metadata information, as this is available on the web interface. Accessibility is the core aim of the project and is required in order to provide information to users in an easy and straightforward manner. Interoperability requires that metadata are easily accessible. The metadata contain details on provenance and meet the standard of data collection in the MySQL database. The metadata are easily downloadable and can be reused. Data use license information is also provided on the website [URL 1]. Crop genetics data management groups are developing fully functional data-driven sites to offer stable, open source and dynamic platforms to improve crop production and food security (Janssen et al. 2012; Eckes et al. 2017).
This project is one of the initiatives to address this, and forms part of a collaboration between Southern Cross University, Australia, the Earlham Institute, UK [URL 19] and Crops For the Future, Malaysia [URL 4]. This collaboration is part of DivSeek [URL 20], which aims to unlock the potential of crop diversity. The Brassica Information Portal (BIP) is also collaborating with Southern Cross University, Australia to develop a platform for easy phenotypic data access, navigation and analysis. The BIP [URL 6] has used the outcomes of the work described in this chapter to create a master table for phenotypic data. The implementation of this platform will be available soon to BIP users.

CHAPTER 3: R SHINY APPLICATIONS

3.1 Introduction

As discussed earlier (Chapter 1), various plant R&D databases support researchers in accessing quality data from multiple biological domains, e.g. phenotypic data and omics data from genomic, proteomic and metabolomic sources. These data management tools can assist pre-breeding research and development. Versatile database structures are required to provide access to datasets in a reproducible manner (Lee et al. 2005; Love et al. 2012; Eckes et al. 2017). Following the FAIR data principles of Findable, Accessible, Interoperable and Reusable (Dillo et al. 2016) has now become a fundamental basis for the development of plant R&D databases and should help increase the reusability and persistence of already published data. Datasets are findable through web interfaces which capture information such as metadata (Love et al. 2012). However, some users have difficulty in finding the desired dataset, and identifying this content is often subject to prior knowledge or guesswork. Ideally, consistent indexing and agreed controlled vocabularies help in making datasets easily findable. Additionally, for each category of datasets there should be a drop-down menu with all possible options to help increase findability. Improving findability has the potential to attract new researchers and save time for current users. Despite these benefits, there are limited examples of interoperable plant R&D databases, and only a subset of these have the ability to exchange and validate information (Lee et al. 2005; Eckes et al. 2017). The ideal framework for plant genetics R&D databases should allow the implementation of emerging data standards, which assure data quality, and provide structured formats for describing experiments. Although there are databases which ensure data persistence and the ability to use data over a long period of time (Lee et al. 2005; Love et al. 2012), these do not incorporate all FAIR data principles.

A number of existing database systems that manage crop genetic resource and experimental data are available online. These include GRIN-Global, the Sol Genomics Network (SGN), SoyBase DB and Panzea DB. The GRIN-Global database is a germplasm resource network which includes various crop plants (Cyr et al. 2009). It was first released in December 2011 [URL 21] and was developed by the ARS of the USDA. The GRIN-Global database platform is used to store and manage information associated with germplasm resources (e.g. seed bank accessions) and to deliver information from various gene banks around the world (Cyr et al. 2009; Postman et al. 2009). Although it includes a curator tool, a search tool and an admin tool, it is not currently equipped with any analysis tools. The Sol Genomics Network (SGN) for solanaceous crops (Paul et al. 1994), SoyBase DB for soybean (Grant et al. 2010) and Panzea DB for maize (Canaran et al. 2008) are also available for accessing phenotypic and genomic data online, but are likewise not equipped with online analysis tools. The Arabidopsis Information Portal (ARAPORT) (Krishnakumar et al. 2014), which superseded The Arabidopsis Information Resource (TAIR) (Huala et al. 2001), houses phenotypic and genomic data for the model plant Arabidopsis. TAIR pre-dates ARAPORT by at least 13 years and is being superseded (Krishnakumar et al. 2014) due to its lower data storage capacity. The Gramene data resource (Yamazaki and Jaiswal 2005) hosts genetic and genomic data for multiple plant species and facilitates cross-species comparisons at the genomic level. It is based on the Ensembl platform (Youens-Clark et al. 2010), although data queries need either some prior knowledge or guesswork, and QTL are not easily visible in the CMap Viewer.

The Brassica Information Portal (BIP) is dedicated to the management of Brassica experimental data, including population and trait scoring information along with genetic maps and QTL. Links are provided to genomic data stored within other external sources (Eckes et al. 2017). Similarly, InterStoreDB was an earlier implementation of Brassica genetic and genomic data (Love et al. 2012), and this resource can also be adapted to other crop plant species. CropStoreDB was one of the core databases of InterStoreDB and stores plant and crop genetics data [URL 1].
The BIP database structure is based on the CropStoreDB schema [URL 22], and its initial content is also inherited from the CropStoreDB for Brassica (Eckes et al. 2017).

Crop plant R&D databases that provide analysis tools are either not open source (Uszynski 2015b) or do not provide enough help and documentation for new users to understand and implement them (Bruskiewich et al. 2003b). KDDart (Knowledge Discovery and Delivery Art) is a commercial data repository for the support of plant and animal breeding and genome profiling. KDDarT provides tools for data collection, data management, analysis, reporting and viewing, and sensors and automation, but these are only made available to commercial clients (Uszynski 2015b). The Triticeae Toolbox (T3) (Carollo Blake 2013; Blake et al. 2016) uses trait ontology terms, specific trait names and descriptor methods, and offers both raw and processed information from breeding programs. T3 also offers limited tools for phenotypic data analysis, but it is not linked to QTL or genome data. The International Rice Information System (IRIS) (Bruskiewich et al. 2003b) is an integrated data platform for the management and integration of diverse data types which hosts both private and public datasets. It also provides analysis tools (McLaren et al. 2005), although use of the tools is currently not possible due to lack of maintenance. Only three crop plant R&D databases offer trait phenotypic data and analysis tools (Table 3.1).

Table 3.1: A list of crop plant R&D databases that offer phenotypic data analysis tools.
Database | Crop plants | Type of analysis | Software | Issues | Reference
KDDarT | Wheat, Barley, Brassica, etc. | Genetic and genomic data analysis tools | MySQL and data access layer | Commercial repository | (Uszynski 2015b)
Triticeae Toolbox | Wheat, Barley | Genetic data analysis tools | Linux, Apache, MySQL, and PHP | Cover specific crop plant | (Carollo Blake 2013)
IRIS | Rice | Genetic and genomic data analysis tools | ICIS | Lack of maintenance | (Bruskiewich et al. 2003b)

The CropStoreDB database manages genetics data and also enables end users to filter and download datasets along with the metadata (Love et al. 2012). Although it incorporates a wide variety of data types along with a data query setup, it currently lacks phenotypic data analysis tools. This project aims to provide users with interactive and easy-to-use comparative, multivariate and statistical analysis techniques to understand and analyse crop plant phenotypic data from CropStoreDB (Love et al. 2012).
In addition, this project aims to provide refined online tools to work alongside existing querying and reporting tools, and thus enable researchers and other end users to navigate data and perform online, real-time comparative analysis of trait measurements.

A wide range of proprietary and open-source tools and software has been developed to assist with experimental design and the assessment of crop phenotypic relationships. However, operating these tools requires adequate training and expertise. We decided to use the open source software R/RStudio [URL 7, 3.3] with its library package Shiny. Shiny is used for the refinement of statistical tools (online tools) and the development of interactive workflows for plant trait and phenotypic data. R [URL 7] is free software with more than 11,000 available packages [URL 45] for statistical analysis, and RStudio [URL 46] is an integrated environment for R [URL 7]. RStudio includes a code editor and debugging and visualisation tools, and can be used for workspace management. Most desktops and servers are suitable for RStudio. It can also be accessed through the web and integrates powerful coding tools designed to enhance productivity. Tools are refined in the form of dynamic applications with the help of well-built packages such as Shiny, along with various other library packages used in conjunction with R and Shiny. Shiny [URL 47] (Chang 2014) is a powerful R library package from RStudio [URL 46] which enables R users to develop interactive web pages without prior knowledge of JavaScript, CSS or HTML. Shiny developers write R code to enhance the performance of the apps. Web pages can interact with R and show R objects (data tables, plots, etc.) or any other results produced in R.

In this project, four Shiny applications (apps) with distinct features have been developed. The first app, CS_PhenEXPLORER (CropStoreDB trait phenotypic data exploration), facilitates data exploration: the user is able to explore the available variables and filter sub-variables from the CropStoreDB database [URL 1]. It offers an easy and well-guided way to view the data available in the CropStoreDB database [URL 1] without any prior knowledge of the content.
Trait phenotypic data within the CropStoreDB database [URL 1] are aggregated and stored in a master table (the PhenDATA table, Chapter 2). The first app facilitates data navigation, filtration and download, and limits the user's access to only the displayed design factors and variates. The second app, CS_PhenNAVIGATOR (CropStoreDB trait phenotypic data navigation and distribution), is designed to offer access to the other design factors of the CropStoreDB database [URL 1] which are not displayed in the interface. This app provides functionality for data navigation and distribution, allowing the user to navigate the data and download them along with other design factors from the CropStoreDB database [URL 1]. It also displays the overall distribution of the navigated results. The third app, CS_DATACOMP (CropStoreDB trait phenotypic data comparison), enables users to carry out comparisons of multiple data distributions. It is a challenging task to analyse data in a master table which is in a long format with only one numeric column (score_value). To facilitate analysis, it was necessary to translate the data from the master table into a pivot table, where the table entries are numeric, e.g. the count of score_values or an aggregate of score_values. The fourth app, CS_DATAVISAN (CropStoreDB trait phenotypic data pivot table, visualisation and analysis), enables the interactive construction of a pivot table and subsequent visualisation and analysis. Collectively, these four apps address the objective of the research project, which was to develop an analytical tool interface for a generic crop genetics data schema, where statistical tools are refined and deployed in an interactive manner.

3.2 Materials and Methods

Shiny apps consist of two parts [URL 23]: 1) UI (user interface): the UI controls the layout of the application, shows the application to the user and tells Shiny precisely where to place elements such as input options, tables and plots. 2) Server: the logic of the application is controlled by the server. A server is an instance of a computer program that serves another computer program, the client [URL 57]. This logic consists of commands that tell the web page how to respond to user interactions with the UI layout controls. The Shiny applications (apps) discussed in the subsequent sections are built for data visualisation, download and analysis, where some controls (features of the apps) allow users to manipulate inputs, tables and plots.
In short, the UI determines where input options, tables and plots are placed, whereas the server is in charge of generating the data tables and plots. Shiny applications with data exploration and statistical analysis techniques are the anticipated online tools of this project.
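The translation from a long-format master table to a pivot table with numeric entries can be illustrated with a minimal sketch. It is written in plain Python rather than with the R packages used in the apps, and the rows, the column meaning (line, trait, score_value) and the pivot helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format rows: (line, trait, score_value).
long_rows = [
    ("line_A", "height", 10.0),
    ("line_A", "height", 12.0),
    ("line_A", "yield", 3.0),
    ("line_B", "height", 11.0),
]


def pivot(rows, aggregate):
    """Return {line: {trait: aggregate(scores)}}; empty cells are None."""
    cells = defaultdict(list)
    for line, trait, score in rows:
        cells[(line, trait)].append(score)
    # Collect row and column labels before building the table, so that
    # looking up empty cells cannot add new keys to the defaultdict.
    lines = sorted({key[0] for key in cells})
    traits = sorted({key[1] for key in cells})
    table = {}
    for line in lines:
        table[line] = {}
        for trait in traits:
            key = (line, trait)
            table[line][trait] = aggregate(cells[key]) if key in cells else None
    return table


def mean(values):
    return sum(values) / len(values)


count_table = pivot(long_rows, len)   # number of score_values per cell
mean_table = pivot(long_rows, mean)   # average score_value per cell

print(count_table)
print(mean_table)
```

Each cell of the result summarises all score_values that share a row label and a column label, which is what makes the single numeric column of the master table usable for comparison and plotting.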

As background, it is important to introduce a few concepts frequently used in the development and refinement of Shiny apps (Table 3.2).

Table 3.2: Frequently used concepts for building Shiny applications.

As discussed earlier, a UI (user interface) controls the layout of the application, presents the app to the user and tells Shiny exactly where things go. The shinyUI() function gives back the same value that is passed to it and makes sure that the last expression of code in ui.R is a user interface.

Shiny apps use a default, inherent Bootstrap grid system for user interfaces. The Bootstrap web framework, also known as a front-end framework, is used to build web applications. Shiny offers two types of Bootstrap grid, fluid and fixed. The fluid layout has been used to get more control over the layout of the page. Shiny offers a number of options for laying out the components of a web application.

The fluidPage() function is a component of the default and more flexible grid system. Grid layouts can be used anywhere within a fluid page. It is used to create dynamic web pages, as it scales their constituents in real time to cover all of the available web browser width. Its layout comprises rows, which in turn consist of columns (Figure 3.1). A fluid page layout offers fluid row and column functions to achieve more precise control over the location of UI elements.

The fluidRow() function has been used to create rows and to ensure that their constituents occur on the same line (if the web browser has sufficient width).

The column() function has been used to include columns within the fluidRow() function. The column widths within a fluid row add up to 12. The width, expressed in units of the 12-unit grid, specifies the horizontal space that the column's components should occupy. The first parameter of the column() function is its width.

The sidebarLayout() function is an important first step for most applications. This layout allows users to choose inputs inside the sidebar and offers a large main area for output. For all developed Shiny apps, the sidebar was positioned on the left (Figure 3.1), which is the default position offered by Shiny. Rows and columns are created for the sidebar layout by using the fluidRow() and column() functions.

The sidebarPanel() function is purely a UI element which holds input controls that can be passed sequentially to sidebarLayout(). The width is an important argument of the sidebarPanel() function. The user can set the sidebar panel width to a value out of 12 total units for fluid layouts. The default width, i.e. 4, is used for all developed Shiny apps. The light grey column on the left-hand side of the UI, covering 33.3% of the total display, is the sidebar panel (Figure 3.1).

The mainPanel() function is used to display output elements. The user can set its width to a value out of 12 total units for fluid layouts. The default width, i.e. 8, is used for all developed Shiny apps. The white column on the right-hand side of the UI, covering 66.7% of the total display, is the main panel (Figure 3.1).

The tabsetPanel() function is used to subdivide the user interface into discrete sections in a Shiny application. This function has been used within both the sidebarPanel() and mainPanel() functions (Figure 3.1).

The tabPanel() function is passed to tabsetPanel(). It is used to form an individual subsection of the tabset panel. Many tab panels can be set up within the tabsetPanel() function. A tab panel can hold a plot, summary or table output, etc. (Figure 3.1).

Shiny apps are dynamic in nature: the response to user input creates on-the-fly changes in the output. The following approaches are frequently used to make these apps more dynamic.

The conditionalPanel() function is used in the UI and wraps a set of UI elements, dynamically showing or hiding content depending on a condition.

The shinyServer() function accepts both input and output parameters and defines the relationship between inputs and outputs.

Reactive expressions are used to build these apps: R expressions in server.R compute output values that serve as conditions in conditional expressions. Because these conditions use output values that require server-side calculations, the apps become somewhat heavier.

A checkbox is a simple widget used for logical input control which returns TRUE when it is checked and FALSE otherwise.

A radio button is another input control widget, used to select one option from a given list of options.

The download button is a UI element which offers an on-the-fly export feature for web browser download. The downloadHandler() function, which is a server function, is used to specify the name and contents of the corresponding file.

The uiOutput() function is used in ui.R, whereas the renderUI() function is used in server.R. Both work in conjunction with each other. The renderUI() function generates calls to UI functions, dynamically creates controls depending on the user's input and makes the outputs appear in a pre-set place in the UI. The uiOutput() function is used to tell Shiny where these controls should be rendered on the web interface.

Other rendering functions, renderPlot() for dynamic plotting and renderTable() for dynamic table output, are used in server.R and work just like the renderUI() function.

The renderDataTable() function, which creates dynamic tables with multiple features, is also used. It uses the JavaScript library DataTables without requiring prior knowledge of JavaScript.

The tableOutput() and dataTableOutput() functions are UI elements used to create table output elements.

The plotOutput() function is another UI element, used to create a plot output element.
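The percentages quoted above for the sidebar and main panels follow directly from the 12-unit grid. The short calculation below is only a worked example in Python, not part of the apps, and the helper names width_percent and fills_row are hypothetical.

```python
GRID_UNITS = 12  # total width of one fluid row in Bootstrap grid units


def width_percent(units):
    """Share of the row width taken by a column of the given grid width."""
    if not 0 < units <= GRID_UNITS:
        raise ValueError("column width must be between 1 and 12 units")
    return round(100 * units / GRID_UNITS, 1)


def fills_row(widths):
    """True when the column widths exactly use up one 12-unit row."""
    return sum(widths) == GRID_UNITS


sidebar_share = width_percent(4)   # default sidebar panel width
main_share = width_percent(8)      # default main panel width
print(sidebar_share, main_share, fills_row([4, 8]))
```

The default widths of 4 and 8 therefore fill one row exactly, which is why the two panels sit side by side when the browser window is wide enough.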


Figure 3.1: A generic structure of a Shiny UI along with a few basic features. The full user interface display is a fluid page. The light grey area is the sidebar panel and the rest is the main panel. The sidebar panel is usually for inputs, whereas the main panel is for outputs. Tab set panels can be used in both the sidebar and main panels. Within a tab set panel more than one tab panel can be designated, e.g. inputs and help are two tab panels within the tab set panel of the sidebar panel, and plot, summary and data are three tab panels within the tab set panel of the main panel.

3.2.1 CS_PhenEXPLORER CropStoreDB trait phenotypic data exploration

Several R library packages were used for quick data exploration and to identify on-the-fly trends in the trait phenotypic data of the CropStoreDB database [URL 1]. The shiny, ggplot2, scales, Hmisc, tidyr, DT, RMySQL, dplyr, DBI, shinyjs and shinyBS packages were used. The shiny (Chang 2014) package was used to develop this dynamic web application. Multiple pre-built Shiny widgets helped in building this interactive and powerful app. DBI (James 2012), which stands for database interface, is an R package which was used to facilitate communication between R and the CropStoreDB [URL 1] MySQL relational database management system (RDBMS). The DBI (R Database Interface) package was used to connect to and disconnect from MySQL, to execute statements and to make transactions on database objects. DBI splits the connectivity to the database management system into a front end and a back end, and the Shiny applications used only the front-end API. An application program interface (API) is a set of rules, fixed during the development of software, that programs follow when communicating with each other (Minnaert et al. 2002). DT (Xie 2015), which stands for data tables, is an R package which was used to display the CropStoreDB [URL 1] objects as data frames in the UI. Multiple features of the DT package, e.g. sorting, filtering and pagination, are very convenient and were used in all apps. The RMySQL (James et al. 2012) package is one of the implementations of the standard R database interface package DBI (James 2012). RMySQL (James et al. 2012) is a relational database interface accompanied by a MySQL driver in R, which was used in all Shiny apps to execute MySQL queries in R and make the output dynamically accessible in a web page.

The relatively simple and straightforward ggplot2 (Wickham et al. 2013; Wickham 2016b) package was used for powerful, elegant and complex plotting of data subsets fetched from the CropStoreDB [URL 1]. The ggvis (Chang and Wickham 2015) package is an interactive grammar of graphics. It takes the finest parts of the ggplot2 package and declaratively describes rich interactive web graphics in the reactive framework of Shiny. The ggplot2 (Wickham et al. 2013) and ggvis (Chang and Wickham 2015) packages were used in the Shiny apps for visualisation. The master table of trait phenotypic data within the CropStoreDB database [URL 1] was in wide format and contained only one numeric column, and needed to be converted into tidy data, which are easy to work with. The tidyr (Wickham 2016c) package was used for reshaping and aggregating the data fetched from the master table (see Chapter 2) stored in the CropStoreDB [URL 1]. The tidyr package was used in conjunction with the dplyr package (Wickham 2016a) for data manipulation, and with the ggplot2 (Wickham et al. 2013) and ggvis (Chang and Wickham 2015) packages for visualisation. The fast, user-friendly and consistent functions of the dplyr (Wickham 2016a) package were used for exploratory data analysis and common data manipulation tasks. It offers simple verbs, i.e. functions for data manipulation that translate manipulation ideas into code, e.g. filter() and select(), which are very useful and make it possible to work with data-frame-like objects from remote databases, e.g. the CropStoreDB [URL 1]. The scales (Wickham and Wickham 2016) package was used for easy, graphics-system-agnostic scaling in the Shiny apps. It helped in converting data values to perceptual properties, e.g. generating reader-friendly breaks and labels for tick marks, axes and legend keys across the data range. The Hmisc (Harrell Jr and Harrell Jr 2016) library, equipped with multiple functions, was used for data analysis, data visualisation and plotting, importing datasets and advanced table making. shinyjs is used to call customized JavaScript functions from R (Attali 2016). Within the server logic of a Shiny application, shinyBS is used to add a popover to a Shiny input, such as the reset data input [URL 24].

3.2.2 CS_PhenNAVIGATOR CropStoreDB trait phenotypic data navigation and distribution

The R library packages shiny, ggplot2, tidyr, DT, RMySQL, DBI, dplyr, plotly, shinyjs and shinyBS were used for data navigation, for download of the navigated data table with additional design factors, and for visualisation. The shiny, ggplot2, tidyr, DT, RMySQL, DBI, dplyr, shinyjs and shinyBS packages are described above (3.2.1). The plotly (Sievert et al. 2016) package was used to create an interactive web graphic from qplot, in which download, zooming, tooltip and panning options were enabled by default. plotly translates ggplot2 graphics into interactive as well as conventional web-based graphics. The R command merge.data.frame was used to fetch additional design factors from the CropStoreDB: the filtered data frame displayed in the UI is merged with the PhenDATA table of the MySQL CropStoreDB [URL 1] database using the intersect operation (inner join).

3.2.3 CS_DATACOMP CropStoreDB trait phenotypic data comparison

The R library packages shiny, RMySQL, DBI, DT, ggplot2, shinyjs and shinyBS (3.2.1) were used for comparison within and between the variates and design factors (3.3) of the PhenDATA table fetched from the CropStoreDB database [URL 1], using multiple plot types such as histogram, density plot, boxplot and bar plot. ggplot2 (Wickham et al. 2013), a fast and powerful graphics language for generating elegant and complex plots, is used for the construction of trellis plots (i.e. conditioning) and grouped representations of variables. For easy comparison, the ability to represent multiple factors of each grouped variable in distinct colours is a highly valued feature of ggplot2.
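The pattern described for CS_PhenNAVIGATOR, querying a filtered subset and then merging it back onto the master table to pick up additional design factors, can be sketched as follows. This is only an illustration: Python's sqlite3 module replaces R, DBI, RMySQL and merge.data.frame, and the phen_data table, its columns and its rows are hypothetical.

```python
import sqlite3

# SQLite stands in for the MySQL database; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phen_data (
        id INTEGER PRIMARY KEY, line_name TEXT, trait TEXT,
        score_value REAL, trial TEXT, site TEXT
    );
    INSERT INTO phen_data VALUES
        (1, 'line_A', 'height', 10.5, 'trial_1', 'site_X'),
        (2, 'line_B', 'height', 12.0, 'trial_1', 'site_Y'),
        (3, 'line_A', 'yield',   3.2, 'trial_2', 'site_X');
""")

# Subset shown in the interface: a few columns, filtered by the user's choice.
# The trait is passed as a bound parameter rather than pasted into the SQL.
displayed = conn.execute(
    "SELECT id, line_name, score_value FROM phen_data "
    "WHERE trait = ? ORDER BY id",
    ("height",),
).fetchall()

# Additional design factors of the master table, keyed by the shared id.
extra_factors = {
    row[0]: row[1:]
    for row in conn.execute("SELECT id, trial, site FROM phen_data")
}

# Inner join: keep displayed rows whose id also occurs in the master table
# and append the additional design factors to each of them.
merged = [row + extra_factors[row[0]] for row in displayed
          if row[0] in extra_factors]

conn.close()
print(merged)
```

Only the rows that the user has filtered are extended, so the download contains the extra design factors without the whole master table having to be shown in the interface.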

3.2.4 CS_DATAVISAN CropStoreDB trait phenotypic data pivot table, visualisation and analysis

The R library packages shiny, DBI, RMySQL, DT, ggplot2, Hmisc, reshape2, GGally, rpivotTable, psych, MASS, rvest, shinyjs and shinyBS were used for the interactive construction of a pivot table with multiple plot options. The shiny, ggplot2, Hmisc, tidyr, DT, RMySQL, dplyr, DBI, shinyjs and shinyBS packages are described above (3.2.1). reshape2 (Wickham 2012; Hadley 2014) is a relatively fast and memory-efficient R package which was used to flexibly restructure and aggregate the PhenDATA table of the CropStoreDB [URL 1]. GGally (Schloerke et al. 2014) is an extension of the ggplot2 package with several additional functionalities, which was used for plotting the transformed data from the user's populated pivot table. GGally was used to reduce the complexity of combining geoms, i.e. boxplot, bar, density, etc., with the transformed data (pivot table). psych (Revelle and Revelle 2016) is a toolbox of multiple functions, primarily for multivariate analysis, which was used to compute basic descriptive statistics of the user's populated pivot table data or uploaded data. The rvest (Wickham 2015) package is designed for data harvesting and expresses complex operations as pipelines consisting of simple, easily understood pieces. It was used to extract the data content from the pivot table for download, and then to parse the HTML file with the html() function to obtain a data frame for further analysis. The rpivotTable (Enzo 2016) package is an open source pivot table library which was used for the dynamic construction of the pivot table. Its default features facilitate slicing, dicing, and dragging and dropping of variables from the PhenDATA table of the CropStoreDB database [URL 1] into the UI, as well as multiple graphical displays of the pivot table.

All four apps serve as dynamic tools. The features, functionalities and outputs of each app are presented in the results section.
3.3 Results

The work outlined in this chapter aimed to develop interactive applications for downstream analysis of trait phenotypic data. Four apps were developed to address different aspects of trait phenotyping data. These apps were implemented in R Shiny. Installation requirements for R/RStudio are provided [URL 25]. An R function was prepared that checks whether the required R packages are already installed and, if not, installs all missing packages in one go. The source code of this function can be found in the supplementary materials (see S1.8).

Shiny apps can be deployed to web hosting servers, such as Shiny Server. Shiny Server is a platform for hosting multiple interactive applications with security, tuning and server monitoring [URL 26]. Users can navigate to Shiny applications over the internet with a web browser. We tested the Shiny applications locally in the Google Chrome and Mozilla Firefox web browsers. The supported browsers for Shiny Server [URL 27] are:

Google Chrome
Mozilla Firefox
Safari (accessible through Apple Software Update)
Internet Explorer (at present, the admin console is not supported)

Shiny apps also run on Windows. We verified that the apps work when run locally on Windows 8.1 Enterprise, a 64-bit operating system (OS), with 8.00 GB RAM. Shiny applications can run on an iPhone, Android device or iPad because Shiny's default components use Bootstrap extensively to support interactive web applications [URL 28], although we did not test the Shiny applications on these devices. Deployment of Shiny applications on Shiny Server is only supported on Linux [URL 29]. Windows users can run Shiny Server to host dynamic web applications using a technology such as VMware [URL 30], by installing RedHat/CentOS [URL 31] or Ubuntu [URL 32] in a virtual machine [URL 29]. Although the number of concurrent users for a Shiny application is limited, this can be configured using performance tuning options. By default, a Shiny application permits 150 concurrent connections across three users on a single application [URL 33].
The two key concepts of variate and design factor are allocated to the table fields of PhenDATA (Chapter 2). These are consistent with the nomenclature used in statistical packages such as Genstat. The design factors (DF) are the factors that define the experiments. DF are controlled independent variables whose levels are set by the experimenter [URL 34], such as species. Variates, in contrast, are random variables that take a numerical value for each member of a group [URL 35], such as the score values of a descriptor.

The PhenDATA table was used to test the tools. The table consisted of 434,934 records for 50 variates and design factors. The most important variate (trait), descriptor_name, and its design factors, i.e. country, plant_population, project_descriptor, species and trial_year, together with their score values (score_value), were fetched from the PhenDATA table of the MySQL CropStoreDB database [URL 1] and made available to users on a web page for further navigation and analysis (Table 3.3).

Table 3.3: A snapshot of the first 10 entries of selected columns of the PhenDATA table fetched from the MySQL CropStoreDB database [URL 1]. Source code: S1.10

The shiny package (Chang 2014) was used to develop a set of dynamic web applications. The interactivity between inputs and outputs in Shiny and its wide range of pre-built widgets helped in building interactive and powerful apps.

Table 3.4: Description of the master table (the PhenDATA table), listing the levels of each design factor (country, plant_population, project_descriptor, species, trial_year) and of the variate descriptor_name. Note: CHN, GBR and PRT are China, Great Britain and Portugal respectively.

country: 'CHN', 'GBR', 'PRT'
species: 'napus', 'oleracea', 'spp'
plant_population: 'BnaEC01_01', 'BnaTNDH_00', 'BolDCC_01', 'BolEC03_01', 'BolEC03_02', 'BolEC03_03', 'BolEC03_04'
project_descriptor: 'AIR3-CT920463', 'IMSORB', 'Mineral analysis', 'OREGIN'
trial_year: '1996', '2003', '2004', '2005', '2006', '2009'
descriptor_name: 'canopy leaf bagged dry weight', 'canopy leaf bagged fresh weight', 'early leaf bagged dry weight', 'early leaf bagged fresh weight', 'erucic acid content', 'flowering time', 'glucosinolate content', 'host response to Albugo candida', 'host response to Brevicoryne brassicae', 'host response to Peronospora parasitica', 'number of first branch', 'oleic acid content', 'plant height', 'seed mature time', 'seed oil content', 'seed weight', 'seed yield per plant', 'shoot boron content (B)', 'shoot calcium content (Ca)', 'shoot carbon content (C)', 'shoot copper content (Cu)', 'shoot dry weight', 'shoot fresh weight', 'shoot iron content (Fe)', 'shoot magnesium content (Mg)', 'shoot manganese content (Mn)', 'shoot nitrogen content (N)', 'shoot percent dry weight', 'shoot phosphorus content (P)', 'shoot potassium content (K)', 'shoot sodium content (Na)', 'shoot zinc content (Zn)', 'siliquae of main inflorescence', 'siliquae per plant'

The PhenDATA table is a flat file with multiple NAs (missing values), in which data are stored in a long format. In the long format, each row represents a single time point per variate or design factor, so each variate or design factor, e.g. species, has data in multiple rows, and any variate or design factor that does not change across time has the same value in all of those rows, e.g. napus (see Table 3.3). Collectively, the trait phenotypic data (the PhenDATA table) in the CropStoreDB for Brassica database [URL 1] cover seven plant populations and three species from three countries across six different years, were collected from four different projects, and provide information on 34 different descriptors (Table 3.4).

3.3.1 CS_PhenEXPLORER

Data exploration in a MySQL database is a major challenge. As the number of available crop plant data sets continues to grow, it becomes harder for users to find the information that might be of interest to them. CS_PhenEXPLORER is designed to address this challenge. As an intermediary output, the PhenDATA table from the CropStoreDB [URL 1] has been made available online in a web page for data exploration, using several functionalities of Shiny and of other R packages that work with Shiny.

Features of CS_PhenEXPLORER

CS_PhenEXPLORER enables the user to:
1. Visualise the PhenDATA table of CropStoreDB [URL 1] or upload a .csv data file of at most 30 MB.
2. Select variables. There are further slider inputs for numeric variables.
3. Filter variables.
4. Categorize variables and choose bins for numeric variables.
5. Split the plot into rows and columns, and aggregate by colour, group, size and/or fill.
6. Visualise data subsets with multiple graph options.
7. Download the plot and, through multiple options, the navigated data subsets.
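The long-format layout and the summary counts described above can be illustrated with a small sketch. It is written in Python purely for illustration (the thesis apps are implemented in R), and the four records and their score values are hypothetical stand-ins, not rows from the PhenDATA table.

```python
from collections import defaultdict

# Hypothetical long-format records: one score per row, with the design
# factors repeated on every row. None stands in for NA (missing value).
records = [
    {"species": "napus", "country": "CHN", "trial_year": "2009",
     "descriptor_name": "flowering time", "score_value": 161.0},
    {"species": "napus", "country": "CHN", "trial_year": "2009",
     "descriptor_name": "plant height", "score_value": 148.5},
    {"species": "oleracea", "country": "GBR", "trial_year": "2004",
     "descriptor_name": "shoot dry weight", "score_value": None},
    {"species": "oleracea", "country": "GBR", "trial_year": "2005",
     "descriptor_name": "shoot dry weight", "score_value": 3.2},
]


def distinct_levels(rows, columns):
    """Return the sorted distinct non-missing values of each column."""
    levels = defaultdict(set)
    for row in rows:
        for col in columns:
            if row[col] is not None:
                levels[col].add(row[col])
    return {col: sorted(levels[col]) for col in columns}


summary = distinct_levels(
    records, ["species", "country", "trial_year", "descriptor_name"]
)
for col, values in summary.items():
    print(col, len(values), values)
```

Counting distinct levels per column in this way is how statements such as "three species from three countries" can be checked against a long-format table.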

Functionalities of CS_PhenEXPLORER

CS_PhenEXPLORER offers a GUI with a logical workflow that guides the user through selecting parameters, such as the design factors and/or variates to plot on the x and y axes, filter options and additional input options (Figure 3.2). Data can be explored dynamically using self-explanatory input options. The entire trait phenotypic data section of the CropStoreDB database [URL 1] can be visualised in one go. The sidebar panel inputs allow selection of the x and y variables, which changes the x-axis and y-axis of the plot displayed in the main panel under the Plot tab. Slider inputs are additionally available for numeric variables only. A maximum of two variables can be filtered, and these need not be the ones selected as the x or y variables. The Categorize tab lets the user select variables to be treated as categories and set the number of bins for numeric variables. Multiple input options are available in the main panel of the UI to enhance the display. Inputs promptly fetch the results from the MySQL CropStoreDB database [URL 1], and the PhenDATA subsets can be visualised as a table and as a graph.

Table 3.5: List of user interactive features of CS_PhenEXPLORER provided as additional inputs.

Additional inputs: Plot types, Points, Lines
Points: Point size (slider); Point transparency (slider); Point type (drop down menu); Jitter (radio button)
Lines: Line size (slider); Line transparency (slider); Line type (drop down menu)

Additional inputs: Colour Group Split Size Fill
Colour By, Column By, Size By, Group By, Row By, Fill By (drop down menu for each; options: list of all selected variates and design factors)

There are some additional input options, such as the Plot types, Points and Lines options and the Color Group Split Size Fill options, which make the graph clearer and more readable. A range of user interactive features is provided (Table 3.5, Table 3.6 and Table 3.7). The additional input options for the plot (Table 3.5) are self-explanatory. The point and line size options apply only if the Size By option under Color Group Split Size Fill is set to None. Options to draw reference lines and an additional theme (Table 3.6) are provided to enhance the display. Furthermore, demos are provided in the sidebar panel to assist the user in selecting inputs, downloading data, visualising plots, etc. The Help tab shows brief instructions on the app's features and how to run it properly.

Table 3.6: List of user interactive features of the Graph options tab of CS_PhenEXPLORER.

Log transformation (checkbox; options: Log Y axis, Log X axis): Log transformation of the x and y variable inputs if required. Makes highly skewed distributions less skewed.
Y axis label (blank box; type text): Customize the default y label.
X axis label (blank box; type text): Customize the default x label.
Background Colour (dropdown menu; options: Grey, White, Dark Grey): Customize the default background colour of the plot.
Legend Position (dropdown menu; options: Left, Right, Bottom, Top): Customize the legend position of the plot.
Facet Scales (dropdown menu; options: Fixed, Free_x, Free_y, Free): Useful for tidying up the display by removing blank scales from the plot.
Facet Spaces (dropdown menu; options: Fixed, Free_x, Free_y, Free): Useful for tidying up the display by removing blank spaces from the plot.
Reference lines (checkbox): Draw an identity line in the plot. Draw a horizontal zero line in the plot. Draw a vertical line in the plot. Draw a horizontal line in the plot.
Additional Theme Option (checkbox): Use the ggplot black and white theme for the plot.
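The remark in Table 3.6 that a log transformation makes highly skewed distributions less skewed can be checked numerically. The following is a minimal sketch in Python (not the app's R code); the sample values are hypothetical and the moment-based skewness formula is one common definition, assumed here for illustration.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed scores: each value doubles the previous one.
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
logged = [math.log10(v) for v in raw]

print("skewness before log:", round(skewness(raw), 3))
print("skewness after log: ", round(skewness(logged), 3))
```

Because the raw values grow geometrically, their logarithms are evenly spaced, so the transformed sample is symmetric and its skewness is essentially zero, whereas the raw sample has a long right tail.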

Table 3.7: List of controls, save plot and data download features of CS_PhenEXPLORER.

Reset (refresh icon): Click the refresh icon to reset the Shiny application.
Update plot (button): Click the update plot button to visualise input changes.
Save plot (mouse click): Right click on the plot to see the copy/save plot options.
Download Filtered Data (button): Click the Download Filtered Data button to download the filtered data corresponding to the plot as a .csv file, which can be opened in Excel.

The data exploration app is generic and can be applied to any other crop plant database. A test run of the UI of this application on the BIP platform can be viewed at [URL 36]. More detailed functionalities are explained in the user manual [URL 25], and the source code can be found on GitHub [URL 37].

Case study of CS_PhenEXPLORER

1. Choose the data input. The user can either use the CropStoreDB [URL 1] (set by default) or upload a .csv data file from the local computer. This case study analyses the default data, which is the PhenDATA table of the CropStoreDB database [URL 1].
2. Choose the x and y variables: y variable = descriptor_name, x variable = trial_year.
3. Filter variable 1: select descriptor_name from the drop down menu and select all values (excluding only the NA entry in descriptor_name). Filter variable 2: select species, with the values napus, oleracea and spp.
4. Set the additional input options (Plot types, Points, Lines): set point size = 2.2 and point type = [...].
5. Set the additional input options (Color Group Split Size Fill): Colour By = country, Column Split = species, Size By = project_descriptor, Group By = None, Row Split = None, Fill By = None.

6. Set the options provided under the Graph Options tab in the sidebar panel: Y axis label = Project Descriptor Names, X axis label = Trial Years, Facet Scales = Free (to remove empty scales in the graph), Facet Spaces = Free (to remove empty spaces in the graph).
7. Click the update plot button after setting the inputs, otherwise the plot will not update.
8. Right click on the plot to see the copy or save plot options.
9. To visualise the data table, go to the next tab, Data, in the main panel. Multiple options to visualise and save data tables are provided.

The CS_PhenEXPLORER UI is displayed in Figure 3.2, and the trends in the explored data can be seen at a glance in the case study display shown in Figure 3.3.
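The filtering and binning behaviour used in this case study (at most two filter variables, NA entries excluded, numeric variables cut into a chosen number of bins) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the app's R implementation: the function names apply_filters and bin_numeric, the equal-width binning rule and the sample rows are all hypothetical.

```python
def apply_filters(rows, filters):
    """Keep rows whose value is in the allowed set for every filter column.

    `filters` maps a column name to a set of allowed values. The limit of
    two filter variables mirrors the behaviour described for the app.
    """
    if len(filters) > 2:
        raise ValueError("at most two filter variables are supported")
    return [
        row for row in rows
        if all(row.get(col) in allowed for col, allowed in filters.items())
    ]


def bin_numeric(values, n_bins):
    """Assign each value an equal-width bin index from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical rows; None stands in for an NA descriptor_name.
rows = [
    {"descriptor_name": "flowering time", "species": "napus", "score_value": 161.0},
    {"descriptor_name": None, "species": "napus", "score_value": 3.0},
    {"descriptor_name": "plant height", "species": "oleracea", "score_value": 80.0},
    {"descriptor_name": "seed weight", "species": "spp", "score_value": 4.1},
]

# Filter 1: every non-NA descriptor_name. Filter 2: the three species.
non_na_descriptors = {
    r["descriptor_name"] for r in rows if r["descriptor_name"] is not None
}
kept = apply_filters(rows, {
    "descriptor_name": non_na_descriptors,
    "species": {"napus", "oleracea", "spp"},
})
print(len(kept), "rows kept")
print(bin_numeric([r["score_value"] for r in kept], n_bins=2))
```

Selecting "all values except NA" is expressed here as an allowed set built from the non-missing values, which is why the row with a missing descriptor_name is dropped.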

Figure 3.2: The CS_PhenEXPLORER UI. The web page displays two panels: a sidebar panel on the left side of the page (light grey) and a main panel on the right side of the page. The sidebar panel has three tabs (Inputs, Graph options and Help), whereas the main panel has two tabs (Plot and Data). Inputs such as data selection, variable selection, filter and categories options are provided in the sidebar panel. The Plot tab of the main panel displays a visual of the explored data after setting the Colour By, Column Split and Fill By options provided in the main panel of the web page. The superimposed circled numbers present the workflow and help the user to understand and reproduce the case study output.


Figure 3.3: The CS_PhenEXPLORER case study output. The trends in the explored data can be seen at a glance from this display. The size of the diamond shapes represents the project_descriptor. Brassica napus was studied in the most recent years, mostly in China for the IMSORB project. All three species have been studied in Great Britain. The OREGIN project studied only B. napus in [...]. Host responses to Peronospora parasitica, Albugo candida and Brevicoryne brassicae were studied as long ago as 1996, with no subsequent records in the CropStoreDB for Brassica database [URL 1]. In the UK, B. oleracea was the most frequently studied species in recent years, for the Mineral Analysis project.

3.3.2 CS_PhenNAVIGATOR

Without any prior knowledge of the CropStoreDB database [URL 1], a user may wish to query the complete data set (the PhenDATA table) to filter traits that differ in terms of their species, trial year, country, etc. The CropStoreDB trait phenotypic data navigation and distribution app allows users to navigate and filter data subsets and to view a distribution plot of the filtered data. The distribution of the filtered data reveals, on the fly, distribution trends in the trait phenotypic data of the CropStoreDB [URL 1]. Users can not only filter data but also fetch from the CropStoreDB [URL 1] those design factors that are not displayed in the web interface.

Features of CS_PhenNAVIGATOR

CS_PhenNAVIGATOR enables the user to:
1. Navigate the data table.
2. Visualise the distribution of the navigated results. Hover over the plot area to get plot information and to see multiple options at the top right corner, such as zoom in/out, download plot, etc.
3. Download the navigated data sets along with other design factors not displayed in the browser. The application fetches these additional design factors for the navigated results from the PhenDATA table stored in the MySQL CropStoreDB database [URL 1].

Functionalities of CS_PhenNAVIGATOR

CS_PhenNAVIGATOR also offers a GUI with a logical workflow that guides the user through setting column filters for each design factor and variate, plotting the distribution of the filtered data, downloading the filtered data with additional design factors, etc. (Figure 3.4). It automatically fetches data from the PhenDATA table of the CropStoreDB database [URL 1] into the Shiny UI. The distribution plot of the filtered data and the data table are displayed in the main panel of the UI. The PhenDATA table consists of 50 variates and design factors, of which the 7 most important are displayed in tabular format.
The data table displays the first 10 entries; this number can be increased using the drop-down menu provided under show entries. The UI is dynamic and changes according to the choice of filters. Search is a global filter for the data table: typing the first few letters in the search box promptly filters the results in the table. The first row of the data table is dedicated to filter boxes (white boxes in the UI display) for each design factor and variate (Figure 3.4). Filters accept numeric and factor variable inputs. Sliders are available only for numeric variables. The user can set multiple filter conditions per design factor and variate. The distribution of score_values is dynamic and changes depending on the choice of filters. Moreover, demos are provided in the sidebar panel to assist the user in setting filters, downloading filtered data, updating the plot, etc. The Help tab shows brief instructions on the app's features and how to run it properly. A few other user interactive features are provided (Table 3.8).

Table 3.8: List of controls, save plot and data download features of CS_PhenNAVIGATOR.

Reset (refresh icon): Click the refresh icon to reset the Shiny application.
Update plot (button): Click the update plot button to visualise input changes.
Save/Download plot (mouse click): Right click on the plot to see the copy/save plot options. Hover over the plot to see the download plot option.
Download Filtered Data (button): Downloads the filtered data with additional design factors as a .csv file, which can be opened in Excel. The additional 43 design factors of the filtered data can also be viewed in the downloaded .csv file.

The data navigation and distribution app is generic and can be applied to any other crop plant database. A test run of the UI of this application on the BIP platform can be viewed at [URL 38]. More detailed functionalities are explained in the user manual [URL 25], and the source code can be found on GitHub [URL 39].

Case study of CS_PhenNAVIGATOR

1. Filter flowering time from the variate descriptor_name and set another filter, napus, for the design factor species.
2. 6,555 out of 434,934 entries are retained by the filters, of which only the first 5 entries are displayed in a table in the UI. Multiple options to visualise and save data tables are provided.
3. Click the update plot button to visualise the overall distribution plot of counts of the score_values of the filtered data in the UI. Hovering over the plot gives access to the zooming, tooltip, panning and download options.
4. The filtered data can be downloaded as a .csv file, with additional design factors not displayed in the UI, by clicking the Download Filtered Data button provided in the sidebar panel of the fluid page.

The data navigation and distribution UI is displayed in Figure 3.4. The filtered data table and the corresponding distribution plot can be seen in the case study display in Figure 3.5.
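The download step, in which the filtered rows shown in the UI are joined back to the full table to recover design factors that are not displayed, can be sketched as an inner join. The thesis uses the R command merge.data.frame for this; the Python below is only an analogous illustration. The key column record_id, the extra column replicate and the sample rows are hypothetical, since the actual join columns are not stated in this section.

```python
def inner_join(left_rows, right_rows, key):
    """Inner join two lists of dicts on `key`, keeping only matching rows."""
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        for right in right_index.get(left[key], []):
            merged = dict(right)   # start with every column of the full table
            merged.update(left)    # displayed columns keep their shown values
            joined.append(merged)
    return joined


# Hypothetical full table with one column ("replicate") hidden in the UI.
phen_data = [
    {"record_id": 1, "descriptor_name": "flowering time", "species": "napus",
     "score_value": 161.0, "replicate": "1"},
    {"record_id": 2, "descriptor_name": "flowering time", "species": "oleracea",
     "score_value": 140.0, "replicate": "1"},
    {"record_id": 3, "descriptor_name": "plant height", "species": "napus",
     "score_value": 148.5, "replicate": "2"},
]

displayed_columns = ["record_id", "descriptor_name", "species", "score_value"]
displayed = [{c: r[c] for c in displayed_columns} for r in phen_data]

# Column filters as in the case study: flowering time, napus.
filtered = [
    r for r in displayed
    if r["descriptor_name"] == "flowering time" and r["species"] == "napus"
]

download = inner_join(filtered, phen_data, key="record_id")
print(download)
```

An inner join is the natural choice here because only rows that survive the filters should appear in the download; rows of the full table without a match in the filtered view are dropped.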


Figure 3.4: CS_PhenNAVIGATOR - UI is displayed. The web page shows two panels: a sidebar panel on the left side of the page (light grey) and a main panel on the right side of the page. The sidebar panel has two tabs (Data distribution and Help), whereas the main panel presents the distribution of all score_values and the trait phenotypic data table. The PhenDATA table is fetched from CropStoreDB [URL 1]; it allows global search and offers a data filter box for each variate and design factor. The user's filter selections promptly update the data table; the update plot button must then be clicked to visualise the updated distribution plot.


Figure 3.5: CS_PhenNAVIGATOR - UI output of the case study is displayed. After setting the two filters, napus and flowering time, the first 5 entries of the filtered data were displayed in a table. The distribution of the counts of the filtered score_values was shown in the main panel. The Download Filtered Data button downloads the filtered results as a .csv file with additional design factors not displayed in the UI. Hover over the plot to see the zooming, tooltip, panning and download options. The superimposed circled numbers present the workflow and help the user to understand and reproduce the case study output.

3.3.3 CS_DATACOMP

Visual representations play an important role in gaining insight into data. CS_DATACOMP offers data comparison using four plot types: histogram, density plot, boxplot and bar graph. It manipulates the data in the background according to the user's inputs and updates the plot display in the main panel of the UI (Figure 3.6). The histogram is used to inspect the shape (either normal or skewed) of the frequency distribution of the score_values in the filtered data. A density plot is a continuous version of the histogram, whose bins are discrete. The total area under the density curve integrates to 1. The normal distribution is a symmetrical, bell-shaped distribution in which the data are evenly distributed between the two tails. Skewness is a departure from symmetry, where a longer left or a longer right tail indicates negative or positive skewness respectively. The histogram also helps to identify outliers, which are unusual values that lie at the extreme ends of the plot. The boxplot is another way of displaying the distribution of the data, in which the basic descriptive statistics (minimum value, first quartile Q1, median, third quartile Q3 and maximum value) help to reveal the variability within and outside the upper and lower quartiles. Outliers can also be spotted as separate points below the minimum or above the maximum values. A bar graph is used to compare groups, where bar heights are proportional to the number of cases in each group.

Features of CS_DATACOMP

CS_DATACOMP enables the user to:
1. Visualise the PhenDATA table of the CropStoreDB database [URL 1].
2. Dynamically update the plot based on UI inputs. Choose descriptor_name(s) to populate the data subset corresponding to the chosen descriptor_name(s). Then choose the variable (score_value is fixed), the group, and the plot type (histogram, density plot, boxplot or bar graph).
3. Visualise a histogram, density plot, boxplot or bar graph of the selected inputs.
4. Use the 'Show point' option, which applies only to the boxplot. Uncheck the box to get clear boxes and outliers.
5. Click on the plot area to copy or save the image.

Functionalities of CS_DATACOMP

CS_DATACOMP offers a straightforward GUI with a logical workflow that guides users through selecting inputs, such as the data filter option corresponding to descriptor_name, the group option and the plot type, in order to compare the trait phenotypic data (PhenDATA table) of the CropStoreDB [URL 1] (Figure 3.6). CS_DATACOMP connects the UI with the MySQL CropStoreDB database [URL 1]. In response to the user's input options, it fetches from the PhenDATA table the filtered data subset corresponding to the chosen descriptor_name(s), which is displayed in the main panel (Figure 3.6). The filtered data table is used for further inputs, such as variable and group selection. The app lists all design factors and variates in the drop down menu for the user to select as the group. score_value is the only suitable variable for a histogram, which requires a continuous variable for binning. score_value can be grouped by any other variable to show the shape (distribution) of the data. Similarly, score_value grouped by any other variable gives a clear boxplot with outliers when the show points option is unchecked. For this reason the variable selection is fixed to score_value, which is the only numeric column in the PhenDATA table. Additionally, demos are provided in the sidebar panel to assist the user in selecting inputs, visualising and saving plots, etc. The Help tab shows brief instructions on the app's features and how to run it properly. A few other user interactive features are provided (Table 3.9).
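The descriptive statistics behind a boxplot, as described above, can be computed with a short sketch. This is illustrative Python rather than the app's R code. The sample scores are hypothetical, and the 1.5 x IQR rule used to flag outliers is an assumption: it is the conventional Tukey rule that boxplot implementations commonly use for drawing separate points, but the thesis does not state which rule the app applies.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max of a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}


def tukey_outliers(values):
    """Values lying more than 1.5 * IQR beyond the quartiles (assumed rule)."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]
    lower = summary["q1"] - 1.5 * iqr
    upper = summary["q3"] + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical score values with one unusually large observation.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

print(five_number_summary(scores))
print("outliers:", tukey_outliers(scores))
```

With this rule, a point drawn separately from the box and whiskers is one that falls outside the fences computed from the quartiles, which is how a single extreme score stands out in the plot.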

Table 3.9: List of controls, save plot and data download features of CS_DATACOMP.

Reset (refresh icon): Click the refresh icon to reset the Shiny application.
Update plot (button): Click the update plot button to visualise input changes.
Save plot (mouse click): Right click on the plot to see the copy/save plot options.
Download Filtered Data (button): Download the filtered data subset corresponding to the chosen descriptor_name(s) as a .csv file, which can be opened in Excel.

CS_DATACOMP is generic and can be applied to any other crop plant database. A test run of the UI of this application on the BIP platform can be viewed at [URL 40]. More detailed functionalities are explained in the user manual [URL 25], and the source code can be found on GitHub [URL 41].

Case study of CS_DATACOMP

1. Choose the descriptor_name(s) flowering time and erucic acid content to populate the data subset for further inputs.
2. Choose further inputs from the filtered data subset. The variable selection is fixed to score_value. Select the grouping variable trial_year.
3. Choose the plot type boxplot and uncheck the show points box.
4. Click the update plot button after setting the inputs, otherwise the plot will not update.
5. Click on the plot to copy or save the image.
6. The filtered data can be downloaded as a .csv file by clicking the Download Filtered Data button in the sidebar panel of the fluid page.

The data comparison UI is displayed in Figure 3.6. The boxplot can be seen in the case study display shown in Figure 3.7.
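The grouping step in this case study, in which score_value is split by a grouping variable before plotting, can be sketched as below, together with the fixed-width binning that underlies a histogram. This is a Python illustration only; the function names, the bin width and the sample rows are hypothetical and are not taken from the app's R source.

```python
from collections import defaultdict


def group_scores(rows, group_col, value_col="score_value"):
    """Collect the non-missing values of `value_col` for each group level."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(value_col)
        if value is not None:
            groups[row[group_col]].append(value)
    return dict(groups)


def histogram_counts(values, bin_width, start):
    """Count values per bin; each bin is keyed by its left edge."""
    counts = defaultdict(int)
    for v in values:
        left_edge = start + bin_width * int((v - start) // bin_width)
        counts[left_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical flowering time scores for two trial years.
rows = [
    {"trial_year": "2003", "score_value": 150.0},
    {"trial_year": "2003", "score_value": 152.0},
    {"trial_year": "2003", "score_value": 171.0},
    {"trial_year": "2004", "score_value": 161.0},
    {"trial_year": "2004", "score_value": None},   # NA is skipped
    {"trial_year": "2004", "score_value": 170.0},
]

by_year = group_scores(rows, "trial_year")
for year, values in by_year.items():
    print(year, histogram_counts(values, bin_width=10, start=150))
```

Comparing the per-group bin counts side by side is the tabular counterpart of comparing two histograms in the UI.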


Figure 3.6: CS_DATACOMP - UI is displayed. The web page shows two panels: a sidebar panel on the left side of the page (light grey) and a main panel on the right side of the page. The sidebar panel has two tabs (Inputs and Help), whereas the main panel is dedicated to the plot display and the data table. By default, score_value and trial_year are selected as the variable and the group respectively. The user can use the drop down menus to choose the variable, group and plot type. In the current scenario, score_value was fixed for variable selection. Histograms for flowering time in the years 2003 and 2004 were compared. Both years have a bimodal data distribution; [...] data for flowering time were more spread out and higher scored as compared to [...]. Further investigation often reveals the reason for bimodal shapes.


Figure 3.7: CS_DATACOMP - UI output of the case study is displayed. First, the data subset was filtered for flowering time and erucic acid content. descriptor_name was selected as the group for a boxplot. The show points option was unchecked to get coloured boxes without points. The data for erucic acid content were highly negatively skewed, whereas those for flowering time were slightly positively skewed. The filtered data subset corresponding to the selected descriptor_names was used for plotting and displayed in a table. The superimposed circled numbers present the workflow and help the user to understand and reproduce the case study output.

3.3.4 CS_DATAVISAN

Pivot table construction is a data preparation step that makes statistical techniques easier to apply. Data need to be translated into the right format prior to statistical analysis. The PhenDATA table of the CropStoreDB database [URL 1] is in long format and contains only one numeric column, i.e. score_value. CS_DATAVISAN first converts the PhenDATA table into a pivot table according to the user's inputs. The UI displays the pivot table interactively, with multiple data aggregation and plotting options. A statistical summary of the user-populated pivot table is provided in a separate tab of the main panel (Figure 3.8).

Features of CS_DATAVISAN

CS_DATAVISAN enables the user to:
1. Visualise the CropStoreDB [URL 1] data or upload a .csv data file of at most 30 MB.
2. Convert data from long format to wide format interactively. Choose the inputs to populate the pivot table:
Select rows and columns by dragging variables from the given list to the blue coloured panels.
Each column and row variable can be further filtered using the selection or filter option in its drop down menu.
The 'Table' view is set by default and can be replaced by selecting one of the other options in the drop-down menu.
'Count' is set as the default aggregation method and can be replaced by selecting one of the other options in the drop-down menu. score_value is the only numeric column that can be used to fill the table.
Within the pivot table, there are multiple graph options, e.g. heat maps, bar charts, stacked bar charts, area charts, etc.
3. Save the plot.
4. Download the pivot table.
5. Visualise a statistical summary of the pivot table.

Functionalities of CS_DATAVISAN

CS_DATAVISAN offers a GUI with a logical workflow that guides users through selecting the parameters for constructing a pivot table, such as the design factors and/or variates used as rows and columns, the plotting options and the aggregation method. This app is mainly intended for getting data ready for statistical analysis. It allows the user either to use the pivot table data or to upload their own .csv data file, which widens the scope of the application. Pivot table construction is self-explanatory and offers multiple options for aggregating long-format data, with several plotting options within the pivot table. The plots are publication-ready graphics. Summary statistics of the pivot table can be visualised in a table. The Help tab shows brief instructions on the app's features and how to run it properly. In addition, demos are provided in the sidebar panel to assist the user in selecting parameters for pivot table construction, such as slicing and dicing variables, plotting options and aggregation methods. A few other user interactive features are provided (Table 3.10).

Table 3.10: List of controls, save plot and data download features of CS_DATAVISAN.

Reset (refresh icon): Click the refresh icon to reset the Shiny application.
Save plot (mouse click): Right click on the plot to see the copy/save plot options.
Download Pivot Table (button): Download the pivot table as a .csv file, which can be opened in Excel.

CS_DATAVISAN is generic and can be applied to any other crop plant database. A test run of the UI of this application on the BIP platform can be viewed at [URL 42]. More detailed functionalities are explained in the user manual [URL 25], and the source code can be found on GitHub [URL 43].

Case study of CS_DATAVISAN

1. Choose the data input. The user can either use the CropStoreDB [URL 1] (set by default) or upload a .csv data file from the local computer. This case study analyses the default data, which is the PhenDATA table of the CropStoreDB database [URL 1].
2. Choose the parameters to construct the pivot table. Drag and drop species and trial_year as columns and rows respectively (species by trial_year) from the given list of variables to the blue panels. Keep the other options at their defaults.
a. A few plotting options are available within the pivot table. Click the drop down menu in the top right corner of the pivot table to see them, e.g. bar chart, heat map, stacked bar chart, etc. A species by trial year bar chart and stacked bar chart were plotted (Figure 3.9 and Figure 3.10).
b. Count is set as the aggregation method for pivot table construction.
c. Go back to the Table view before moving to the next tab to avoid an error message.
3. Click the Download Pivot Table button in the main panel to download the pivot table as a .csv file, which can be opened in Excel.
4. The next tab, Data summary, shows basic statistics of the pivot table data.

The pivot table, visualisation and analysis UI is displayed in Figure 3.8. The bar graph and stacked bar chart can be seen in the case study displays in Figure 3.9 and Figure 3.10 respectively.
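The pivot step in this case study (species as columns, trial_year as rows, Count as the aggregation method) amounts to a long-to-wide conversion. A minimal Python sketch of that idea follows; the app itself relies on R packages for this, and the function name pivot_count and the sample records are hypothetical.

```python
from collections import Counter


def pivot_count(rows, row_key, col_key):
    """Build a wide table of record counts: one row per row_key level,
    one column per col_key level, zero where a combination is absent."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_levels = sorted({rk for rk, _ in counts})
    col_levels = sorted({ck for _, ck in counts})
    return {
        rk: {ck: counts.get((rk, ck), 0) for ck in col_levels}
        for rk in row_levels
    }


# Hypothetical long-format records (one score per row).
records = [
    {"species": "napus", "trial_year": "2009"},
    {"species": "napus", "trial_year": "2009"},
    {"species": "oleracea", "trial_year": "2004"},
    {"species": "oleracea", "trial_year": "2005"},
    {"species": "napus", "trial_year": "2005"},
]

table = pivot_count(records, row_key="trial_year", col_key="species")
for year, cells in table.items():
    print(year, cells)
```

Each cell holds the number of long-format records for that year and species, which is what the bar chart and stacked bar chart in the case study display.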

Figure 3.8: CS_DATAVISAN UI. The web page shows the sidebar panel on the left (light grey) and the main panel on the right. The sidebar panel has two tabs, Inputs and Help. The main panel also has two tabs, for the pivot table and the data summary. Variables from the given list can be dragged and dropped onto the vertical and horizontal blue panels of the pivot table. The first view of the application shows only the total number of records in the PhenDATA table of the CropStoreDB database [URL 1]. The pivot table can be downloaded as a .csv file by clicking the Download Pivot Table button. The Table option is selected by default and can be replaced by one of several plotting options, such as heat map, bar chart or stacked bar chart. Count has been set as the aggregation method used to populate the pivot table with the score_values; score_value is the only numeric column in the PhenDATA table. The Data summary tab offers a statistical summary of the pivot table. Superimposed circled numbers show the workflow and help the user to understand and reproduce the case study output.

Figure 3.9: CS_DATAVISAN case study output. Within the pivot table, the bar graph of counts vs species by trial_year indicates that napus is the most studied species, particularly in [...]

Figure 3.10: CS_DATAVISAN case study output. Within the pivot table, the stacked bar chart of counts vs species by trial_year indicates that an almost similar amount of work has been done in oleracea each year from 2003 to [...] Overall, napus is the most studied species.

3.4 Analysis tool interface for Shiny apps integration

The conceptual framework of the analysis tool interface for deployment of the apps (Figure 3.11) enables users to interact with the existing CropStoreDB web page [URL 1] and analyse trait phenotypic data that are prepared and stored as a master table (the PhenDATA table) in the CropStoreDB database [URL 1]. The interface design presents five tabs in the user interface, namely Home, CS_PhenEXPLORER, CS_PhenNAVIGATOR, CS_DATACOMP and CS_DATAVISAN, where each of the four app tabs is based on one Shiny application. The main page includes a brief description of the capabilities of the apps (Figure 3.11). Clicking on the CS_PhenEXPLORER, CS_PhenNAVIGATOR, CS_DATACOMP and CS_DATAVISAN tabs opens the user interfaces shown in Figure 3.2, Figure 3.4, Figure 3.6 and Figure 3.8 respectively.

The CS_PhenEXPLORER tab provides an easy and well-guided way to see what is available in the CropStoreDB. The user is then able to explore the available variates and design factors and to filter sub-variates and sub-design factors without any prior knowledge. The CS_PhenNAVIGATOR tab lets users carry out interactive data filtration, visualise the filtered data, and download the filtered data as a .csv file along with additional design factors. The CS_DATACOMP tab helps users to visualise and compare multiple distributions. The CS_DATAVISAN tab allows interactive construction of a pivot table, together with visualisation and analysis.

Figure 3.11: Conceptual framework of the data sharing, visualisation and publishing system after deployment of the Shiny apps. The PhenDATA table is prepared from the trait phenotypic data section of the CropStoreDB database [URL 1] and is used in all apps. The analysis tool interface will be part of the existing CropStoreDB database [URL 1], where each of the tabs CS_PhenEXPLORER, CS_PhenNAVIGATOR, CS_DATACOMP and CS_DATAVISAN is based on a distinct Shiny app.
