Content-based Comparison for Collections Identification


1 Content-based Comparison for Collections Identification
Weijia Xu (1), Ruizhu Huang (1), Maria Esteva (1), Jawon Song (1), Ramona Walls (2)
(1) Texas Advanced Computing Center, University of Texas at Austin
(2) Cyverse.org
IEEE BigData 16, Workshop on Computational Archival Science
Dec 8, Washington D.C.

2 Panorama
- Bio data collections evolve in a redundant, unstable, and distributed research environment
- Data is big, has multiple components at different stages of completion, and may be stored across repositories
- Pre- and post-publication data events are difficult to document
- Auto-archiving is now a ubiquitous practice for research data
- Metadata is not enough to establish data uniqueness

3 Identifier Services (IDS) Research Project
- Automated lifecycle identifiers management
- Use-case driven
- Focus on genomics data
IDS services to:
- Bind dispersed data objects
- Track and represent provenance
- Validate data location and integrity over time
- Aid data identity
Cyberinfrastructure to:
- Deal with large data/metadata
- Deal with evolving data over time
- Deal with big data tasks in a distributed environment
DATA AUTHENTICITY

4 IDS Architecture
Components: Users; IDS Web (identifierservices.org) on a public cloud; Agave Tenant exposing the Agave API (Apps API, Jobs API, Systems API, Files API, Metadata API, ...); HPC System; Data Repositories.
Workflow:
1. User registers repository data, providing access mechanism(s), and selects files from a registered storage system.
2. IDS queues the corresponding Agave apps.
3. The data app pulls data from the repository and computes the analysis on a high performance computing system.
4. IDS updates metadata with the data analysis results.

5 Content-based Comparison Service: Motivation
To identify changes, connections, and differences between datasets:
- Infer issues of provenance
- Establish data identity
- Promote data reuse
How is the uniqueness of the datasets determined?
- Data curators mostly use manual processes
- They rely on metadata:
  - Data description
  - File checksums/fixity

6 Questions
- If two datasets share the same metadata, are they the same dataset?
- If two similar or identical datasets have different metadata, are they two different datasets?
- In either case, how can curators apply globally unique identifiers and corresponding metadata?

7 Content-based Data Comparison
Goals:
- Provide automated methods to verify data identity
- Provide additional information regarding two or more data collections
  - Provenance: documentation of data origin and changes
  - Increase data reuse
Challenges:
- Diverse data formats
  - E.g. fasta vs. fastq vs. SRA formats
  - Different naming conventions/identifier structures
- Performance and scalability
  - Tens of gigabytes on disk
  - Millions of data records per dataset (comparing all pairs of records is not feasible)

8 Algorithm Overview
Workflow (inputs: Collection A, Collection B):
1. Collections analysis: determine the records in each collection. Output: records list in A, records list in B.
2. Records analysis: determine the best pairs of records for comparison. Output: pairs of records from each collection.
3. Records comparison: compare the records for each pair. Output: comparison report.
In practice:
- Determine the composition of the collections to be compared
- Decide record readers: gff, fastq, fasta
- List record pairs for comparison
  - 3 to 40 GB file sizes
  - ~19 million to 165 million records to compare
- Compare all the records from the lists using the Spark framework
- Report results
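The "decide record readers" step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the format check simply looks at the first character of the first non-empty line, FASTQ records are assumed to be unwrapped (exactly four lines each), and a gff reader is left out.

```python
from typing import Callable, Iterator, List, Tuple

Record = Tuple[str, str]  # (ID, sequence)


def read_fasta(lines: List[str]) -> Iterator[Record]:
    """Yield (ID, sequence) pairs; a sequence may span several lines."""
    rec_id = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if rec_id is not None:
                yield rec_id, "".join(chunks)
            parts = line[1:].split()
            rec_id = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if rec_id is not None:
        yield rec_id, "".join(chunks)


def read_fastq(lines: List[str]) -> Iterator[Record]:
    """Yield (ID, sequence) pairs, assuming unwrapped 4-line records."""
    rows = [line.strip() for line in lines if line.strip()]
    for k in range(0, len(rows) - 3, 4):
        header, sequence = rows[k], rows[k + 1]
        parts = header[1:].split()
        yield (parts[0] if parts else ""), sequence


def choose_reader(lines: List[str]) -> Callable[[List[str]], Iterator[Record]]:
    """Pick a reader from the first non-empty line (hypothetical heuristic)."""
    first = next((line for line in lines if line.strip()), "")
    if first.startswith(">"):
        return read_fasta
    if first.startswith("@"):
        return read_fastq
    raise ValueError("unrecognized record format")


fasta_lines = [">r1 sample", "ACGT", "AC", ">r2", "TTTT"]
fastq_lines = ["@r1 extra", "ACGT", "+", "IIII", "@r2", "GG", "+", "II"]
fasta_records = list(choose_reader(fasta_lines)(fasta_lines))
fastq_records = list(choose_reader(fastq_lines)(fastq_lines))
```

Reducing every format to the same (ID, sequence) shape is what lets the later comparison steps ignore where the records came from.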

9 Algorithm
Comparison works as follows:
- Convert each list of records to (ID, value) pairs
- Sort the records by ID
- Start with pointers a and b at the heads of lists A and B, respectively, and compare the records they point to with a score function
  - If score(a, b) <= 0, record the result and move a forward
  - If score(a, b) > 0, record the result and move b forward
- Different distance functions can be used to compare records: Hamming distance, edit distance, prefix distance, etc.
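The pointer walk over two sorted lists can be sketched in plain Python. This is a simplified variant under stated assumptions, not the Spark implementation from the talk: the score function here is a hypothetical three-way comparison of IDs, IDs are assumed unique within each collection, and on a tie both pointers advance (the slide advances only a).

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (ID, value)


def score(id_a: str, id_b: str) -> int:
    """Hypothetical score: -1, 0, or 1 according to ID order."""
    return (id_a > id_b) - (id_a < id_b)


def compare_collections(
    records_a: Iterable[Record], records_b: Iterable[Record]
) -> Dict[str, object]:
    """Walk two ID-sorted record lists once and summarize the overlap."""
    a = sorted(records_a, key=lambda r: r[0])
    b = sorted(records_b, key=lambda r: r[0])
    matched_keys = 0
    identical_values = 0
    only_a: List[str] = []
    only_b: List[str] = []
    i = j = 0
    while i < len(a) and j < len(b):
        s = score(a[i][0], b[j][0])
        if s == 0:
            # Same ID in both collections: check whether the values agree.
            matched_keys += 1
            if a[i][1] == b[j][1]:
                identical_values += 1
            i += 1
            j += 1
        elif s < 0:
            # a's ID sorts first, so it has no partner in B.
            only_a.append(a[i][0])
            i += 1
        else:
            # b's ID sorts first, so it has no partner in A.
            only_b.append(b[j][0])
            j += 1
    # Whatever remains in either list is unmatched.
    only_a.extend(r[0] for r in a[i:])
    only_b.extend(r[0] for r in b[j:])
    return {
        "matched_keys": matched_keys,
        "identical_values": identical_values,
        "only_a": only_a,
        "only_b": only_b,
    }


report = compare_collections(
    [("r2", "ACGT"), ("r1", "AAAA"), ("r4", "TTTT")],
    [("r1", "AAAA"), ("r2", "ACGA"), ("r3", "GGGG")],
)
```

After sorting, each record is visited once, so the walk is linear in the combined number of records rather than quadratic as an all-pairs comparison would be.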

10 Presentation of Results for Decision-making
- Number of records identified and compared
- Number of matched keys (IDs) and/or values (sequences) between collections
- Number of records that exist in only one of the collections (A or B)
- Degree of similarity of records, shown as a histogram of the match score distribution
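The match score histogram can be sketched as a simple binning step. The function names, the choice of ten equal-width bins, and the assumption that scores lie in [0, 1] are illustrative and not taken from the talk.

```python
from typing import List, Sequence


def score_histogram(scores: Sequence[float], bins: int = 10) -> List[int]:
    """Count scores in equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        counts[min(int(s * bins), bins - 1)] += 1
    return counts


def render_histogram(counts: Sequence[int]) -> str:
    """Render the counts as a plain-text bar chart, one bin per line."""
    width = 1.0 / len(counts)
    rows = []
    for k, count in enumerate(counts):
        rows.append(f"{k * width:.1f}-{(k + 1) * width:.1f} {'#' * count} {count}")
    return "\n".join(rows)


histogram = score_histogram([0.0, 0.05, 0.44, 0.5, 1.0])
print(render_histogram(histogram))
```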

11 Case 1
The dataset: Rice genomic variations
Two copies (105 gff files) available at:
- Cyverse Data Commons: direct connection to HPC resource for data analysis
- Dryad: integrates data with the journal publication
At a glance:
- Each repository has a different functionality
- Metadata does not match exactly
- Different formats: compressed file vs. individual files
According to the content-based comparison, both datasets are identical.
The metadata record in Cyverse will be updated to reflect the relationship between the datasets.
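A small sketch of why file-level checksums cannot settle a case like this one, where one copy is packaged as a compressed file: the digest of the compressed bytes differs from the digest of the raw bytes even though the content is the same. The sample bytes below are made up for illustration and are not drawn from the rice dataset.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative content only; not taken from the actual collections.
content = b"##gff-version 3\nchr1\t.\tSNP\t100\t100\t.\t+\t.\tID=v1\n"
packaged = gzip.compress(content)

raw_digest = sha256_hex(content)
packaged_digest = sha256_hex(packaged)
unpacked_digest = sha256_hex(gzip.decompress(packaged))

# The packaged copy has a different checksum, yet its content is identical.
print(raw_digest == packaged_digest, raw_digest == unpacked_digest)
```

A fixity check compares packaging, while a content-based comparison looks at the records inside.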

12 Case 2
- A research group wants to publish a complete dataset resulting from the analysis of five maize lines
- The input data, whole genome bisulfite FASTQ sequencing files, have been published via SRA (Sequence Read Archive)
At a glance:
- The working copy consists of two FASTQ format files
- Researchers think that the working copy is also available from SRA, 20.7 GB (run=srr850328)

13 Results from Case 2
- Set A: 165 million sequences from the working copy
- Set B: 167 million sequences from the data archived with SRA
- From the results, the researchers were able to infer that the working copy had been processed by adaptive trimming (provenance)
- They concluded that the difference between the datasets was not significant; both trimmed and un-trimmed collections could be considered the same work
- The two datasets will have different unique identifiers and clarifications in their metadata, and the identifiers will be related to each other in the metadata

14 Case 3
Two datasets published with almost identical metadata
Datasets are referenced as
At a glance:
- Same dataset organization
- 12 of 14 metadata fields have identical information
- No provenance information or stated relationship between the two datasets

15 Results from Case 3
- Set A: solexa reads of DDEV from 1kp (13 million records)
- Set B: solexa reads of IFCJ from 1kp (18 million records)
- Prefix match score: e.g. for abcde and abdc, the shared prefix is ab (length 2), so the match score is 2*2/(5+4) = 0.44
Figure: Distribution of prefix match score
Researchers could infer that:
- One of the plant samples may be contaminated
- Matches over the beginning 20% of a segment may be due to the use of the same sequencing primer
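The prefix match score in the example above, twice the shared prefix length divided by the sum of the two lengths, can be sketched as follows. The function names are hypothetical, and returning 1.0 for two empty strings is an assumption, since the slide does not cover that case.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Count how many leading characters the two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_match_score(a: str, b: str) -> float:
    """Return 2 * shared prefix length / (len(a) + len(b)), a value in [0, 1]."""
    if not a and not b:
        return 1.0  # assumption: two empty records count as identical
    return 2 * common_prefix_length(a, b) / (len(a) + len(b))


example = round(prefix_match_score("abcde", "abdc"), 2)  # 2*2/(5+4)
```

The score is 1.0 only when the two strings are identical and 0.0 when they differ at the first character.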

16 Performance and Scalability
- All computations were conducted on Wrangler, the data-intensive system at TACC
- The implementation uses the Spark data processing framework
- Data format detection uses the API in the Biopython package
- For comparison, mpiBLAST takes hours instead of minutes
Figure: Execution time in seconds for comparing two sequencing files in use case 3, broken down by stage (data reading, records match, prefix check) and by number of nodes.

17 Conclusions
- Identity and provenance of data have to be managed over time as copies and versions of a dataset are generated, reused, and published
- Metadata alone may not be enough to determine the uniqueness of the data
- Metadata needs to be updated, and relationships between datasets need to be clarified
- Determining data uniqueness requires content-based comparison
- Comparison results help curators understand the reasons for differences and similarities
- More automated, scalable computation services are needed for data curation

18 Thanks & Questions?
Acknowledgement
This work is supported through funding provided by the National Science Foundation from the following projects:
- Evaluating Identifier Services for the Lifecycle of Biological Data (# )
- The iPlant Collaborative: Cyberinfrastructure for the Life Sciences (# , data and use cases)
- Wrangler: A Transformational Data Intensive Resource for the Open Science Community (# , computational resource support)
