Content-based Comparison for Collections Identification
1 Content-based Comparison for Collections Identification
Weijia Xu (1), Ruizhu Huang (1), Maria Esteva (1), Jawon Song (1), Ramona Walls (2)
(1) Texas Advanced Computing Center, University of Texas at Austin
(2) Cyverse.org
IEEE BigData 16, Workshop on Computational Archival Science, Dec 8, Washington D.C.
2 Panorama
- Bio data collections evolve in a redundant, unstable, and distributed research environment
- Data is big, has multiple components at different stages of completion, and may be stored across repositories
- Events before and after data publication are difficult to document
- Auto-archiving is now a ubiquitous practice for research data
- Metadata is not enough to establish data uniqueness
3 Identifier Services (IDS) Research Project
- Automated lifecycle identifier management
- Use-case driven
- Focus on genomics data
IDS services to:
- Bind dispersed data objects
- Track and represent provenance
- Validate data location and integrity over time
- Aid data identity
Cyberinfrastructure to:
- Deal with large data/metadata
- Deal with evolving data over time
- Deal with big data tasks in a distributed environment
DATA AUTHENTICITY
4 IDS Architecture
Components shown in the diagram:
- Users
- Public cloud: IDS Web (identifierservices.org)
- Agave Tenant: Agave API (Apps API, Jobs API, Systems API, Files API, Metadata API, ...)
- HPC System
- Data Repositories
Workflow:
- 1. User registers repository data, providing access mechanism(s). User selects files from a registered storage system.
- 2. IDS queues corresponding Agave apps.
- 3. Data app pulls data from the repository and computes the analysis on a high performance computing system.
- 4. IDS updates metadata with data analysis results.
5 Content-based Comparison Service: Motivation
To identify changes, connections, and differences between datasets:
- Infer issues of provenance
- Establish data identity
- Promote data reuse
How is the uniqueness of the datasets determined?
- Data curators mostly use manual processes
- Rely on metadata
- Data description
- File checksums/fixity
6 Questions
- If two datasets share the same metadata, are they the same dataset?
- If two similar or identical datasets have different metadata, are they two different datasets?
- In either of the cases above, how can curators apply globally unique identifiers and corresponding metadata?
7 Content-based Data Comparison
Goals:
- Provide automated methods to verify data identity
- Provide additional information regarding two or more data collections
- Provenance: documentation of data origin and changes
- Increase data reuse
Challenges:
- Diverse data formats, e.g. FASTA vs. FASTQ vs. SRA formats
- Different naming conventions/identifier structures
- Performance and scalability
- Tens of gigabytes on disk
- Millions of data records per dataset (comparing all pairs of records is not feasible)
8 Algorithm Overview
Pipeline:
- Input: Collection A and Collection B
- Collections analysis: determine the records in each collection (output: records list in A, records list in B)
- Records analysis: determine the best pairs of records for comparison (output: pairs of records from each collection)
- Records comparison: compare the records in each pair (output: comparison report)
Steps:
- Determine the composition of the collections to be compared
- Decide record readers: gff, fastq, fasta
- List record pairs for comparison (3 to 40 GB file sizes; ~19 million to 165 million records to compare)
- Compare all the records from the lists using the Spark framework
- Report results
9 Algorithm
Comparison works as follows:
- Convert each list of records into (ID, value) pairs
- Sort the records by ID
- Start with pointers a and b at the heads of lists A and B respectively, and compare the records they point to with a score function
- If score(a, b) <= 0, record the result and move a forward
- If score(a, b) > 0, record the result and move b forward
Different distance functions can be used to compare records: Hamming distance, edit distance, prefix distance, etc.
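The two-pointer walk described on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' Spark implementation: the names `id_score`, `hamming`, and `compare_collections` are hypothetical, the score function is assumed to compare record IDs only, and IDs are assumed to be unique within collection B.

```python
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, str]  # (ID, value)


def id_score(a: Record, b: Record) -> int:
    """Hypothetical score: negative if a's ID sorts first, 0 if IDs match, positive otherwise."""
    return (a[0] > b[0]) - (a[0] < b[0])


def hamming(x: str, y: str) -> int:
    """Hamming distance; a length difference is counted as extra mismatches (an assumption)."""
    return sum(c1 != c2 for c1, c2 in zip(x, y)) + abs(len(x) - len(y))


def compare_collections(
    recs_a: Iterable[Record],
    recs_b: Iterable[Record],
    score: Callable[[Record, Record], int] = id_score,
    distance: Callable[[str, str], int] = hamming,
) -> Tuple[List[Tuple[str, int]], List[str], List[str]]:
    """Return (matched IDs with value distance, IDs only in A, IDs only in B)."""
    a = sorted(recs_a, key=lambda r: r[0])
    b = sorted(recs_b, key=lambda r: r[0])
    matched: List[Tuple[str, int]] = []
    only_a: List[str] = []
    only_b: List[str] = []
    i = j = 0
    b_matched = False  # whether the record under pointer b has already been paired
    while i < len(a) and j < len(b):
        s = score(a[i], b[j])
        if s <= 0:
            # score <= 0: record the result and move a forward
            if s == 0:
                matched.append((a[i][0], distance(a[i][1], b[j][1])))
                b_matched = True
            else:
                only_a.append(a[i][0])
            i += 1
        else:
            # score > 0: record the result and move b forward
            if not b_matched:
                only_b.append(b[j][0])
            j += 1
            b_matched = False
    # Whatever is left in either list has no partner in the other one.
    only_a.extend(r[0] for r in a[i:])
    rest_b = b[j + 1:] if b_matched else b[j:]
    only_b.extend(r[0] for r in rest_b)
    return matched, only_a, only_b


if __name__ == "__main__":
    A = [("r1", "ACGT"), ("r2", "AAAA"), ("r4", "TTTT")]
    B = [("r2", "AAAT"), ("r3", "CCCC"), ("r4", "TTTT")]
    print(compare_collections(A, B))
```

Because both lists are sorted first, each record is visited once, so the walk avoids comparing every record in A against every record in B. Swapping `distance` for an edit or prefix distance changes only how matched values are scored.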
10 Presentation of Results for Decision-making
- Number of records identified and compared
- Number of matched keys (ID) and/or values (sequence) between collections
- Number of records that exist in only one of the collections (A or B)
- Degree of similarity of records, shown as a histogram of the match score distribution
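As a rough illustration of how such a report could be assembled, the sketch below collects the counts and bins the match scores into a histogram. The function name `summarize`, the dictionary keys, and the assumption that match scores fall in the range 0 to 1 are all hypothetical and are not taken from the slides.

```python
from typing import Dict, List, Sequence


def summarize(
    match_scores: Sequence[float],
    only_a: Sequence[str],
    only_b: Sequence[str],
    bins: int = 5,
) -> Dict[str, object]:
    """Build a small report: record counts plus a histogram of match scores.

    Assumes every score lies in [0, 1]; a score of exactly 1.0 goes in the last bin.
    """
    histogram: List[int] = [0] * bins
    for s in match_scores:
        index = min(int(s * bins), bins - 1)
        histogram[index] += 1
    return {
        "records_compared": len(match_scores),
        "only_in_a": len(only_a),
        "only_in_b": len(only_b),
        "score_histogram": histogram,
    }


if __name__ == "__main__":
    report = summarize([0.0, 0.44, 0.95, 1.0], only_a=["r1"], only_b=["r3", "r7"])
    for key, value in report.items():
        print(key, value)
```

A curator could read the histogram as the share of matched records that are near-identical (last bin) versus only loosely similar (first bins).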
11 Case 1
The dataset: rice genomic variations
Two copies (105 gff files) available at:
- Cyverse Data Commons: direct connection to HPC resource for data analysis
- Dryad: integrates data with journal publication
At a glance:
- Each repository has a different functionality
- Metadata does not match exactly
- Different formats: compressed file vs. individual files
Per content-based comparison, both datasets are identical
The metadata record in Cyverse will be updated to reflect the relationship between the datasets
12 Case 2
- A research group wants to publish a complete dataset resulting from the analysis of five maize lines
- The input data, whole genome bisulfite FASTQ sequencing files, have been published via SRA (Sequence Read Archive)
At a glance:
- The working copy consists of two FASTQ format files
- Researchers think that the working copy is also available from SRA, 20.7 GB ( run=srr850328)
13 Results from Case 2
- Set A: 165 million sequences from the working copy
- Set B: 167 million sequences from the data archived with SRA
- From the results, the researchers were able to infer that the working copy had been processed by adaptive trimming (provenance)
- They concluded that the difference between the datasets was not significant; both trimmed and un-trimmed collections could be considered the same work
- The two datasets will have different unique identifiers and clarifications in their metadata, and the identifiers will be related in the metadata
14 Case 3
Two datasets published with almost identical metadata
Datasets are referenced as
At a glance:
- Same dataset organization
- 12 of 14 metadata fields have identical information
- No provenance information or stated relationship between the two datasets
15 Results from Case 3
- Set A: Solexa reads of DDEV from 1kp (13 million records)
- Set B: Solexa reads of IFCJ from 1kp (18 million records)
- Prefix match score: e.g. for abcde and abdc, match score = 2*2/(5+4) = 0.44
Researchers could infer that:
- One of the plant samples may be contaminated
- The match in the beginning 20% segments may be due to the use of the same sequencing primer
[Figure: distribution of prefix match score]
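The worked example on this slide (abcde vs. abdc, score 2*2/(5+4)) is consistent with a score of twice the common prefix length divided by the sum of the two string lengths. The sketch below encodes that reading; the function names and the handling of two empty strings are assumptions, not details given in the slides.

```python
def common_prefix_len(x: str, y: str) -> int:
    """Length of the longest shared prefix of x and y."""
    n = 0
    for c1, c2 in zip(x, y):
        if c1 != c2:
            break
        n += 1
    return n


def prefix_match_score(x: str, y: str) -> float:
    """2 * (common prefix length) / (len(x) + len(y)), in the range 0 to 1."""
    total = len(x) + len(y)
    if total == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    return 2 * common_prefix_len(x, y) / total


if __name__ == "__main__":
    # Slide example: shared prefix "ab" has length 2, lengths are 5 and 4.
    print(round(prefix_match_score("abcde", "abdc"), 2))  # 0.44
```

Under this definition identical strings score 1.0 and strings that differ at the first character score 0.0, which makes the score suitable for a histogram over a fixed range.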
16 Performance and Scalability
- All computations were conducted on Wrangler, the data-intensive system at TACC
- Implementation uses the Spark data processing framework
- Data format detection uses the API in the biopython package
- mpiblast takes hours instead of minutes
[Figure: execution time in seconds for comparing two sequencing files in use case 3, broken down by data reading, records match, and prefix check, with series by number of nodes]
17 Conclusions
- Identity and provenance of data have to be managed over time as copies and versions of a dataset are generated, reused, and published
- Metadata alone may not be enough to determine the uniqueness of the data
- Metadata needs to be updated and relationships between datasets need to be clarified
- Determining data uniqueness requires content-based comparison
- Comparison results help curators understand the reasons for differences and similarities
- More automated, scalable computation services are needed for data curation
18 Thanks & Questions?
Acknowledgement
This work is supported through funding provided by the National Science Foundation from the following projects:
- Evaluating Identifier Services for the Lifecycle of Biological Data (# )
- The iPlant Collaborative: Cyberinfrastructure for the Life Sciences (# , data and use cases)
- Wrangler: A Transformational Data Intensive Resource for the Open Science Community (# , computational resource support)
More informationPre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory
Pre-processing and quality control of sequence data Barbera van Schaik KEBB - Bioinformatics Laboratory b.d.vanschaik@amc.uva.nl Topic: quality control and prepare data for the interesting stuf Keep Throw
More informationNUIT Tech Talk Topics in Research Computing: XSEDE and Northwestern University Campus Champions
NUIT Tech Talk Topics in Research Computing: XSEDE and Northwestern University Campus Champions Pradeep Sivakumar pradeep-sivakumar@northwestern.edu Contents What is XSEDE? Introduction Who uses XSEDE?
More informationOPENSTACK PRIVATE CLOUD WITH GITHUB
OPENSTACK PRIVATE CLOUD WITH GITHUB Kiran Gurbani 1 Abstract Today, with rapid growth of the cloud computing technology, enterprises and organizations need to build their private cloud for their own specific
More informationHPC: N x Contingency Analysis
Panel: Advanced Grid Modeling, Simulation, and Computing HPC: N x Contingency Analysis Workshop on Building Research Collaborations: Electricity Systems Purdue University, West Lafayette, IN August 28-29,
More information1. mirmod (Version: 0.3)
1. mirmod (Version: 0.3) mirmod is a mirna modification prediction tool. It identifies modified mirnas (5' and 3' non-templated nucleotide addition as well as trimming) using small RNA (srna) sequencing
More informationAnalyzing massive genomics datasets using Databricks Frank Austin Nothaft,
Analyzing massive genomics datasets using Databricks Frank Austin Nothaft, PhD frank.nothaft@databricks.com @fnothaft VISION Accelerate innovation by unifying data science, engineering and business PRODUCT
More informationTara McPherson School of Cinematic Arts USC Los Angeles, CA, USA
Tara McPherson School of Cinematic Arts USC Los Angeles, CA, USA Both scholarship + popular culture have gone online There were about 25,400 active scholarly peer-reviewed journals in early 2009, collectively
More informationXLDB 11 Cloud Computing at Scale. Roger Barga Microsoft Research
XLDB 11 Cloud Computing at Scale Roger Barga Microsoft Research Framing Questions for Presentation(s) Does it make sense for large-scale (many terabytes, petabytes), data-intensive projects to consider
More information