Graph-Based Genomic Integration Using Spark
|
|
- Coleen Allen
- 6 years ago
- Views:
Transcription
1 Graph-Based Genomic Integration Using Spark David Tester Novartis Institutes for Biomedical Research 1!
2 Reference Genome Database External Transcript Database Internal Expression Measurements Public Variation Database Biological Sample Database Variant Effect Prediction Algorithms Ontology of Sequence Alterations Ontology of Health Conditions Ontology of Anatomical Structures Scope Transcript Cancerous 34 copies Hippocampus Tissue Sample ACTAGATAGCAGCTAGCTAGCGCGCGATATCGCGATAGCGCGATAGCGCTAGAGCTCGTCGCAAAGCTGAGCTCG Insertion: TCAAT Produces Frameshift Event
3 Scope (more) Much, much more over time New types of information (compound interactions, protein interactions,...) Confidence and Error Propagation Need to be extremely flexible Model flexibility (our scientists disagree with each other!) Analytic flexibility (technologies change!)
4 Data Characteristics Moderately sized As graph: currently low trillions of edges As tables: currently 100s of billions of rows But Growing Extremely Fast Driven by logarithmic decrease in sequencing costs Also Extremely Messy 10s-100s of internal and external sources Incentive structure at odds with standardization?
5 More a Framework than a Database Overall Architecture We Input: Input API Integration Layer Output API Raw data and metadata We Add: Markup (Scala DSL) CSVs Data and Metadata with User Markup CSVs Ontologies Graph Representation Table Markup Production Databases Ad-hoc Datamarts Ontologies We Archive Graph representations of all datasets + markup VCFs Etc VCFs Etc Analysis Scripts (Spark) Non-Database Analytic Tools Various We Output: Vertica, SciDB, R API, Dataframes, Others Analytics API Local Validation of Markup, Datasets, and Ontologies Spark Global Validation Analysis- or Endpoint- Specific Validation
6 Integration Mechanics 1. Parse Input API Integration Layer Output API 2. Group into semantic units Data and Metadata with User Markup Ontologies Graph Representation Production Databases 3. Create integration instructions 4. Join to other datasets CSVs CSVs Table Markup Ad-hoc Datamarts 5. Convert to edges 6. Enrich edges (reasoning) VCFs VCFs Analysis Scripts (Spark) Non-Database Analytic Tools Etc Etc Various 7. Assign IDs 8. Publish Local Validation of Markup, Datasets, and Ontologies Global Validation Analysis- or Endpoint- Specific Validation Spark
7 Why Graphs? We really just want binary relationships Example: HasDiagnosis ( ThisTissue, Adenoma ) Flexible representation (more expressive than key/value, less rigid than tables) Well-understood reasoning mechanics (transitive closures, type inference, validation, ) Useful for integration Logically equivalent to a graph with labeled, directed edges For topological and/or network analyses (e.g., shortest paths) GraphX
8 Why not a Graph toolkit? Distributed graph TKs require an efficient partitioning strategy Typically a function from edge or vertex to its partition We partition over semantic units (encoded as subgraphs) Each subgraph contains the edges which describe a semantic unit Has natural mapping back/forth to source files Naturally expressed in Spark, awkward for GraphX But GraphX is just a flatmap away when we need it!
9 Future Heavy interest in not-just-genomic data More reliance on Spark-based analytics Parquet / Spark Dataframes / Spark SQL / Adam Trillion edge graph follows exponential trajectory in data size over next few years
10 Acknowledgements Rob Anderson Hans Bitter Mark Borowsky Sophie Brachat Victor Bucor Rose Brannon Jason Calvert Dennis Cunningham John Damask Timothy Danford Anthony Dibiase Sean Duane Chris Farnham Nick Flower Laurent Gauthier Ajay Gourneni Nabil Hachem Victor Hong Mike Jones Jason Kondracki Andrew Knueven Igor Mendelev Steve Marshall Gregg McAllister Brant Peterson Martin Petracchi Brian Repko Erik Sassaman Mark Schreiber Ruth Seltzer Dave Sexton Anita Stout David Treff Quan Yang Dongmei Zuo Many others
11 We re Hiring Spark Gurus! Novartis Institutes for Biomedical Research - Cambridge, Massachusetts View our open positions at:
Why do we need graph processing?
Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group
More informationA Tutorial on Apache Spark
A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:
More informationDistributed Time Travel for Feature Generation. Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi
Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi Turn on Netflix, and the absolute best content for you would automatically start playing Everything is a
More informationFinding and Exporting Data. BioMart
September 2017 Finding and Exporting Data Not sure what tool to use to find and export data? BioMart is used to retrieve data for complex queries, involving a few or many genes or even complete genomes.
More informationLarge Scale Graph Solutions: Use-cases And Lessons Learnt
Large Scale Graph Solutions: Use-cases And Lessons Learnt Principal Engineer, AI/Cloud Platforms Abstraction Is The Art Euler s Bridges - Seven Bridges of Königsberg G = (V, E); V(id, attr1, attr2,..);
More informationSpark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM
Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts
More informationWHAT S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS
WHAT S NEW IN SPARK 2.0: STRUCTURED STREAMING AND DATASETS Andrew Ray StampedeCon 2016 Silicon Valley Data Science is a boutique consulting firm focused on transforming your business through data science
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationOverview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationThe European Variation Archive
The European Variation Archive Webinar: A database of all types of genomic variation data from all species Hannah McLaren www.ebi.ac.uk/eva eva-helpdesk@ebi.ac.uk Learning objectives Establish the key
More informationImproved VariantSpark breaks the curse of dimensionality for machine learning on genomic data
Shiratani Unsui forest by Σ64 Improved VariantSpark breaks the curse of dimensionality for machine learning on genomic data Oscar J. Luo Health Data Analytics 12 th October 2016 HEALTH & BIOSECURITY Transformational
More informationData Transformation and Migration in Polystores
Data Transformation and Migration in Polystores Adam Dziedzic, Aaron Elmore & Michael Stonebraker September 15th, 2016 Agenda Data Migration for Polystores: What & Why? How? Acceleration of physical data
More informationAnalyzing massive genomics datasets using Databricks Frank Austin Nothaft,
Analyzing massive genomics datasets using Databricks Frank Austin Nothaft, PhD frank.nothaft@databricks.com @fnothaft VISION Accelerate innovation by unifying data science, engineering and business PRODUCT
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationMothra: A Large-Scale Data Processing Platform for Network Security Analysis
Mothra: A Large-Scale Data Processing Platform for Network Security Analysis Tony Cebzanov Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 REV-03.18.2016.0 1 Agenda Introduction
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationTechnical Sheet NITRODB Time-Series Database
Technical Sheet NITRODB Time-Series Database 10X Performance, 1/10th the Cost INTRODUCTION "#$#!%&''$!! NITRODB is an Apache Spark Based Time Series Database built to store and analyze 100s of terabytes
More informationAutomated Debugging In Data Intensive Scalable Computing Systems
Automated Debugging In Data Intensive Scalable Computing Systems Muhammad Ali Gulzar 1, Matteo Interlandi 3, Xueyuan Han 2, Mingda Li 1, Tyson Condie 1, and Miryung Kim 1 1 University of California, Los
More informationAnalyzing Flight Data
IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo
More informationData-Transformation on historical data using the RDF Data Cube Vocabulary
Data-Transformation on historical data using the RD Data Cube Vocabulary Sebastian Bayerl, Michael Granitzer Department of Media Computer Science University of Passau SWIB15 Semantic Web in Libraries 22.10.2015
More informationGraph Data Management Systems in New Applications Domains. Mikko Halin
Graph Data Management Systems in New Applications Domains Mikko Halin Introduction Presentation is based on two papers Graph Data Management Systems for New Application Domains - Philippe Cudré-Mauroux,
More informationCompile-Time Code Generation for Embedded Data-Intensive Query Languages
Compile-Time Code Generation for Embedded Data-Intensive Query Languages Leonidas Fegaras University of Texas at Arlington http://lambda.uta.edu/ Outline Emerging DISC (Data-Intensive Scalable Computing)
More informationBioJava. What is BioJava? An Open-Source Java Library for Bioinformatics
BioJava An Open-Source Java Library for Bioinformatics (taken from M. Pocock, BioJava Consulting LTD, presentation) What is BioJava? Java code (Java2 required 1.2 and higher) Open-Source Bioinformatics
More informationContext-Aware Systems. Michael Maynord Feb. 24, 2014
Context-Aware Systems Michael Maynord Feb. 24, 2014 The precise definition of 'context' is contentious. Here we will be using 'context' as any information that can be used to characterize the situation
More informationAutomating the Collection and Processing of Cancer Mutation Data
Automating the Collection and Processing of Cancer Mutation Data Master s Project Report Jeremy Watson Advisor: Ben Raphael 1. Introduction 1 The Raphael lab has developed multiple software packages, such
More informationHigher level data processing in Apache Spark
Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL
More informationA Mathematical Model For Treatment Selection Literature
A Mathematical Model For Treatment Selection Literature G. Duncan and W. W. Koczkodaj Abstract Business Intelligence tools and techniques, when applied to a data store of bibliographical references, can
More informationPowering Knowledge Discovery. Insights from big data with Linguamatics I2E
Powering Knowledge Discovery Insights from big data with Linguamatics I2E Gain actionable insights from unstructured data The world now generates an overwhelming amount of data, most of it written in natural
More informationOpen Data Standards for Administrative Data Processing
University of Pennsylvania ScholarlyCommons 2018 ADRF Network Research Conference Presentations ADRF Network Research Conference Presentations 11-2018 Open Data Standards for Administrative Data Processing
More informationSpecialist ICT Learning
Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.
More informationAsanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks
Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data
More informationFrom Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson
From Think Like a Vertex to Think Like a Graph Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson Large Scale Graph Processing Graph data is everywhere and growing
More informationExtreme-scale Graph Analysis on Blue Waters
Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State
More informationOn enhancing variation detection through pan-genome indexing
Standard approach...t......t......t......acgatgctagtgcatgt......t......t......t... reference genome Variation graph reference SNP: A->T...ACGATGCTTGTGCATGT donor genome Can we boost variation detection
More informationGraph Analytics in the Big Data Era
Graph Analytics in the Big Data Era Yongming Luo, dr. George H.L. Fletcher Web Engineering Group What is really hot? 19-11-2013 PAGE 1 An old/new data model graph data Model entities and relations between
More informationmodern database systems lecture 10 : large-scale graph processing
modern database systems lecture 1 : large-scale graph processing Aristides Gionis spring 18 timeline today : homework is due march 6 : homework out april 5, 9-1 : final exam april : homework due graphs
More informationJure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah
Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks
More informationAbout Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals
More informationSpatial Analytics Built for Big Data Platforms
Spatial Analytics Built for Big Platforms Roberto Infante Software Development Manager, Spatial and Graph 1 Copyright 2011, Oracle and/or its affiliates. All rights Global Digital Growth The Internet of
More information2017 GridGain Systems, Inc. In-Memory Performance Durability of Disk
In-Memory Performance Durability of Disk Meeting the Challenges of Fast Data in Healthcare with In-Memory Technologies Akmal Chaudhri Technology Evangelist GridGain Agenda Introduction Fast Data in Healthcare
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationApache Spark 2.0. Matei
Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationIBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store
IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data IBM Db2 Event Store Disclaimer The information contained in this presentation is provided for informational purposes only.
More informationData Curation Profile Human Genomics
Data Curation Profile Human Genomics Profile Author Profile Author Institution Name Contact J. Carlson N. Brown Purdue University J. Carlson, jrcarlso@purdue.edu Date of Creation October 27, 2009 Date
More informationApache Spark Graph Performance with Memory1. February Page 1 of 13
Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing
More informationPROJECT PROPOSALS: COMMUNITY DETECTION AND ENTITY RESOLUTION. Donatella Firmani
PROJECT PROPOSALS: COMMUNITY DETECTION AND ENTITY RESOLUTION Donatella Firmani donatella.firmani@uniroma3.it PROJECT 1: COMMUNITY DETECTION What is Community Detection? What Social Network Analysis is?
More informationOracle Big Data Connectors
Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process
More informationIntroduction to Data Science
UNIT I INTRODUCTION TO DATA SCIENCE Syllabus Introduction of Data Science Basic Data Analytics using R R Graphical User Interfaces Data Import and Export Attribute and Data Types Descriptive Statistics
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationLocality-sensitive hashing and biological network alignment
Locality-sensitive hashing and biological network alignment Laura LeGault - University of Wisconsin, Madison 12 May 2008 Abstract Large biological networks contain much information about the functionality
More informationAnalytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation
Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable
More informationApache Spark Tutorial
Apache Spark Tutorial Reynold Xin @rxin BOSS workshop at VLDB 2017 Apache Spark The most popular and de-facto framework for big data (science) APIs in SQL, R, Python, Scala, Java Support for SQL, ETL,
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years
More informationRStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine
RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine Guoqing Harry Xu Kai Wang, Zhiqiang Zuo, John Thorpe, Tien Quang Nguyen, UCLA Nanjing University Facebook
More informationDRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS
DRYAD: DISTRIBUTED DATA- PARALLEL PROGRAMS FROM SEQUENTIAL BUILDING BLOCKS Authors: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly Presenter: Zelin Dai WHAT IS DRYAD Combines computational
More informationGreenplum-Spark Connector Examples Documentation. kong-yew,chan
Greenplum-Spark Connector Examples Documentation kong-yew,chan Dec 10, 2018 Contents 1 Overview 1 1.1 Pivotal Greenplum............................................ 1 1.2 Pivotal Greenplum-Spark Connector...................................
More informationLecture 20: The Spatial Semantic Hierarchy. What is a Map?
Lecture 20: The Spatial Semantic Hierarchy CS 344R/393R: Robotics Benjamin Kuipers What is a Map? A map is a model of an environment that helps an agent plan and take action. A topological map is useful
More informationApproaching the Petabyte Analytic Database: What I learned
Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may
More informationThe Seven Steps to Implement DataOps
The Seven Steps to Implement Ops ABSTRACT analytics teams challenged by inflexibility and poor quality have found that Ops can address these and many other obstacles. Ops includes tools and process improvements
More informationBig Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity Hardware iminds Ghent University Dieter De Witte, Laurens De Vocht, Ruben Verborgh, Erik Mannens, Rik Van de Walle Ontoforce Kenny Knecht, Filip Pattyn,
More informationGeospatial Analytics at Scale
Geospatial Analytics at Scale Milos Milovanovic CoFounder & Data Engineer milos@thingsolver.com Things Solver ENLIGHTEN YOUR DATA What we do? Advanced Analytics company based in Belgrade, Serbia. We are
More informationPaperclip graphs. Steve Butler Erik D. Demaine Martin L. Demaine Ron Graham Adam Hesterberg Jason Ku Jayson Lynch Tadashi Tokieda
Paperclip graphs Steve Butler Erik D. Demaine Martin L. Demaine Ron Graham Adam Hesterberg Jason Ku Jayson Lynch Tadashi Tokieda Abstract By taking a strip of paper and forming an S curve with paperclips
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationBasics of Data Management
Basics of Data Management Chaitan Baru 2 2 Objectives of this Module Introduce concepts and technologies for managing structured, semistructured, unstructured data Obtain a grounding in traditional data
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 5: Analyzing Relational Data (1/3) February 8, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationCONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM
CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications
More informationIBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics
IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that
More informationSix Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta
Six Core Data Wrangling Activities An introductory guide to data wrangling with Trifacta Today s Data Driven Culture Are you inundated with data? Today, most organizations are collecting as much data in
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationThe Evolution of a Data Project
The Evolution of a Data Project The Evolution of a Data Project Python script The Evolution of a Data Project Python script SQL on live DB The Evolution of a Data Project Python script SQL on live DB SQL
More informationGraph Data Management
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of
More informationExtreme-scale Graph Analysis on Blue Waters
Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State
More informationGalaxy. Daniel Blankenberg The Galaxy Team
Galaxy Daniel Blankenberg The Galaxy Team http://galaxyproject.org Overview What is Galaxy? What you can do in Galaxy analysis interface, tools and datasources data libraries workflows visualization sharing
More informationThe Human Variant Database
The Human Variant Database Mya Warren Michael Smith Genome Sciences Centre Vancouver BC Bioinforma=cs is Big Data Human genome has 3 billion nucleo=de bases 60 thousand genes 10-20 thousand proteins Bioinforma=cs
More informationSparkling Water. August 2015: First Edition
Sparkling Water Michal Malohlava Alex Tellez Jessica Lanford http://h2o.gitbooks.io/sparkling-water-and-h2o/ August 2015: First Edition Sparkling Water by Michal Malohlava, Alex Tellez & Jessica Lanford
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationStream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
More informationNow That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?
Paper SAS 1866-2015 Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables? Steven Sober, SAS Institute Inc. ABSTRACT Well, Hadoop community, now that you have your data
More informationStream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
Data Ingestion ETL, Distcp, Kafka, OpenRefine, Query & Exploration SQL, Search, Cypher, Stream Processing Platforms Storm, Spark,.. Batch Processing Platforms MapReduce, SparkSQL, BigQuery, Hive, Cypher,...
More informationIntegration with popular Big Data Frameworks in Statistica and Statistica Enterprise Server Solutions Statistica White Paper
and Statistica Enterprise Server Solutions Statistica White Paper Siva Ramalingam Thomas Hill TIBCO Statistica Table of Contents Introduction...2 Spark Support in Statistica...3 Requirements...3 Statistica
More informationHymenopteraMine Documentation
HymenopteraMine Documentation Release 1.0 Aditi Tayal, Deepak Unni, Colin Diesh, Chris Elsik, Darren Hagen Apr 06, 2017 Contents 1 Welcome to HymenopteraMine 3 1.1 Overview of HymenopteraMine.....................................
More informationThe Future of Real-Time in Spark
The Future of Real-Time in Spark Reynold Xin @rxin Spark Summit, New York, Feb 18, 2016 Why Real-Time? Making decisions faster is valuable. Preventing credit card fraud Monitoring industrial machinery
More informationExam Questions
Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure
More informationValue of Data Transformation. Sean Kandel, Co-Founder and CTO of Trifacta
Value of Data Transformation Sean Kandel, Co-Founder and CTO of Trifacta Organizations today generate and collect an unprecedented volume and variety of data. At the same time, the adoption of data-driven
More informationMicrosoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo
Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that
More informationApache Spark and Scala Certification Training
About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over
More informationMachine Learning & Google Big Query. Data collection and exploration notes from the field
Machine Learning & Google Big Query Data collection and exploration notes from the field Limited to support of Machine Learning (ML) tasks Review tasks common to ML use cases Data Exploration Text Classification
More informationAnalyzing Big Data with Microsoft R
Analyzing Big Data with Microsoft R 20773; 3 days, Instructor-led Course Description The main purpose of the course is to give students the ability to use Microsoft R Server to create and run an analysis
More informationNOMAD Metadata for all
EMMC Workshop on Interoperability NOMAD Metadata for all Cambridge, 8 Nov 2017 Fawzi Mohamed FHI Berlin NOMAD Center of excellence goals 200,000 materials known to exist basic properties for very few highly
More informationEncoding and Designing for the Swift Poems Project
Encoding and Designing for the Swift Poems Project Jonathan Swift and the Text Encoding Initiative James R. Griffin III Digital Library Developer Lafayette College Libraries Introductions James Woolley
More informationTH IRD EDITION. Python Cookbook. David Beazley and Brian K. Jones. O'REILLY. Beijing Cambridge Farnham Köln Sebastopol Tokyo
TH IRD EDITION Python Cookbook David Beazley and Brian K. Jones O'REILLY. Beijing Cambridge Farnham Köln Sebastopol Tokyo Table of Contents Preface xi 1. Data Structures and Algorithms 1 1.1. Unpacking
More informationRECORD LINKAGE, A REAL USE CASE WITH SPARK ML. Alexis Seigneurin
RECORD LINKAGE, A REAL USE CASE WITH SPARK ML Alexis Seigneurin Who I am Software engineer for 15 years Consultant at Ippon USA, previously at Ippon France Favorite subjects: Spark, Machine Learning, Cassandra
More informationNCI Thesaurus, managing towards an ontology
NCI Thesaurus, managing towards an ontology CENDI/NKOS Workshop October 22, 2009 Gilberto Fragoso Outline Background on EVS The NCI Thesaurus BiomedGT Editing Plug-in for Protege Semantic Media Wiki supports
More informationUsing ESML in a Semantic Web Approach for Improved Earth Science Data Usability
Using in a Semantic Web Approach for Improved Earth Science Data Usability Rahul Ramachandran, Helen Conover, Sunil Movva and Sara Graves Information Technology and Systems Center University of Alabama
More informationAcyclic Network. Tree Based Clustering. Tree Decomposition Methods
Summary s Join Tree Importance of s Solving Topological structure defines key features for a wide class of problems CSP: Inference in acyclic network is extremely efficient (polynomial) Idea: remove cycles
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationSocial Network Analytics on Cray Urika-XA
Social Network Analytics on Cray Urika-XA Mike Hinchey, mhinchey@cray.com Technical Solutions Architect Cray Inc, Analytics Products Group April, 2015 Agenda 1. Introduce platform Urika-XA 2. Technology
More informationOracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data
Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous
More information