KNIME for the life sciences Cambridge Meetup

KNIME for the life sciences
Cambridge Meetup
Greg Landrum, Ph.D.
KNIME.com AG
12 July 2016

- What is KNIME?
- A bit of motivation: tool blending, data blending, documentation, automation, reproducibility
- More about the company and the community
- Some highlights from recent releases
Copyright 2014 KNIME.com AG

The KNIME Analytics Platform

Visual KNIME Workflows
- NODES perform tasks on data
- Each node has inputs, outputs, and a status
- Node status: Not Configured, Idle, Executed, Error
- Nodes are combined to create WORKFLOWS

Over 1000 native and embedded nodes included:
- Data Access: MySQL, Oracle, ...; SAS, SPSS, ...; Excel, Flat, ...; Hive, Impala, ...; XML, JSON, PMML; Text, Doc, Image, ...; Web Crawlers; Industry Specific; Community / 3rd
- Transformation: Row, Column, Matrix; Text, Image; Time Series; Java; Python; Community / 3rd
- Analysis & Mining: Statistics; Data Mining; Machine Learning; Web Analytics; Text Mining; Network Analysis; Social Media Analysis; R, Weka, Python; Community / 3rd
- Visualization: R; JFreeChart; JavaScript; Community / 3rd
- Deployment: via BIRT; PMML; XML, JSON; Databases; Excel, Flat, etc.; Text, Doc, Image; Industry Specific; Community / 3rd

Why is this important?
Real world data analysis:
- Lots of heterogeneous data from multiple sources (data blending)
- Complex questions to ask of the data
- Need to apply multiple tools (tool blending)

The problem is that we don't have one of these for working with our data

The problem is that we don't have one of these for working with our data
- If all of your problems look like this, then this is the perfect tool
- It's amazing how simple you can make complex things if you control the entire process

We tend to need a broader assortment of tools for our data
- If we're lucky they are this well organized
Image: https://www.flickr.com/photos/mtneer_man/5247813293

... but this is a lot more common
Image: https://www.flickr.com/photos/tilde-lifestyle-photography/6906117081/

Why is this important?
Real world data analysis:
- Lots of heterogeneous data from multiple sources (data blending)
- Complex questions to ask of the data
- Need to apply multiple tools (tool blending)
Things that would be great:
- We didn't have to spend half our time converting file formats
- We could figure out later what we did, repeat it, and share that with others
This is where KNIME comes in

KNIME: the company

KNIME
- KNIME.com AG founded in 2008
- Offices in Zurich (HQ), Konstanz, Berlin, and San Francisco
- 20+ employees
- Maintainer of the Open Source KNIME Analytics Platform
  - comprehensive data loading, processing, analysis, modeling platform
  - visual frontend
  - open: to all sorts of data, other tools (R and Python, among others), various user personas
  - 20 open source releases since 2006
- KNIME Commercial Extensions for Collaboration, Productivity, Performance
  - 14 commercial product releases since 2008

Broad Range of KNIME Application Areas & Customers
- Pharma
- Manufacturing
- Health Care
- Advanced Analytics
- Customer Intelligence
- Finance
- Retail

Happy users!
Source: http://r4stats.com/2016/06/06/rexer-data-science-survey-satisfaction-results/

KNIME Analytics Platform: Try it Now!
1. Download from www.knime.com
2. Browse the KNIME Learning Hub at www.knime.com/learning-hub
3. Download your free copy of the KNIME Beginner's Guide from www.knime.com/knimepress (use code: KNIME_Boston2016)
4. Visit us here or at our Forum: www.knime.com/forum

The KNIME Ecosystem

KNIME Software
- KNIME commercial extensions to the platform for collaboration, productivity, performance

KNIME Server

KNIME Big Data Extensions (commercial license required!)
- KNIME Big Data Connectors
  - Package required drivers/libraries for specific HDFS, Hive, Impala access
  - Hive (Big Data Extension)
  - Cloudera Impala (Big Data Extension)
  - Extends the open source database integration
- KNIME Spark Executor
  - Package required drivers to submit Spark jobs
  - Wraps Spark DB manipulations and MLlib modules

Big Data Connectors
- Same mode of operation as the standard KNIME database connectors
- Operations are performed within the database

KNIME Spark Executor
- Based on Spark MLlib
  - Scalable machine learning library
  - Runs on Hadoop
- Algorithms for
  - Classification (decision tree, naïve Bayes, ...)
  - Regression (logistic regression, linear regression, ...)
  - Clustering (k-means)
  - Collaborative filtering (ALS)
  - Dimensionality reduction (SVD, PCA)

Familiar Usage Model
- Usage model and dialogs similar to existing nodes
- No coding required

The KNIME community

Openness and the community
- Very active user community (check the forums)
- >250 people at the 2016 KNIME Summit in Berlin
- The KNIME Analytics Platform is both open source and an open platform.
- Technology partners provide and support nodes for their (usually commercial) software. Some examples: Schrödinger, ChemAxon/InfoCom, CCG, Cresset
- We encourage the community to produce nodes (or sets of nodes) and share them with each other.
- Trusted Community Extensions: community contributions that meet a certain quality level.

Some of the community contributions
- This is the subset more relevant to drug discovery

Highlights of recent additions in KNIME 3.1 and 3.2
Complete lists:
- https://tech.knime.org/whats-new-in-knime-31
- https://tech.knime.org/whats-new-in-knime-32

Streaming
- Default Execution
- Streaming Execution
  - Row-wise
  - Process, pass & forget
  - Faster with less I/O overhead
  - Concurrent execution

Streaming Pros and Cons
- Advantages
  - Less I/O overhead (process, pass & forget)
  - Parallelization
- Disadvantages
  - No intermediate results, no interactive execution
  - Not all nodes can be streamed
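The difference between the two execution modes can be illustrated outside KNIME with a minimal Python sketch. It is not KNIME code: the function names, the row format, and the two toy steps are assumptions made for illustration. Default execution materializes a full table after every step, while the streaming variant pushes each row through all steps and then forgets it.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, float]
Step = Callable[[Row], Row]


def default_execution(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Node by node: every step finishes and stores its full output table."""
    table = list(rows)
    for step in steps:
        table = [step(row) for row in table]  # intermediate table is kept
    return table


def streaming_execution(rows: Iterable[Row], steps: List[Step]) -> Iterator[Row]:
    """Row-wise: a row passes through all steps, is emitted, then forgotten."""
    for row in rows:
        for step in steps:
            row = step(row)
        yield row  # no intermediate table exists at any point


# Two hypothetical row-wise steps.
def add_ratio(row: Row) -> Row:
    return {**row, "ratio": row["a"] / row["b"]}


def round_ratio(row: Row) -> Row:
    return {**row, "ratio": round(row["ratio"], 2)}


rows = [{"a": 1, "b": 3}, {"a": 2, "b": 3}]
steps = [add_ratio, round_ratio]

batch_result = default_execution(rows, steps)
stream_result = list(streaming_execution(iter(rows), steps))
```

Both variants produce the same rows. The sketch only shows the "process, pass & forget" idea; it does not model concurrent execution, and it also makes the listed disadvantage visible: the streaming variant never holds an intermediate table that could be inspected.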

Trees / Forest / Ensembles
- Random forest node (simplification of the tree ensemble node)
- Support of binary splits for nominal attributes
- Missing value handling
- Support of byte vector data (high-dimension count fingerprints)
- Code optimization
  - Runtime
  - Memory

Trees and Tree Ensembles: New nodes
- Gradient Boosting
  - Also based on tree ensembles
  - Boosting: improving an existing model by adding a new model
  - Shallow trees
- Random Forest Distance
  - Distance measure induced by a random forest
  - Based on proximity
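The boosting idea, improving an existing model by adding a new shallow tree, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation behind the KNIME nodes: it uses squared-error loss, one-split trees (stumps) on a single numeric input, and a hypothetical learning rate of 0.5.

```python
from typing import List, Tuple

# A stump is a one-split tree: (threshold, value_left, value_right).
Stump = Tuple[float, float, float]
# A model is (base prediction, learning rate, list of stumps).
Model = Tuple[float, float, List[Stump]]


def fit_stump(xs: List[float], targets: List[float]) -> Stump:
    """Least-squares stump on 1-D inputs (the 'shallow tree' of this sketch)."""
    best_stump, best_error = (xs[0], 0.0, 0.0), float("inf")
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)  # never empty: threshold is in xs
        right_value = sum(right) / len(right) if right else 0.0
        error = sum(
            (t - (left_value if x <= threshold else right_value)) ** 2
            for x, t in zip(xs, targets)
        )
        if error < best_error:
            best_stump, best_error = (threshold, left_value, right_value), error
    return best_stump


def predict_stump(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs: List[float], ys: List[float],
                 n_rounds: int, learning_rate: float = 0.5) -> Model:
    base = sum(ys) / len(ys)  # the initial "existing model": predict the mean
    predictions = [base] * len(xs)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps


def predict(model: Model, x: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model: Model, xs: List[float], ys: List[float]) -> float:
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model_1 = fit_boosting(xs, ys, n_rounds=1)
model_20 = fit_boosting(xs, ys, n_rounds=20)
error_1, error_20 = mse(model_1, xs, ys), mse(model_20, xs, ys)
```

On this toy data the training error after 20 rounds is lower than after one round, because every added stump corrects part of the remaining residual.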

Feature Selection
- Automated help for narrowing down the best set of features for a model
- Supports forward and backward selection
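Forward selection can be sketched as a greedy loop: start with no features and repeatedly add the one that improves the score most, stopping when nothing helps. The sketch below is an illustration only. The feature names, the `GAIN` table, and `toy_score` are hypothetical stand-ins for a real model evaluation such as cross-validated accuracy.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_selection(features: Sequence[str],
                      score: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Greedy forward selection: add the best feature until the score stalls."""
    selected: List[str] = []
    remaining = list(features)
    best_score = score(frozenset())
    while remaining:
        # Score every candidate subset that adds exactly one more feature.
        candidates = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the model
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


# Hypothetical per-feature usefulness, with a small cost per feature so that
# weak features are not worth adding.
GAIN = {"logP": 0.30, "mol_weight": 0.20, "h_donors": 0.05, "row_id": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(GAIN[f] for f in subset) - 0.06 * len(subset)


chosen = forward_selection(list(GAIN), toy_score)
```

Backward selection is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.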

Deeplearning4j KNIME Integration
- Easy network architecture design
- Modular
- Layer-wise design of networks
- Model Import/Export
- Caffe Import
- Beginner friendly
- Import pretrained networks
- Highly configurable
- Supports word2vec and doc2vec

Deeplearning4j KNIME Integration

Active Learning Labs Extension
- Involve the user in constructing the training data set
- Workflow loop to query and label interesting data points
- Use the user-labeled data set on the remaining data
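The query-and-label loop can be sketched as follows. The slides do not say how "interesting" points are chosen, so this sketch assumes uncertainty sampling (ask about the point closest to the current decision boundary), a toy one-dimensional threshold classifier, and a simulated user. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def fit_threshold(labelled: Dict[float, int]) -> float:
    """Toy 1-D classifier: midpoint between the largest 0 and the smallest 1."""
    largest_negative = max(x for x, label in labelled.items() if label == 0)
    smallest_positive = min(x for x, label in labelled.items() if label == 1)
    return (largest_negative + smallest_positive) / 2


def active_learning(pool: List[float], ask_user: Callable[[float], int],
                    n_queries: int) -> Tuple[float, Dict[float, int]]:
    # Seed with the two extremes, assumed to be one example of each class.
    labelled = {min(pool): 0, max(pool): 1}
    for _ in range(n_queries):
        unlabelled = [x for x in pool if x not in labelled]
        if not unlabelled:
            break
        threshold = fit_threshold(labelled)
        # The most "interesting" point is the one the model is least sure about.
        query = min(unlabelled, key=lambda x: abs(x - threshold))
        labelled[query] = ask_user(query)  # the user supplies the label
    return fit_threshold(labelled), labelled


def simulated_user(x: float) -> int:
    """Stands in for a human labeller; the true boundary is at 6.2."""
    return int(x >= 6.2)


pool = [float(i) for i in range(11)]
threshold, labelled = active_learning(pool, simulated_user, n_queries=3)

# Apply the model trained on user-labelled points to the remaining data.
remaining = [x for x in pool if x not in labelled]
predictions = {x: int(x > threshold) for x in remaining}
```

With three queries the user labels only the points near the boundary, and the resulting threshold classifies every remaining point in this toy pool correctly.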

R Integration
- Rewrite of infrastructure
- Significantly faster
- Concurrent execution
- No change of usage model

MongoDB and JSON (I)
- MongoDB is a NoSQL database based on JSON
- Special set of nodes due to lack of a standard SQL interface

MongoDB and JSON (II)
- JSON nodes for working with JSON data
- Similar to the XML nodes
- Use combination of MongoDB and JSON nodes
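As a rough illustration of why JSON data needs its own handling before it fits a table, the following sketch flattens nested JSON documents into rows with dotted column names using only Python's standard `json` module. It is not how the KNIME JSON nodes work internally; the sample documents and the `flatten` helper are assumptions for illustration.

```python
import json
from typing import Any, Dict


def flatten(document: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested objects into flat columns such as 'props.mw'."""
    row: Dict[str, Any] = {}
    for key, value in document.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=path + "."))
        else:
            row[path] = value
    return row


# Hypothetical documents, shaped like what a document store might return.
raw = """
[
  {"_id": 1, "name": "aspirin",  "props": {"mw": 180.16, "rings": 1}},
  {"_id": 2, "name": "caffeine", "props": {"mw": 194.19}}
]
"""

rows = [flatten(document) for document in json.loads(raw)]
# Documents need not share a schema, so collect the union of all columns.
columns = sorted({column for row in rows for column in row})
table = [[row.get(column) for column in columns] for row in rows]
```

The second document has no `rings` field, so its cell is `None`. Such missing fields are typical of schema-free documents and are one reason they do not map directly onto SQL-style tables.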

Semantic Web/Linked Data Integration
- Access and manipulate semantic web resources, e.g. DBpedia
- Execute semantic queries via SPARQL
- Usage model similar to database integration

Other cool stuff
- Workflow coach: suggests next nodes to use

Take homes
- Open platform based on open-source software, backed by a commercial entity providing enterprise extensions and support
- Strong focus on data blending and tool blending
- Active and engaged community
- Great support for life sciences/chemistry from the community

Thanks! Enjoy the other talks.

14-16 September 2016, San Francisco
https://www.knime.org/fall-summit2016