KNIME for the life sciences Cambridge Meetup Greg Landrum, Ph.D. KNIME.com AG 12 July 2016
What is KNIME? A bit of motivation: tool blending, data blending, documentation, automation, reproducibility More about the company and the community Some highlights from recent releases Copyright 2014 KNIME.com AG 2
The KNIME Analytics Platform
Visual KNIME Workflows
- NODES perform tasks on data
- Nodes are combined to create WORKFLOWS
- Each node has inputs, outputs, and a status: Not Configured, Idle, Executed, or Error
Over 1000 native and embedded nodes included:
- Data Access: MySQL, Oracle, ...; SAS, SPSS, ...; Excel, Flat, ...; Hive, Impala, ...; XML, JSON, PMML; Text, Doc, Image, ...; Web Crawlers; Industry Specific; Community / 3rd
- Transformation: Row, Column, Matrix; Text, Image; Time Series; Java; Python; Community / 3rd
- Analysis & Mining: Statistics; Data Mining; Machine Learning; Web Analytics; Text Mining; Network Analysis; Social Media Analysis; R, Weka, Python; Community / 3rd
- Visualization: R; JFreeChart; JavaScript; Community / 3rd
- Deployment: via BIRT; PMML; XML, JSON; Databases; Excel, Flat, etc.; Text, Doc, Image; Industry Specific; Community / 3rd
Why is this important?
Real world data analysis:
- Lots of heterogeneous data from multiple sources (data blending)
- Complex questions to ask of the data
- Need to apply multiple tools (tool blending)
The problem is that we don't have one of these for working with our data
The problem is that we don't have one of these for working with our data
If all of your problems look like this, then this is the perfect tool.
It's amazing how simple you can make complex things if you control the entire process
We tend to need a broader assortment of tools for our data
If we're lucky they are this well organized
https://www.flickr.com/photos/mtneer_man/5247813293
but this is a lot more common
https://www.flickr.com/photos/tilde-lifestyle-photography/6906117081/
Why is this important?
Real world data analysis:
- Lots of heterogeneous data from multiple sources (data blending)
- Complex questions to ask of the data
- Need to apply multiple tools (tool blending)
Things that would be great:
- We didn't have to spend half our time converting file formats
- We could figure out later what we did, repeat it, and share that with others
This is where KNIME comes in
KNIME: the company
KNIME
- KNIME.com AG founded in 2008
  - Offices in Zurich (HQ), Konstanz, Berlin, and San Francisco
  - 20+ employees
- Maintainer of the open source KNIME Analytics Platform
  - Comprehensive data loading, processing, analysis, and modeling platform
  - Visual frontend
  - Open to all sorts of data, other tools (R and Python, among others), and various user personas
  - 20 open source releases since 2006
- KNIME Commercial Extensions for Collaboration, Productivity, Performance
  - 14 commercial product releases since 2008
Broad Range of KNIME Application Areas & Customers
Pharma, Manufacturing, Health Care, Advanced Analytics, Customer Intelligence, Finance, Retail
Happy users!
Source: http://r4stats.com/2016/06/06/rexer-data-science-survey-satisfaction-results/
KNIME Analytics Platform: Try it Now!
1. Download from www.knime.com
2. Browse the KNIME Learning Hub at www.knime.com/learning-hub
3. Download your free copy of the KNIME Beginner's Guide from www.knime.com/knimepress (use code: KNIME_Boston2016)
4. Visit us here or at our Forum: www.knime.com/forum
The KNIME Ecosystem
KNIME Software
KNIME commercial extensions to the platform for collaboration, productivity, performance
KNIME Server
KNIME Big Data Extensions (commercial license required!)
- KNIME Big Data Connectors
  - Package required drivers/libraries for specific HDFS, Hive, Impala access
  - Hive (Big Data Extension)
  - Cloudera Impala (Big Data Extension)
  - Extends the open source database integration
- KNIME Spark Executor
  - Package required drivers to submit Spark jobs
  - Wraps Spark DB manipulations and MLlib modules
Big Data Connectors
- Same mode of operation as the standard KNIME database connectors
- Operations are performed within the database
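To illustrate what "operations are performed within the database" means, here is a minimal sketch. It is not KNIME code: Python's built-in sqlite3 module stands in for a Hive or Impala connection, and the table and column names are made up. Each step only wraps the previous SQL text, so the data stays in the database until the final, small result is fetched.

```python
import sqlite3

# sqlite3 is only a stand-in here for a remote Hive/Impala connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (id INTEGER, series TEXT, pic50 REAL)")
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?, ?)",
    [(1, "A", 6.5), (2, "A", 7.5), (3, "B", 5.0), (4, "B", 8.0)],
)

# Each "node" extends the SQL statement; no rows are transferred yet.
base_query = "SELECT * FROM compounds"
filtered = f"SELECT * FROM ({base_query}) WHERE pic50 >= 6.0"
grouped = (
    "SELECT series, COUNT(*) AS n, AVG(pic50) AS mean_pic50 "
    f"FROM ({filtered}) GROUP BY series ORDER BY series"
)

# Only the aggregated result leaves the database.
result = conn.execute(grouped).fetchall()
print(result)
```

The filtering and aggregation run inside the database engine; the client only ever sees one row per series.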
KNIME Spark Executor
- Based on Spark MLlib
  - Scalable machine learning library
  - Runs on Hadoop
- Algorithms for
  - Classification (decision tree, naïve Bayes, ...)
  - Regression (logistic regression, linear regression, ...)
  - Clustering (k-means)
  - Collaborative filtering (ALS)
  - Dimensionality reduction (SVD, PCA)
Familiar Usage Model
- Usage model and dialogs similar to existing nodes
- No coding required
The KNIME community
Openness and the community
- Very active user community (check the forums)
- >250 people at the 2016 KNIME Summit in Berlin
- The KNIME Analytics Platform is both open source and an open platform.
- Technology partners provide and support nodes for their (usually commercial) software. Some examples: Schrodinger, ChemAxon/InfoCom, CCG, Cresset
- We encourage the community to produce nodes (or sets of nodes) and share them with each other.
- Trusted Community Extensions for community contributions that meet a certain quality level.
Some of the community contributions
This is the subset more relevant to drug discovery
Highlights of recent additions in KNIME 3.1 and 3.2
Complete lists:
https://tech.knime.org/whats-new-in-knime-31
https://tech.knime.org/whats-new-in-knime-32
Streaming
- Default Execution
- Streaming Execution
  - Row-wise
  - Process, pass & forget
  - Faster with less I/O overhead
  - Concurrent execution
Streaming Pros and Cons
- Advantages
  - Less I/O overhead (process, pass & forget)
  - Parallelization
- Disadvantages
  - No intermediate results, no interactive execution
  - Not all nodes can be streamed
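The difference between the two execution modes can be sketched with plain Python generators. This is only an analogy under simple assumptions, not how KNIME implements streaming; the function and column names are invented for illustration. In the default style every step builds its complete output table before the next step starts; in the streaming style each row is processed, passed on, and forgotten.

```python
def read_rows(n):
    """Source step: yields one row at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 1.5}


def filter_rows(rows, threshold):
    """Row filter step: passes matching rows on immediately."""
    for row in rows:
        if row["value"] >= threshold:
            yield row


def add_column(rows):
    """Column step: derives a new value per row without storing the table."""
    for row in rows:
        yield {**row, "squared": row["value"] ** 2}


# Default-style execution: every step materializes its full output,
# so intermediate tables exist and can be inspected.
table = list(read_rows(6))
filtered_table = [r for r in table if r["value"] >= 3.0]
default_result = [{**r, "squared": r["value"] ** 2} for r in filtered_table]

# Streaming-style execution: rows flow through all steps one by one,
# and no intermediate table is ever kept.
streamed_result = list(add_column(filter_rows(read_rows(6), 3.0)))

print(default_result == streamed_result)
```

Both variants produce the same final table. The streamed one never holds `table` or `filtered_table`, which is the source of the reduced I/O overhead and also the reason intermediate results cannot be inspected.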
Trees / Forest / Ensembles
- Random forest node (simplification of the tree ensemble node)
- Support of binary splits for nominal attributes
- Missing value handling
- Support of byte vector data (high-dimension count fingerprints)
- Code optimization
  - Runtime
  - Memory
Trees and Tree Ensembles: New nodes
- Gradient Boosting
  - Also based on tree ensembles
  - Boosting: improving an existing model by adding a new model
  - Shallow trees
- Random Forest Distance
  - Distance measure induced by a random forest
  - Based on proximity
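The boosting idea ("improving an existing model by adding a new model") can be shown in a few lines. The sketch below is a generic least-squares gradient boosting loop with one-split trees (stumps) on made-up 1-D data. It is an assumption-laden illustration and not the implementation behind the KNIME nodes; all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return (base prediction, learning rate, list of stumps)."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # The new model is trained on what the current ensemble gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x * x for x in xs]  # toy target
model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Each stump on its own is a very weak model; the ensemble improves because every round corrects the remaining error of the previous rounds.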
Feature Selection
- Automated help for narrowing down the best set of features for a model
- Supports forward and backward selection
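As a rough sketch of forward selection: start with no features, repeatedly add the single feature that improves the score most, and stop when nothing helps. The scoring function below is a toy stand-in for a cross-validated model score, and the feature names and gains are invented for illustration. Backward selection is the mirror image: start with all features and remove one at a time.

```python
def forward_selection(features, score):
    """Greedy forward selection; `score` maps a set of features to a number."""
    selected = []
    remaining = list(features)
    best_score = score(frozenset())
    while remaining:
        # Try adding each remaining feature to the current selection.
        candidates = [(score(frozenset(selected + [f])), f) for f in remaining]
        candidate_score, candidate = max(candidates)
        if candidate_score <= best_score:
            break  # no single feature improves the score any further
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Toy scoring function (assumption): a fixed gain per feature minus a small
# penalty per feature, standing in for the accuracy of a trained model.
GAIN = {"logP": 0.20, "mol_weight": 0.10, "tpsa": 0.05, "noise": 0.0}


def toy_score(subset):
    return 0.5 + sum(GAIN[f] for f in subset) - 0.02 * len(subset)


selected, best = forward_selection(list(GAIN), toy_score)
print(selected, round(best, 2))
```

In a real workflow the score would come from training and validating the model on each candidate subset, which is why automating the loop is useful.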
Deeplearning4j KNIME Integration
- Easy network architecture design
- Modular
- Layer-wise design of networks
- Model Import/Export
- Caffe Import
- Beginner friendly
- Import pretrained networks
- Highly configurable
- Supports word2vec and doc2vec
Deeplearning4j KNIME Integration
Active Learning Labs Extension
- Involve the user in constructing the training data set
- Workflow loop to query and label interesting data points
- Use the user-labeled data set on the remaining data
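The loop described here can be sketched with uncertainty sampling on toy 1-D data. Everything in the block is an assumption made for illustration: the `oracle` function stands in for the human labeler, and the "model" is just a threshold between the labeled classes. It is not the algorithm used by the KNIME extension.

```python
def oracle(x):
    """Stand-in for the user who labels a queried data point."""
    return x >= 0.62


def fit_threshold(labeled):
    """Toy model: midpoint between the closest negative and positive labels."""
    max_negative = max(x for x, y in labeled.items() if not y)
    min_positive = min(x for x, y in labeled.items() if y)
    return (max_negative + min_positive) / 2


pool = [i / 20 for i in range(21)]   # unlabeled data points 0.0 .. 1.0
labeled = {0.0: False, 1.0: True}    # two initial labels to start from

for _ in range(5):
    threshold = fit_threshold(labeled)
    unlabeled = [x for x in pool if x not in labeled]
    # Query the point the current model is least sure about:
    # the one closest to the decision threshold.
    query = min(unlabeled, key=lambda x: abs(x - threshold))
    labeled[query] = oracle(query)

# Apply the model trained on the user-labeled points to the remaining data.
threshold = fit_threshold(labeled)
predictions = {x: x >= threshold for x in pool if x not in labeled}
print(len(labeled), "labels used for", len(pool), "points")
```

The point of the loop is that the user labels only a handful of well-chosen points instead of the whole pool.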
R Integration
- Rewrite of infrastructure
- Significantly faster
- Concurrent execution
- No change of usage model
MongoDB and JSON (I)
- MongoDB is a NoSQL database that stores JSON-like documents
- Special set of nodes due to lack of a standard SQL interface
MongoDB and JSON (II)
- JSON nodes for working with JSON data
- Similar to the XML nodes
- Use combination of MongoDB and JSON nodes
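One typical task when combining document data with table-based tools is turning nested JSON documents into flat rows. The sketch below uses only Python's standard json module on made-up documents; the field names are hypothetical, and it does not reproduce the behavior of the KNIME JSON nodes.

```python
import json

# Made-up documents, shaped like what a document store might return.
raw = """
[
  {"_id": 1, "name": "compound-1",
   "props": {"mw": 180.2, "assay": {"ic50_nM": 35}}},
  {"_id": 2, "name": "compound-2",
   "props": {"mw": 251.3}, "tags": ["hit", "series-A"]}
]
"""


def flatten(doc, prefix=""):
    """Turn nested objects into dotted column names; keep lists as values."""
    row = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=path + "."))
        else:
            row[path] = value
    return row


rows = [flatten(doc) for doc in json.loads(raw)]

# Documents need not share a schema, so missing cells become None.
columns = sorted({key for row in rows for key in row})
table = [[row.get(column) for column in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

Because the documents do not share a fixed schema, the resulting table has missing values, which is one reason dedicated nodes are more convenient than a generic SQL connector.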
Semantic Web/Linked Data Integration
- Access and manipulate semantic web resources, e.g. DBpedia
- Execute semantic queries via SPARQL
- Usage model similar to database integration
Other cool stuff
- Workflow coach: suggests next nodes to use
Take homes
- Open platform based on open-source software backed by a commercial entity providing enterprise extensions and support
- Strong focus on data blending and tool blending
- Active and engaged community
- Great support for life sciences/chemistry from the community
Thanks! Enjoy the other talks.
14-16 September, 2016 San Francisco https://www.knime.org/fall-summit2016