IBM Data Science Experience (DSX) Partner Application Validation Quick Guide VERSION: 2.0 DATE: Feb 15, 2018 EDITOR: D. Rangarao
Table of Contents
1 Overview of the Application Validation Process
2 Platform Specific Considerations: IBM DSX Family
  2.1 Introduction
  2.2 Architecture
  2.3 Scope
3 Resources for IBM DSX Validation
  3.1 Assistance
  3.2 Platform Access
  3.3 Recommended tests
4 Submit your solution to achieve the Ready for IBM Analytics Badge
5/31/17 2
1 Overview of the Application Validation Process
Ready for IBM Analytics is the IBM technical validation process that helps partners enhance their technology value proposition with IBM products, achieve optimal interoperability, and help ensure client satisfaction. This Application Validation Guide contains essential information about accessing our technology. Once approved, you will be able to download and use the Ready for IBM Analytics mark on collateral and be included in the Business Partner Application Showcase.
Section 3.3 below prescribes a set of basic tests that validate core functionality. Application providers should also execute the additional tests that they currently use for unit and regression testing. We believe that you, the application provider, are the best judge of whether your application is functioning correctly. However, should you need technical assistance from our subject matter experts, we are here to work with you.
2 Platform Specific Considerations: IBM DSX Family
2.1 Introduction
IBM Data Science Experience (DSX) is an interactive, collaborative environment where data scientists can use multiple tools to activate their insights. Data scientists can choose the Jupyter-based notebook interface, with a choice of runtimes including R, Python, and Scala, or use the embedded RStudio. IBM DSX is available in several configurations:
- DSX Desktop (installed on a laptop or desktop)
- DSX Local (installed in a private cloud)
- DSX on the public cloud
The focus of this document is IBM DSX Local.
2.2 Architecture
IBM DSX Local runs on a Kubernetes cluster of servers composed of the following components:
Control Plane (Master)
- Requires three master nodes to manage the entire cluster.
- Uses etcd as a key-value store that persists the cluster state and stores metadata about cluster service deployment and health.
- Uses Prometheus for monitoring and the ELK stack for logging.
Storage
For data stores and storage management:
- Uses GlusterFS for storage management.
- Uses IBM Cloudant DB as the service metadata database.
- Uses Redis as the in-memory database.
- Uses Swift Object Store for user artifacts.
- Uses Elasticsearch DB for logs.
Deciding on a DSX Local configuration
For the best resiliency, set up a minimum of nine nodes: three for the control plane (one master and two for high availability), three for storage (one primary and two for high availability), and three for compute (one primary and two for high availability). Alternatively, you can set up three nodes: one node with control plane, storage, and compute on it, and two extra nodes for high availability.
Figure: Architecture for a minimum of nine nodes
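The sizing guidance above (three nodes per role for full resiliency) can be sketched as a small illustrative check. The node counts and role names come from the text; the helper functions themselves are hypothetical and not part of any DSX tooling:

```python
# Illustrative sketch: validate a planned DSX Local topology against the
# sizing guidance above. Hypothetical helper, not part of DSX itself.

NINE_NODE_PLAN = {
    "control_plane": 3,  # one master + two for high availability
    "storage": 3,        # one primary + two for high availability
    "compute": 3,        # one primary + two for high availability
}

def is_resilient(plan):
    """Return True if every role has at least three nodes (full HA)."""
    return all(count >= 3 for count in plan.values())

def total_nodes(plan):
    """Total number of nodes the plan requires."""
    return sum(plan.values())

print(is_resilient(NINE_NODE_PLAN))  # True
print(total_nodes(NINE_NODE_PLAN))   # 9
```

The minimal three-node setup collapses all three roles onto shared nodes, trading isolation for a smaller footprint.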
Figure: Architecture for a minimum of three nodes
2.3 Scope
This guide is provided for partners who want to validate their application against IBM DSX Local, which is deployed both as an on-premises server and in private cloud configurations.
3 Resources for IBM DSX Validation
3.1 Assistance
If you need assistance with the process, please submit an inquiry to the Analytics Ecosystem team and we will promptly contact you to answer any questions you may have.
IBM ROLE             NAME                         EMAIL                 PHONE
Analytics Ecosystem  IBM Analytics Ecosystem Team -team email address-
DSX specialist       Deepak Rangarao              drangar@us.ibm.com    973-216-8283
3.2 Platform Access
IBM DSX Desktop (Free Beta Version)
IBM DSX Desktop is a free client for data scientists and data engineers. It includes Jupyter-based notebooks to create, import, or run code using one of three runtimes (R, Python, Scala), as well as RStudio, a tool for statistical analysis and machine learning with R.
Link: IBM DSX Desktop
Note: In some circumstances partners might be able to leverage the DSX Desktop version to test their integration touch points while they wait to get access to a DSX Local installation.
IBM DSX Local Community Edition (access to an IBM labs managed instance of IBM DSX)
IBM DSX Local is an on-premises enterprise solution for data scientists and data engineers. It includes Jupyter-based notebooks to create, import, or run code using one of three runtimes (R, Python, Scala), as well as RStudio, a tool for statistical analysis and machine learning with R. Functionally richer than IBM DSX Desktop, it includes additional collaboration capabilities: a Community for sharing analytic and data assets, and Projects with ACLs for access and governance of analytic and data assets.
Link: IBM DSX Local
In some circumstances, it may be possible to make use of a managed instance of IBM DSX running at IBM. Access to this is arranged via Analytics Lab Services. Please contact the technical contact listed in section 3.1 above.
3.3 Recommended tests
Test: Install your application libraries/dependencies in DSX Notebooks (Python)
Steps:
  Execute the following in a Notebook cell to check for existing libraries:
  !pip list --isolated
  Execute the following command to install a package/dependency:
  !pip install --user <package_name>
Expected Outcome:
  You should see a table view of the installed packages and their versions. After the install, you should see a message indicating a successful install of the package.

Test: Install your application JAR files in DSX Notebooks (Scala)
Steps:
  Execute the following in a Notebook cell to install a custom JAR file:
  %AddJar URL_to_jar_file
  JAR files can also be loaded from a public Maven repository using the following command (in this example we include the dependency for org.apache.spark:spark-streaming-kafka_2.10):
  %AddDeps org.apache.spark spark-streaming-kafka_2.10 1.1.0 --transitive
Expected Outcome:
  The JAR file should now be available for use in the Notebook.

Test: Install R packages in RStudio or Notebooks
Steps:
  Execute the following command in a Notebook cell or in RStudio:
  install.packages("URL_TO_PACKAGE")
  or
  install.packages("PACKAGE_NAME")
Expected Outcome:
  The R package should now be ready to use via the following command:
  library(PACKAGE_NAME)
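The Python install test above can be complemented with a programmatic check that the package is actually visible to the notebook kernel. A minimal sketch using only the standard library; the `package_available` helper is illustrative, not part of DSX:

```python
import importlib.util

def package_available(name):
    """Return True if the named top-level package can be imported
    in the current runtime."""
    return importlib.util.find_spec(name) is not None

# After running `!pip install --user <package_name>` in a notebook cell,
# confirm the kernel can see it (example names only):
print(package_available("json"))            # stdlib module, expected True
print(package_available("no_such_pkg_xyz")) # not installed, expected False
```

If a freshly installed package is not found, restarting the notebook kernel usually resolves it, since the running interpreter caches its import paths.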
Test: PixieDust Visualization
PixieDust is an open source Python helper library that works as an add-on to Jupyter notebooks to improve the user experience of working with data. It includes a package manager to install Spark packages inside a Python notebook, and visualization capabilities to render Spark objects in different ways, including tables, charts, and maps.
Note: PixieDust currently works with Spark 1.6 or 2.0 and Python 2.7 or 3.5.
Note: PixieDust currently supports Spark DataFrames, Spark GraphFrames, and pandas DataFrames, with more to come. If you can't wait, write your own today and contribute it back.
Steps:
  Execute the following in a Notebook cell to visualize the data using PixieDust:
  # import the pixiedust display module
  from pixiedust.display import *
  display(<SPARK_OBJECT_NAME>)
Expected Outcome:
  You should see the PixieDust visualization in the Notebook output cell and be able to change the type of visualization depending on the type of data.
4 Submit your solution to achieve the Ready for IBM Analytics Badge
In order to approve your solution validation with IBM's products, we need some information from you. Please go to http://ibm.biz/r4validation, fill in the information, and submit.