Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data


Virgil Daniels
5 years ago

THE RISE OF BIG DATA

BIG DATA: A REVOLUTION IN ACCESS

Large-scale data sets are nothing new. After all, before the term big data, airline reservation systems tracked millions of flight segments and bookings, and phone companies kept billions of call detail records. But now it is possible for small companies and individuals to access the same massive computational and storage resources using inexpensive commodity hardware and the cloud.

Central to this data-ubiquity story is the open-source distributed computational framework called Apache Hadoop. Created at Yahoo, based on Google's MapReduce and Google File System publications, Hadoop allows large datasets to be stored and parallel-processed by spreading files across a large number of small commodity servers. Hadoop has a large following in both the commercial and open-source software communities. The reduction in the cost of hardware and the linear scalability of Hadoop have resulted in an unprecedented amount of data being stored and analyzed to increase our understanding of the physical world, predict human behavior, and improve performance and security.
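To make the idea of parallel processing over spread-out files concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce pattern in plain Python. It does not use Hadoop itself; the call detail records, the partition layout, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical call detail records (caller, seconds), split into partitions
# the way a distributed file system spreads blocks of a file across servers.
partitions = [
    [("alice", 120), ("bob", 45), ("alice", 30)],
    [("bob", 15), ("carol", 300)],
    [("alice", 60), ("carol", 20)],
]


def map_partition(records):
    """Map step: emit (key, value) pairs; each partition is handled independently."""
    for caller, seconds in records:
        yield caller, seconds


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


# In a real cluster the map calls would run on different machines; here they
# simply run one after another in the same process.
mapped = chain.from_iterable(map_partition(p) for p in partitions)
totals = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

Because each map call touches only its own partition and each reduce call touches only one key, both steps can in principle be distributed across many servers without changing the logic.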

DATA SCIENCE WITH HADOOP

Hadoop is ideally suited for data science due to a number of important capabilities:

- Storing and processing extremely large datasets on inexpensive hardware (that can be scaled up as data volume increases and return on investment is proven)
- Storing data without having to conform it, a priori, to a particular data model
- Handling diverse and rapidly changing data streams
- Job tracking and management tools that break down complex analytic routines into simple map and reduce steps

Hadoop presents a compelling opportunity for any organization that wants to base decisions on insights gained from mining detailed data. It makes petabytes of data available for in-depth analysis across hundreds if not thousands of CPUs while keeping costs under control, either through scale-as-you-go commodity hardware or by leveraging the elasticity of the cloud.

Furthermore, the MapReduce paradigm has become prevalent in machine learning research. Increasingly, researchers are attempting to adapt the sequential nature of learning and convex optimization theories to the parallelization paradigm of MapReduce.

ADVANCED ANALYTICS AT SCALE FOR THE FEW

The caveat, however, is that the user has to possess the expertise to program in one or more of Hadoop's highly technical languages and frameworks, such as MapReduce, Apache Hive, and Apache Pig. Translating the tools and techniques of analytics into these frameworks represents a significant challenge. The result is that only a small number of Internet properties, social media, and ecommerce sites have forayed into using Hadoop for data science, while most organizations are still using it mainly for data transformations and the most basic analytics. To reap the benefits of Hadoop, these early adopters make substantial investments in teams of Java engineers and statisticians, and often contribute heavily to Hadoop-related open-source projects.
While promising, the results often suffer from limitations in performance, ease of use, agility, or flexibility.

SPOTFIRE MAKES DATA SCIENCE ON HADOOP TURN-KEY

TIBCO Spotfire Data Science is the first native Hadoop data science application. It allows experienced and aspiring data scientists to leverage the parallel-processing capabilities of Hadoop using an intuitive web-based, drag-and-drop user interface. Spotfire Data Science eliminates the need to program complex statistical functions such as linear and logistic regressions, k-means clustering, decision trees, scatter plots, and so on. Instead, it allows users to concentrate on data analysis and model development.

Spotfire Data Science handles the entire analytics lifecycle: data exploration, transformations, model building, model validation, and model deployment. Highly accurate predictive and descriptive models can be built with the Spotfire Data Science Workflow Editor in a matter of minutes, since the need to program is eliminated and data is processed where it resides.

The Spotfire Data Science web-based application is designed for rapid, iterative, and collaborative model development. Users can start either with a blank canvas and then rapidly assemble an analytic workflow by dragging and dropping Hadoop files and various operators, or they can extend an existing analytic workflow created by one of their colleagues. Workflows are version controlled, and there are detailed logs available about each run, including the visual results of each operator as well as performance statistics.

Spotfire Data Science has undergone an extensive amount of testing and validation to meet enterprise-level standards of performance and security in the context of the rapidly evolving Hadoop ecosystem of tools and technologies.

A COMPLETE DATA SCIENCE ENVIRONMENT

TRADITIONAL APPROACH TO MODELING

The traditional approach to building and deploying models starts with a sample data extract from one or more databases or Hadoop clusters into a flat-file format. This limited dataset is then used for analysis and training of the model in a scripting-based tool such as R or SAS. The model parameters are then communicated via a specification document to the data engineer, who uses it to create scoring code (in Java, SQL, etc.). Finally, the data is either scored directly in the data warehouse or, if it doesn't all reside in one place, it is scored using flat files. Final results are imported back into the database to drive the behavior of operational applications (for example, to determine the specific offer that should be discussed with a customer the next time she phones into a call center).

AGILE APPROACH

TIBCO's approach is radically different. We have done all the difficult programming so that the user does not have to. The user experience is as easy as drawing a process diagram. The algorithms that provide this powerful capability are uniformly designed with regard to data inputs, outputs, and exception reporting. In addition, all operators are clearly documented so that there is no need to read code to understand how a particular algorithm is going to behave. We also ensure all programming logic is brought to where the data resides, and no data or model information is ever moved between environments.

For practitioners who prefer more code-intensive and notebook-style interfaces, Spotfire Data Science integrates directly with Jupyter Notebooks. Data scientists can create data pipelines in Python and store these as managed analytic assets within the platform so their work is never lost and is always associated with a dedicated project.

In addition to providing a highly scalable and parallel analytics environment, Spotfire Data Science allows users to collaborate more effectively with their business counterparts, from defining the goals of a data science project to operationalizing their results.

STAGES OF DATA SCIENCE

Intuitive and highly visual, Spotfire supports all the major phases of data science. The Spotfire Data Science Workflow Editor provides a rich palette of operators allowing users to quickly create complete workflows that cover the typical progression of a data science project.

CREATING ANALYTIC WORKFLOWS

The process typically starts with the user browsing the files available on the Hadoop Distributed File System (HDFS). The user is then able to drag and drop an icon representing the HDFS file onto an analytic workflow. Spotfire will assist the user in applying structure to the file, whether it is a delimited flat file, JSON, XML, Apache Log, etc. Once structure has been applied, a right-click menu exposes the various exploration operations available, such as summary statistics, frequency analysis, box plots, etc.

To gain a more in-depth understanding of complex datasets, and to identify patterns hidden in the data, the user can run an unsupervised algorithm like k-means clustering. The variable selection operator can help the user find the fields that have the most influence on the quantity being analyzed.

Spotfire provides common transformation operators like row/column filter, aggregations, pivots, etc. However, the user can also directly inject Pig scripts for more complex transformations. The data can then be randomly sampled for model training and validation.

Spotfire supports a comprehensive set of classic model types, including regressions, decision trees, time series, and clustering. With these the user can mine data for new insights: predicting events, segmenting customers, and optimizing campaigns. Once a model has been trained, Spotfire provides a number of tools for evaluating the accuracy of the model and comparing it with others.

DEPLOYING MODELS

Spotfire can export models in industry-standard formats such as PMML and PFA, allowing users to operationalize their results on third-party platforms. Users can import PFA models to score against new data and utilize them in their Spotfire Data Science Workflows. Spotfire also provides a variety of standalone RESTful scoring engines that support PFA, a powerful option for those seeking to operationalize models in an efficient way. Spotfire Data Science also manages and version controls models so work is never lost between teams, and previous versions can be found easily.
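The sample, train, and evaluate progression described under CREATING ANALYTIC WORKFLOWS can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not of how Spotfire Data Science implements its operators; the synthetic data, the 70/30 split, and the function names are all assumptions made for the example.

```python
import math
import random


def train_validation_split(rows, validation_fraction=0.3, seed=0):
    """Randomly sample rows into a training set and a held-out validation set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def fit_line(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(rows, slope, intercept):
    """Root mean squared error of the fitted line on the given rows."""
    squared = sum((y - (slope * x + intercept)) ** 2 for x, y in rows)
    return math.sqrt(squared / len(rows))


# Synthetic data (an assumption for this sketch): y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
data = [(x, 2.0 * x + 1.0 + noise.gauss(0, 0.5)) for x in range(100)]

train, validation = train_validation_split(data)
slope, intercept = fit_line(train)
error = rmse(validation, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} validation RMSE={error:.3f}")
```

Evaluating on rows the model never saw during training is what makes the reported error a fair basis for comparing one model with another.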

CONCLUSION

Within all but a few organizations, the promise of data science on big data has yet to be realized. While platforms such as Hadoop have already demonstrated the power of parallel numerical processing applied to real-world problems, the techniques of data science are largely confined to separate silos of processing, accessible to a few highly trained individuals, and rarely applied to anything but small samples of highly structured data. Nevertheless, early research indicates that most machine learning algorithms can be fully implemented within the Hadoop framework. Spotfire has gone one step further: making those cutting-edge implementations available to non-programmers and aspiring data scientists in a web-based, collaborative application that supports the analytics process from end to end.

... and SQL are stored for future use. The platform also offers an API extension for embedding Spotfire Data Science logic into different applications and processes.

SYSTEM REQUIREMENTS & SELECTED PLATFORMS

WEB REQUIREMENTS:
- Chrome
- Firefox

SERVER REQUIREMENTS:
- Dedicated Server
- Quad Core CPU (Multiple recommended)
- 48GB of RAM or higher recommended
- 500GB Storage (RAID 1 mirroring)

OPERATING SYSTEM:
- RHEL/CentOS

INTEGRATIONS:
- MADlib
- PMML
- Python (Jupyter Notebooks)
- R
- Tableau

SUPPORTED HADOOP DISTRIBUTIONS:
- Cloudera CDH
- Hortonworks

SUPPORTED DATA PLATFORMS AS DATA SOURCES:
- Greenplum Database
- Oracle Database (11g, Exadata)
- PostgreSQL
- SQL Server
- Teradata

SUPPORTED DATA PLATFORMS AS ANALYTICAL SOURCES:
- Cloudera CDH
- Greenplum Database
- Hive
- Hortonworks
- MapR
- Oracle Database (11g, Exadata)
- Pivotal HD
- Pivotal HAWQ
- PostgreSQL

IBM Big Insights, MapR, Pivotal HAWQ

Global Headquarters 3307 Hillview Avenue Palo Alto, CA TEL FAX

TIBCO fuels digital business by enabling better decisions and faster, smarter actions through the TIBCO Connected Intelligence Cloud.
From APIs and systems to devices and people, we interconnect everything, capture data in real time wherever it is, and augment the intelligence of your business through analytical insights. Thousands of customers around the globe rely on us to build compelling experiences, energize operations, and propel innovation. Learn how TIBCO makes digital smarter at , TIBCO Software Inc. All rights reserved. TIBCO, the TIBCO logo, and Spotfire are trademarks or registered trademarks of TIBCO Software Inc. or its subsidiaries in the United States and/or other countries. Apache, Hadoop, Hive, and Pig are trademarks of The Apache Software Foundation in the United States and/or other countries. All other product and company names and marks in this document are the property of their respective owners and mentioned for identification purposes only. 02/13/18
