Paper SAS1952-2015

SAS Visual Analytics Environment Stood Up? Check! Data Automatically Loaded and Refreshed? Not Quite

Jason Shoffner, SAS Institute Inc., Cary, NC

ABSTRACT

Once you have a SAS Visual Analytics environment up and running, the next important piece of the puzzle is to keep your users happy by keeping their data loaded and refreshed on a consistent basis. Loading data from the SAS Visual Analytics user interface is a great first step and well suited to ad hoc data exploration. But automating this data load so that users can focus on exploring the data and creating reports is where the power of SAS Visual Analytics comes into play. By using tried-and-true SAS Data Integration Studio techniques (both out-of-the-box and custom transformations), you can easily make this happen. Proven techniques such as sweeping from a source library and stacking similar HDFS tables into SAS LASR Analytic Server for consumption by SAS Visual Analytics are presented using SAS Visual Analytics and SAS Data Integration Studio.

INTRODUCTION

SAS Visual Analytics has powerful integrated applications and processes for loading data into SAS LASR Analytic Server so that the data is consumable through reports and explorations. These include SAS Visual Analytics Data Builder, the import data wizard used when selecting a data source from SAS Visual Analytics Designer or SAS Visual Analytics Explorer, and the autoloader. All of these are available out of the box with SAS Visual Analytics. However, in many cases they do not meet the needs of an enterprise-level SAS Visual Analytics environment in terms of flexibility and insulation between different groups of users.

This paper introduces methods used to implement additional data-loading concepts that create a more robust batch environment for keeping data loaded and refreshed within SAS Visual Analytics. The sweeper concept enables you to load all (or a subset) of the data sets, tables, or views from a source library into memory. After these objects have been loaded to memory, they are registered in metadata so that they can be consumed by reports and explorations in SAS Visual Analytics. The stacker, on the other hand, enables you to load a subset of similar source data sets, tables, or views into a single table in memory.

Several transformations make up these data-load techniques. They can be used in tandem to gain all the functionality, or independently for a more simplified approach. Figure 1 is a screenshot from the SAS Data Integration Studio Transformations tab. It shows the four resulting transformations used to implement the expanded data-loading techniques.

Figure 1: Transformations Tab from SAS Data Integration Studio
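All of these mechanisms ultimately submit ordinary SAS code against the in-memory server. As a point of reference, here is a minimal hand-coded sketch of the kind of single-table load that the techniques in this paper automate; the host name, port, tag, library paths, and table name are placeholders rather than values from the paper:

   /* Source library on disk and an in-memory library that uses the      */
   /* SASIOLA engine to reach the SAS LASR Analytic Server.              */
   libname srclib "/data/source";
   libname valasr sasiola host="lasr.example.com" port=10010 tag="hps";

   /* Copy one table into memory so that reports and explorations in     */
   /* SAS Visual Analytics can consume it.                               */
   data valasr.sales_summary;
      set srclib.sales_summary;
   run;

Repeating this for every table, library, and refresh cycle is exactly the repetitive work that the autoloader and the sweeper and stacker transformations are designed to take over.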

REVIEW OF THE AUTOLOADER AND ITS COMPARISON TO THE SWEEPER TRANSFORMS

The autoloader, introduced in SAS Visual Analytics 6.2, is powerful. Based on permission or role settings, it enables users to save files (CSV, XLS, XLSX, and SAS data sets) to a specific folder so that the data is loaded into SAS LASR Analytic Server automatically by a background process. Depending on the location of these files, the autoloader loads, appends, or removes the data to or from SAS LASR Analytic Server. A system administrator is required to configure this setup before it can be used, and customization is limited. Using a SAS Data Integration Studio transformation, the sweepers also allow automatic loading from source data locations, but they greatly extend the functionality to handle additional data types and to support independent use among client groups with no administrator intervention.

The autoloader and the sweeper concept have two main differences:

1. The HDFS Sweeper enables you to store larger data sets in HDFS. This is beneficial because the refresh window for a LASR table can be much shorter when loading from HDFS than when loading a large table from source. The LASR Sweeper automatically checks whether a version of a source table is preloaded into HDFS and loads from there when that is the case.

2. The sweeper concept enables you to pull from any source that can be referenced as a SAS library by using a LIBNAME statement. Some examples include Oracle, Microsoft SQL Server, SAS/SHARE, and MySQL.

Table 1 shows a full comparison between the two methods, row by row.

File types that can be loaded
   Autoloader (SAS Visual Analytics 7.1): delimited files (CSV); spreadsheets (XLS, XLSX); SAS data sets.
   Sweeper: delimited files (CSV) and spreadsheets (XLS, XLSX), both of which require the PRE Sweeper; SAS data sets; SAS data sets on SAS/SHARE; Oracle tables or views; SQL Server tables or views; any LIBNAME-compatible source.

Scheduler
   Autoloader: cron; Windows Scheduler.
   Sweeper: cron; Windows Scheduler; Process Manager Flow Manager (LSF).

Can append to existing LASR tables
   Autoloader: Files must be placed in the Append folder.
   Sweeper: See the HDFS Stacker later in this paper for an append method.

Can unload
   Autoloader: Files must be manually moved to the Unload folder.
   Sweeper: Once a source is removed, it can automatically unload from the LASR server (and HDFS) if that option is selected.

Loads to and from HDFS
   Autoloader: No.
   Sweeper: The HDFS Sweeper loads to HDFS based on a size threshold. The LASR Sweeper loads from HDFS when a corresponding SASHDAT file exists.

Uses smart loading (loads only when data has changed)
   Autoloader: Based on file attributes (modified date, file size).
   Sweeper: Based on file attributes (modified date). When a table's modified date is not available (for example, from RDBMS sources), it loads only when the record count changes (if that option is set).

Setup required
   Autoloader: Out of the box there is one autoload folder and one LASR server. To set up separate subfolders and LASR servers, a system administrator needs to be heavily involved.
   Sweeper: Metadata-driven, based on Source, HDFS, and LASR libraries. After library metadata is created, the transformations work with any Source, LASR, and HDFS library.

Size limitations
   Autoloader: Approximately 4 GB.
   Sweeper: No.

Integrates with SAS Data Integration Studio
   Autoloader: No.
   Sweeper: Yes. The custom transformations and SAS macros can be easily customized for your environment.

Table 1: Comparison between the Autoloader and the Sweeper
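The "any LIBNAME-compatible source" row is where the sweeper pulls ahead in practice. For illustration only (the librefs, connection values, and credentials below are placeholders, and the examples assume that SAS/ACCESS Interface to Oracle and SAS/SHARE are licensed), either of the following libraries could serve as a sweeper source once it is registered in metadata:

   /* An Oracle schema exposed through SAS/ACCESS Interface to Oracle:   */
   /* its tables and views become sweepable sources.                     */
   libname orasrc oracle user=va_batch password="XXXXXXXX"
           path=orclprod schema=sales;

   /* SAS data sets served by a remote SAS/SHARE server.                 */
   libname shrsrc "/data/shared" server=shrhost.share1;

Once such a library exists in metadata, the sweepers can treat its members the same way they treat Base SAS data sets.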

WHAT IS THIS SWEEPER PROCESS ANYWAY?

Now that you have seen a comparison between the autoloader and the sweeper, you probably want to see the sweeper in more detail. This paper assumes that you have a co-located HDFS environment in conjunction with the SAS LASR Analytic Server environment that is used by SAS Visual Analytics. If that is not the case, the LASR Sweeper still works without the HDFS Sweeper.

The two main sweeper transformations, LASR and HDFS, are very similar. (These transformations are explained in the following sections.) This consistency in design is intentional, to promote ease of use and understanding between the two transformations. Figure 2 shows how the sweepers work independently or in combination:

- The PRE Sweeper (optional) converts objects that are referenced with a FILENAME statement (CSV, XLS, or XLSX files) into SAS data sets so that they can be referenced by a LIBNAME statement.
- The HDFS Sweeper loads all tables from the source library (LIBNAME accessible) into the HDFS library that meet the size criteria set in the transformation.
- The LASR Sweeper loads all tables from the source library into the LASR library based on a few filtering parameters that are set in the transformation. It first checks whether a copy of the source table is stored in HDFS; when one exists, it loads from HDFS instead of from the source.

Figure 2: Diagram of How the LASR Sweeper and the HDFS Sweeper Work Together
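The packaged transformations and macros are available for download (see the appendix), but the heart of a sweep is ordinary SAS code. The simplified sketch below, which uses hypothetical library names and a hypothetical metadata library and folder rather than anything shipped with the paper, shows the pattern a single LASR Sweeper pass follows: enumerate the members of the source library, copy each one into the LASR library, and register the in-memory tables in metadata so that SAS Visual Analytics can see them.

   libname srclib "/data/source";
   libname valasr sasiola host="lasr.example.com" port=10010 tag="hps";

   /* 1. Enumerate the data sets and views in the source library.        */
   proc sql noprint;
      select memname into :tablist separated by ' '
         from dictionary.tables
         where libname = 'SRCLIB' and memtype in ('DATA','VIEW');
      %let tabcount = &sqlobs;
   quit;

   /* 2. Load each member into the SAS LASR Analytic Server.             */
   %macro sweep_to_lasr;
      %do i = 1 %to &tabcount;
         %let tab = %scan(&tablist, &i, %str( ));
         data valasr.&tab;
            set srclib.&tab;
         run;
      %end;
   %mend sweep_to_lasr;
   %sweep_to_lasr

   /* 3. Register (or refresh) the in-memory tables in metadata so that  */
   /*    SAS Visual Analytics reports and explorations can consume them. */
   proc metalib;
      omr (library="VA LASR Library");
      folder "/Shared Data/VA LASR Tables";
   run;

The LASR Sweeper transformation described in the next section layers its Options tab parameters on top of this skeleton: include and exclude filters, smart reloading based on modified dates or record counts, optional removal of dropped sources, and the HDFS check shown in Figure 2.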

LASR SWEEPER

Much like the autoloader, the LASR Sweeper transformation can load many data sets (tables or views) into SAS LASR Analytic Server automatically. The parameters of this transformation give you complete control over the sweeper process. Furthermore, because this custom transformation simply calls SAS macros, any experienced SAS developer can customize it further to meet your environment's requirements. Not all of the default tabs available in the transformation are used (for example, Mappings and Table Options). The key tab used to set the parameters is the Options tab. These parameters are described in Table 2.

Options: General Options

   Remove from LASR library
      No (default): All tables remain in the LASR library (and HDFS) even if the corresponding source is removed.
      Yes: Tables are removed from the LASR library (and HDFS) when the corresponding source is removed.

   Reload even if record count doesn't change
      No (default): Tables are reloaded only when the number of observations differs between the source and LASR tables.
      Yes: Tables are reloaded even when the number of observations does not change. This is useful when existing records in source tables change.
      Note: If PROC CONTENTS can return a modified date on the source data (such as Base SAS), then this option is ignored.

   Extended Options
      Enables extended options for variable names (for example, spaces in names).
      No (default): Strict rules for variable names must be followed.
      Yes: Variable names may have spaces and other characters in them, much like SAS Enterprise Guide allows by default.

   Data set filter (include)
      Filters the tables to be loaded. Use % as a wildcard, or use a delimited list for multiple tables (separated by a blank, a comma, a colon, or a semicolon). For example: contains "VA": %VA%; starts with "VA": VA%; TableA and TableB: TableA TableB.

   Data set filter (exclude)
      Filters the tables to be skipped during the load. The syntax is the same as for the Data set filter (include) parameter.

   Restart LASR server
      Restarts the LASR server prior to loading data. Not commonly used.

Options: Source

   Source library
      Select the source library from the drop-down list. Further information can be found in the appendix.

   Source metadata folder
      The folder location for the metadata objects of the source tables to be registered.

   Remove labels
      No (default): Keeps the labels/descriptions for the variables in the tables.
      Yes: Removes the labels/descriptions.
      SAS Visual Analytics uses the first 32 characters of a label for the variables in a LASR table. If no labels are available, it uses the variable name. This option was included because sometimes the first 32 characters of a label are not unique in a data set, whereas variable names are required to be unique.

   (Optional) Source LIBNAME
      If the %SET_LIBRARY macro does not work on the source libref, you can manually enter a LIBNAME statement here. Further information can be found in APPENDIX: SUBSTRUCTURE FOR THE TRANSFORMS.

Options: LASR

   LASR library
      Select the LASR library from the drop-down list. Further information can be found in the appendix.

   LASR metadata folder
      The folder location for the metadata objects of the LASR tables to be registered.

   Update metadata
      Yes (default): The metadata for the LASR tables is refreshed each time a table is loaded or reloaded.
      No: Metadata is not updated after a reload. This is commonly used when the structure of the tables does not change, since it is more efficient.

   Metadata update override
      Note: This is an example of a customization that SAS IT put into place. Updates to LASR metadata can cause a performance hit to the overall environment. By default, the logic in the sweeper (with this customization) is that metadata does not update between 8 a.m. and 5 p.m. (business hours).
      No (default): LASR metadata does not update during business hours.
      Yes: LASR metadata updates during business hours.

Options: HDFS (optional)

   HDFS library
      Select the HDFS library from the drop-down list. Further information can be found in the appendix.

   HDFS metadata folder
      The folder location for the metadata objects of the HDFS tables to be registered.

Table 2: LASR Sweeper Parameters
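The two data set filter parameters accept SQL-style patterns, so one way to picture their effect is the WHERE clause they imply against the DICTIONARY tables. The library name and patterns in this sketch are examples only, not values used by the transformation itself:

   /* Include filter %VA%  -> member names that contain VA               */
   /* Exclude filter TMP%  -> skip member names that start with TMP      */
   proc sql;
      select memname
         from dictionary.tables
         where libname = 'SRCLIB'
           and memtype in ('DATA','VIEW')
           and upcase(memname) like '%VA%'
           and upcase(memname) not like 'TMP%';
   quit;

A delimited include list such as TableA TableB behaves roughly like an IN (...) condition on exact member names rather than a LIKE pattern.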

HDFS SWEEPER

In an environment with co-located HDFS, the HDFS Sweeper transformation is beneficial to place in the process flow just before the LASR Sweeper. This transformation loads many data sets, tables, or views into HDFS automatically. The important parameter to consider here is the size threshold for loading into HDFS. It allows you to load all data sets greater than 10 GB (for example) into HDFS but skip the others. Then, when the LASR Sweeper runs, it knows to look in the HDFS library before loading from source, based on the HDFS parameters found in the LASR Sweeper. If a corresponding data set, by name, is in HDFS, it lifts that HDFS table (SASHDAT) into the LASR library instead of loading from source. The parameters in the HDFS Sweeper transformation are described in Table 3. You will notice similarities to the LASR Sweeper.

Options: General Options

   Minimum size to load to HDFS (in Gb)
      Tables in the source library are loaded to HDFS if their size is equal to or larger than the value set in this parameter. The size is calculated by multiplying the number of records by the record length (no compression is taken into account). This is a drop-down list, but you can also type over the value to input anything you wish. Setting the parameter to 0 loads all source tables to HDFS.

   Remove from HDFS
      No (default): All tables remain in HDFS even if the corresponding source is removed.
      Yes: Tables are removed from HDFS when the corresponding source is removed.

   Reload even if record count doesn't change
      No (default): Tables are reloaded only when the number of observations differs between the source and HDFS tables.
      Yes: Tables are reloaded even when the number of observations does not change. This is useful when existing records in source tables can be updated.
      Note: If PROC CONTENTS can return a modified date on the source data (that is, Base SAS), then this option is ignored.

   Extended Options
      Enables extended options for variable names (for example, spaces in names).
      No (default): Strict rules for variable names must be followed.
      Yes: Variable names may have spaces and other characters in them, much like SAS Enterprise Guide allows by default.

   Data set filter (include)
      Filters the tables to be loaded. Use % as a wildcard, or use a delimited list for multiple tables (separated by a blank, a comma, a colon, or a semicolon). For example: contains "VA": %VA%; starts with "VA": VA%; TableA and TableB: TableA TableB.

   Data set filter (exclude)
      Filters the tables to be skipped during the load. The syntax is the same as for the Data set filter (include) parameter.

Options: Source

   Source library
      Select the source library from the drop-down list. Further information can be found in the appendix.

   Source metadata folder
      The folder location for the metadata objects of the source tables to be registered.

   (Optional) Source LIBNAME
      If the %SET_LIBRARY macro does not work on the source libref, you can manually enter a LIBNAME statement here. Further information can be found in APPENDIX: SUBSTRUCTURE FOR THE TRANSFORMS.

Options: HDFS

   HDFS library
      Select the HDFS library from the drop-down list. Further information can be found in the appendix.

   HDFS metadata folder
      The folder location for the metadata objects of the HDFS tables to be registered.

   HDFS block size
      Select the best block size for HDFS storage. Automatic (default): calculates a block size based on the table size and the number of data nodes on the grid.

   HDFS copies
      The number of extra copies of the file blocks to save for replication and node failover. Increasing this number also increases the amount of HDFS space that is needed to store these tables.

Table 3: HDFS Sweeper Parameters
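To make the size threshold concrete, the sketch below estimates each table's size the same way the parameter description does (records multiplied by record length, with no compression considered) and writes the qualifying tables to a co-located HDFS path through the SASHDAT engine. The host, TKGrid install path, HDFS path, 10 GB threshold, and data set options are placeholders and assumptions, not values taken from the transformation:

   libname srclib "/data/source";

   /* SASHDAT library pointing at the co-located HDFS directory          */
   /* (host, install path, and HDFS path are placeholders).              */
   libname vahdfs sashdat path="/vadata" host="grid001.example.com"
           install="/opt/TKGrid";

   /* Keep the names of source tables whose estimated size (records      */
   /* times record length) meets the 10 GB threshold.                    */
   proc sql noprint;
      select memname into :bigtabs separated by ' '
         from dictionary.tables
         where libname = 'SRCLIB'
           and nobs * obslen >= 10 * 1024**3;
   quit;

   /* Write each qualifying table to HDFS; COPIES= and BLOCKSIZE= mirror */
   /* the HDFS copies and HDFS block size parameters in Table 3.         */
   %macro sweep_to_hdfs;
      %do i = 1 %to %sysfunc(countw(&bigtabs, %str( )));
         %let tab = %scan(&bigtabs, &i, %str( ));
         data vahdfs.&tab (copies=1 blocksize=32m);
            set srclib.&tab;
         run;
      %end;
   %mend sweep_to_hdfs;
   %sweep_to_hdfs

When the LASR Sweeper subsequently runs, a table found in this HDFS library is lifted into LASR from its SASHDAT file instead of being reread from the source.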

PRE SWEEPER

The sweeper data-load method is based on the ability to access and interrogate the data source via a SAS LIBNAME statement. If your data source is a file format, such as CSV, XLS, or XLSX, that requires access via a SAS FILENAME statement, use the PRE Sweeper transformation. The PRE Sweeper was developed to convert these types of files to SAS data sets so that a LIBNAME statement can be used. This same methodology is used by the autoloader.

For this transformation to work correctly, you need to make sure the files are in a format that works well as a SAS data set. For example, the first row needs to contain the column/variable names, and these variable names must be valid SAS names. While the PRE Sweeper can handle many generic types of data imports, there might be nonstandard or complex imports that are currently beyond its scope. By using the current PRE Sweeper macro as a guide, you can extend its capabilities to create more customizable import processes as needed. From there, the HDFS and LASR Sweepers complete the data load. The parameters for the PRE Sweeper are shown in Table 4.

Options: General Options

   Source folder
      The location of the file(s) to import.

   Source encoding
      The encoding of the input files.

   Target library
      Select the target library from the drop-down list. This can be the same physical location as the source folder. Further information can be found in the appendix.

Table 4: PRE Sweeper Parameters
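In its simplest form, the conversion that the PRE Sweeper performs can be pictured as one PROC IMPORT per file. The paths, file name, and target library below are hypothetical; the real PRE Sweeper works from the Source folder and Source encoding parameters described in Table 4:

   libname stage "/data/presweep";   /* target library for converted files */

   /* Convert one delimited file; the first row supplies valid SAS       */
   /* variable names, as the PRE Sweeper requires.                       */
   proc import datafile="/data/inbox/weblog_2015_04_28.csv"
               out=stage.weblog_2015_04_28
               dbms=csv
               replace;
      getnames=yes;
      guessingrows=1000;
   run;

Once the file exists as a SAS data set in the target library, the HDFS and LASR Sweepers (or the HDFS Stacker, described next) can reach it through a LIBNAME statement.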

HDFS STACKER

The sweeper technique is ideal when you have many different data sets that you want to maintain independently; the tables need to stay separate. But what about when you have many data sets with the same structure that you eventually want to put into one LASR table? A great example of this scenario is analyzing daily web logs. After a day's file is complete, it never changes, so appending to what is already loaded is much more efficient. Because LASR tables can be dropped out of memory and lost forever (for example, in the event of a LASR server reboot), a persistent storage mechanism is critical. The HDFS Stacker method provides just that. Here are the basic steps:

1. If necessary, use the PRE Sweeper or create a custom job to process each file.

2. Store these files in a common location in HDFS (that is, in the same path) as SASHDAT files. This can be done by using the HDFS Sweeper. Name the files with an informative naming convention that tells you which file holds which subset of data. In the weblog example, this would be something like weblog_yyyy_mm_dd, where YYYY, MM, and DD represent the year, month, and day of each file.

3. After all your data is processed and stored in HDFS, point the HDFS Stacker to that HDFS location/path. It then loads each SASHDAT file into a single LASR table.

Table 5 describes the parameters of the HDFS Stacker.

Options: HDFS

   HDFS library
      Select the HDFS library from the drop-down list. Further information can be found in the appendix.

   Data set filter (include)
      Filters the tables to be loaded. Use % and ? as wildcard characters. For example: contains "VA": %VA%; starts with "VA": VA%; a patterned name like weblog_2015_04_28: weblog_????_??_??.

Options: LASR

   LASR library
      Select the LASR library from the drop-down list. Further information can be found in the appendix.

   LASR metadata folder
      The folder location for the metadata objects of the LASR tables to be registered.

   LASR table name
      The name of the single table into which all the HDFS tables are stacked.

Table 5: HDFS Stacker Parameters

CONCLUSION

SAS Visual Analytics has several powerful integrated applications and processes for loading data into the SAS LASR Analytic Server. By using any of the PRE Sweeper, LASR Sweeper, HDFS Sweeper, or HDFS Stacker autoloading techniques, you can extend that capability. These custom transformations benefit your users by allowing them to focus on reporting and exploring the data instead of loading the data. Furthermore, the ability to fully automate the data loading and refresh processes allows your developers to focus on more complex challenges.

APPENDIX: SUBSTRUCTURE FOR THE TRANSFORMS

The preceding transformations all require some underlying macros, tables, and jobs to work properly. All of these objects will be packaged together and available on the SAS Global Forum website for download. Here is a brief description of each object:

   %SET_LIBRARY
      A macro that queries the metadata repository and generates LIBNAME statements from libraries that are registered in metadata.

   Artifacts Library
      A library that contains one table and three views. The table contains information about libraries that are registered in metadata. The views subset the table into source libraries, LASR libraries, and HDFS libraries. These views are used to populate the drop-down controls in the transformations.

   Update Artifacts Job
      A job that updates the Artifacts Library. This job can be scheduled to run at a certain time, or it can be kicked off manually any time a library in metadata is changed, added, or removed.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Jason Shoffner
SAS Institute Inc.
919-531-2110
jason.shoffner@sas.com
www.sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.