Getting the Most from Hash Objects. Bharath Gowda



Getting the most from Hash objects

Techniques covered are:
- SQL join
- Data step merge using the BASE engine
- Data step merge using the SPDE engine
- Index key lookup
- Hash object lookup
- Simple join, iteration and sorting with hash

Test setup
- Only local SAS datasets are used.
- Datasets (including SPDE) reside in the local WORK library directory.
- Storing data locally avoids network delays and reduces I/O delays.
- The environment has a fast solid-state drive, which further decreases I/O delay.
- The same join logic is implemented with every technique.

PROC SQL
- Requires no sorting prior to joins.
- Multi-way joins can be performed.
- SQL uses internal utility tables for joins.

    proc sql;
      create table acute_pred as
      select a.*, b.nwau_sas
      from pred_cost as a
           inner join speciality as b
           on  a.yearid = b.yearid
           and a.facility_identifier = b.facility_identifier;
    quit;

Rows in pred_cost                  13,160,905
Rows in speciality                 16,170,805
Rows in acute_pred (inner join)    12,001,616
Real time                          0:20:28.79
User CPU time                      0:14:15.31
Memory                             254512.00k

Data step merge
- All input datasets must first be sorted by the key variables.

    /* sort both inputs by the join keys */
    proc sort data=pred_cost out=srt_pred_cost;
      by yearid facility_identifier;
    run;

    proc sort data=speciality
              out=srt_spec(keep=yearid facility_identifier nwau_sas);
      by yearid facility_identifier;
    run;

    /* keep only rows present in both inputs (inner join) */
    data acute_pred;
      merge srt_pred_cost(in=a) srt_spec(in=b);
      by yearid facility_identifier;
      if a and b;
    run;

Rows in pred_cost                  13,160,905
Rows in speciality                 16,170,805
Rows in acute_pred (inner join)    12,001,616
Real time (avg, sorting and merging combined)       0:30:28.79
User CPU time (avg, sorting and merging combined)   0:21:15.31
Memory (combined avg)              162393.59k

SPDE merge
- SPDE stands for Scalable Performance Data Engine.
- SPDE combines software and hardware capabilities.

    /* SPDE library in the WORK directory */
    libname workspde spde "%sysfunc(pathname(work))" temp=yes;

    /* copy the two input tables into the SPDE library */
    proc copy in=work out=workspde;
      select pred_cost speciality;
    run;

    data workspde.acute_pred;
      merge workspde.pred_cost(in=a)
            workspde.speciality(in=b keep=nwau_sas yearid facility_identifier);
      by yearid facility_identifier;
      if a and b;
    run;

Rows in pred_cost                  13,160,905
Rows in speciality                 16,170,805
Rows in acute_pred (simple join)   12,001,616
Real time                          0:16:15.31
User CPU time                      0:12:28.79
Memory                             1261040.70k

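Index key lookup
- The main dataset does not need to be sorted.
- Uses multiple SET statements, the second with the KEY= option.
- The automatic variable _IORC_ holds the result of each lookup and must be checked.

    /* build a composite index on the lookup table */
    proc datasets lib=work nolist;
      modify speciality;
      index create myindex=(yearid facility_identifier);
    run;
    quit;

    data acute_pred;
      set pred_cost;
      set speciality key=myindex;
      /* _IORC_ = 0 means a match was found; drop unmatched rows and
         reset _ERROR_ so they do not produce error dumps in the log */
      if _iorc_ ne 0 then do;
        _error_ = 0;
        delete;
      end;
    run;

Rows in pred_cost                  13,160,905
Rows in speciality                 16,170,805
Rows in acute_pred (inner join)    12,001,616
Real time                          0:12:15.31
User CPU time                      0:06:28.79
Memory                             142279.71k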
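To show why _IORC_ needs attention, here is a minimal sketch of a KEY= lookup that keeps unmatched rows (left-join style). The tables pred_cost_demo, speciality_demo and lookup_demo and their values are hypothetical toy data, not the tables used in the timings. It assumes the %SYSRC autocall macro is available, and uses the /UNIQUE option so that consecutive duplicate keys in the main table each restart the index search.

```sas
/* Hypothetical toy tables; names and values are illustrative only. */
data pred_cost_demo;
  input yearid facility_identifier $;
  datalines;
2017 A01
2017 A01
2018 B02
2017 C03
;
run;

data speciality_demo(index=(myindex=(yearid facility_identifier)));
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2017 C03 2.0
;
run;

data lookup_demo;
  set pred_cost_demo;
  /* /UNIQUE restarts the index search from the top for every row */
  set speciality_demo key=myindex / unique;
  select (_iorc_);
    when (%sysrc(_sok)) ;              /* match found: keep looked-up value */
    when (%sysrc(_dsenom)) do;         /* no match for this key */
      _error_ = 0;                     /* avoid an error dump in the log */
      call missing(nwau_sas);          /* clear value left over from the last match */
    end;
    otherwise do;                      /* any other I/O condition */
      put 'ERROR: unexpected _IORC_ value ' _iorc_=;
      stop;
    end;
  end;
run;
```

Replacing the `call missing` branch with `delete;` would turn this sketch back into an inner join.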

HASH Objects
- One of the fastest approaches for lookups, because the lookup table is held in memory.
- Dataset size is a key factor for memory consumption.
- Pre-sorting is not required.
- The hash object needs to be defined and instantiated.
- Lookups are performed with the find() method.

    data acute_pred;
      length nwau_sas 8;
      if _n_ = 1 then do;
        /* load the lookup table once; duplicate: 'e' reports an error
           if the lookup table contains duplicate keys */
        declare hash hn16(dataset: 'speciality', duplicate: 'e');
        hn16.definekey('yearid', 'facility_identifier');
        hn16.definedata('nwau_sas');
        hn16.definedone();
      end;
      set pred_cost;
      /* rc = 0 when the key is found; nwau_sas is then filled in */
      rc = hn16.find(key: yearid, key: facility_identifier);
      if rc = 0;
    run;

HASH Objects
- Simple inner joins and left joins can be achieved by checking the return code variable. The subsetting statement "if rc=0;" keeps matched rows only (inner join).

Rows in pred_cost                  13,160,905
Rows in speciality                 16,170,805
Rows in acute_pred (inner join)    12,001,616
Real time                          0:08:28.79
User CPU time                      0:03:15.31
Memory                             1465437.28k
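The left-join case mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the timings: the tables pred_cost_demo, speciality_demo and left_join_demo, their values, and the flag variable matched are hypothetical. Instead of subsetting on rc, every row of the main table is kept and the return code only decides whether the looked-up value is populated.

```sas
/* Hypothetical toy tables; names and values are illustrative only. */
data pred_cost_demo;
  input yearid facility_identifier $;
  datalines;
2017 A01
2018 B02
2017 C03
;
run;

data speciality_demo;
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2017 C03 2.0
;
run;

data left_join_demo;
  length nwau_sas 8;
  if _n_ = 1 then do;
    declare hash h(dataset: 'speciality_demo');
    h.definekey('yearid', 'facility_identifier');
    h.definedata('nwau_sas');
    h.definedone();
  end;
  set pred_cost_demo;
  rc = h.find();                           /* uses the current key values in the PDV */
  if rc ne 0 then call missing(nwau_sas);  /* defensive: no stale value on a miss */
  matched = (rc = 0);                      /* hypothetical flag: 1 = key found */
  drop rc;
run;
```

Adding `if rc = 0;` after the find() call would reduce this to the inner join shown on the slide.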

HASH Iterators
- Hash iterator objects need to be defined and instantiated.
- A hash object is assigned to each hash iterator object.
- The first() and next() methods iterate from top to bottom; the last() and prev() methods iterate from bottom to top.
- Can be used to get the top and bottom records.

    data top bottom;
      /* never executed; defines the key and data variables in the PDV */
      if 0 then set speciality;
      if _n_ = 1 then do;
        declare hash hn16(dataset: 'speciality', ordered: 'descending');
        hn16.definekey('yearid', 'facility_identifier');
        hn16.definedata(all: 'yes');
        hn16.definedone();
        declare hiter iter_ex('hn16');
      end;
      /* hn16.output(dataset: 'sort_dec'); */

      /* first 10 items, walking forward from the top */
      iter_ex.first();
      do i = 1 to 10;
        output top;
        iter_ex.next();
      end;

      /* last 10 items, walking backward from the bottom */
      iter_ex.last();
      do i = 1 to 10;
        output bottom;
        iter_ex.prev();
      end;

      stop;   /* a single pass is enough; no rows are read with SET */
      drop i;
    run;
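The commented-out output() call hints at sorting with a hash object. A minimal sketch of that idea follows. The tables speciality_demo and sort_dec_demo and their values are hypothetical toy data. The sketch assumes SAS 9.2 or later for the multidata: 'yes' argument, which keeps rows with duplicate keys instead of dropping them.

```sas
/* Hypothetical toy table; names and values are illustrative only. */
data speciality_demo;
  input yearid facility_identifier $ nwau_sas;
  datalines;
2017 A01 1.5
2018 B02 0.7
2017 C03 2.0
2018 B02 0.9
;
run;

data _null_;
  /* never executed; defines the variables in the PDV at compile time */
  if 0 then set speciality_demo;
  declare hash h(dataset: 'speciality_demo',
                 ordered: 'descending',
                 multidata: 'yes');
  h.definekey('yearid', 'facility_identifier');
  h.definedata(all: 'yes');
  h.definedone();
  /* write the hash contents, in key order, to a new dataset */
  h.output(dataset: 'sort_dec_demo');
  stop;
run;
```

The whole table has to fit in memory for this to work, so it is a trade of memory for the I/O a PROC SORT would otherwise spend.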

Avoid or use?
Where should you apply these techniques?

Techniques compared: Proc SQL, Data step merge, Index lookup (key=), SPDE merge, Hash lookup, Hash iterators.

Scenarios considered:
- Small to medium sized tables (< 10,000 rows)
- Huge tables (> 10 million rows)
- Unsorted data (> 10 million rows)
- Datasets with multiple indexes
- Less memory-hungry techniques

Getting the most from Hash objects Bharath Gowda, SAS analyst Independent Contractor 0468304568 bharathg1307@gmail.com