Surfing the SAS cache

Surfing the SAS cache to improve optimisation. Michael Thompson, Department of Employment / Quantam Solutions

Background Did my first basic SAS course in 1989 and didn't get it at all. Actively avoided SAS programming; I had very capable team members to do it for me. Went consulting in 1999. In my first interview I was asked a lot of SAS questions, which I could answer because I had had good SAS people working for me. I wasn't asked if I could program in SAS! I only realised, starting on day one, that SAS was required, and had to learn via the SAS tech support web pages really quickly. I discovered quite quickly that SAS PROC SQL was both quick to program and ran fast. Yes, I am an unashamed SQL zealot!

Let's make this interactive. Don't be shy! Better ideas, observations, insights: there might be 500-1000 years of SAS experience in the room!

What is going on that affects us Budget cuts drive the push for more efficiency: we must do more with less. There are demands for more responsiveness to support the business of our organisations. Government must follow private enterprise, understand at a finer level the attributes and needs of the public we serve, and be able to quickly measure the effectiveness of both new and old policy.

What is going on that affects us Big Data. (Sure, it is one of the latest buzzwords, however...) The data is growing massively and will not stop; maybe in the next few years more data will be created than in the last 40,000. With so much data around, an important skill will be the ability to discern which data to ignore, but also the creativity to identify surprising new ways to use new data to serve our organisations.

What is going on that affects us The advent of the micro-policy is coming: policies benefiting small numbers of people, developed and implemented quickly, evaluated quickly, and, if they fail, made to fail fast. To facilitate concepts like micro-policies it is up to us to ensure our IT areas have the right data available, are even more flexible and responsive, and that our output is trusted (both in perception and in reality).

What is going on that affects us A trend to open public data to scrutiny. Obama's second administration directed government (taking account of privacy and national security) to publish government data and open it to scrutiny. "Given enough eyeballs, all insights are shallow": a slight twist on Linus's Law (named for the Linux creator), applied to crowdsourced research. This may mean our data processes need to be more robust, and we may have an increased workload: we may more often need to verify insights identified in our data by others.

What used to be most important For SAS programmers (creativity aside): writing code so it executed quickly. Getting IF tests in the right order to reduce the number of tests executed. Using ELSE to avoid executing unneeded IF tests. Using WHERE statements to subset data as it is SET. Keeping the number of fields, and their lengths, down to a minimum.
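A minimal sketch of those classic habits (the library, table and variable names here are hypothetical, for illustration only):

```sas
/* Hypothetical example of the classic efficiency habits listed above */
data work.claims_small;
    /* keep= subsets the columns read in; WHERE filters observations
       before they are loaded into the program data vector */
    set mylib.claims (keep=id status amount);
    where amount > 0;

    /* Put the most common case first, and use ELSE so later
       tests are skipped once one condition is true */
    if      status = 'A' then group = 1;
    else if status = 'B' then group = 2;
    else                      group = 3;
run;
```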

Now things have changed Moore's law has continued to work. CPU speeds have increased massively. Disk storage is getting larger and cheaper per GB.

Moore's Law Transistor counts on integrated circuits double roughly every 2 years. (Ref: Wikipedia, Moore's Law)

Why change the way I do things now? Computers will just get faster and help me keep up! Unfortunately, the speed at which data can be read, written and transmitted has not kept up. This means that as our data volumes increase, even though processors are getting faster and storage is getting bigger and cheaper, read/write speeds are not keeping pace. It doesn't matter how fast our CPUs are or how efficiently we write our code if the bottleneck is reading data from, and writing data to, storage.

As a SAS programmer what things can we do? Writing our programs and processes more quickly helps no matter what the data volumes. Where our data volumes are high, engineer processes which optimise/minimise IO.

As a SAS programmer what things can we do? (to speed up joining data from 2 or more tables) I personally believe that at the moment (and this may change as we make more and more use of solid state memory) the best thing we can do to speed up our SAS is to understand how the cache operates and work with it. When SAS reads data from storage it doesn't just read one record: it reads many into cache.

As a SAS programmer what things can we do? (to speed up joining data from 2 or more tables) The #1 thing we can do to optimise the cache is to ensure the tables we are joining are sorted by the key variable we are joining on, and preferably indexed by that variable as well. This means that when we start to read two tables into memory, as we join the first records from each table we have also just read the data for thousands of subsequent matches.
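A sketch of that preparation step, using hypothetical table names (the idea is simply: sort both inputs by the join key, then index that key):

```sas
/* Sort both inputs by the join key, so that each cache read pulls in
   thousands of the records we are about to match next */
proc sort data=mylib.benhist out=work.benhist_s;
    by ssr;
run;

proc sort data=mylib.customer out=work.customer_s;
    by ssr;
run;

/* Index the join key on both sorted tables as well */
proc datasets library=work nolist;
    modify benhist_s;
        index create ssr;
    modify customer_s;
        index create ssr;
quit;
```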

As a SAS programmer what things can we do? (to speed up joining data from 2 or more tables) This utilisation of the data flowing through cache can be further enhanced by ensuring the tables are compressed (preferably using SPDE binary compression). If you really need to make the match even faster, ensure that only the fields needed in the output are in the input files, allowing even more records to be read in each bite of the cache.
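Combining both ideas, a hypothetical sketch (library path, table and variable names are invented for illustration): an SPDE library with binary compression, and keep= options on the inputs so each cache read carries more of the rows we actually need.

```sas
/* Hypothetical SPDE library using binary compression */
libname spd spde '/data/spde' compress=binary;

proc sql;
    /* keep= trims each input to just the fields needed in the output,
       so more matching rows fit in every cache read */
    create table spd.cald_profile as
    select a.ssr, a.bentype, b.cob
    from mylib.benhist  (keep=ssr bentype end) as a
    join mylib.customer (keep=ssr cob)         as b
        on a.ssr = b.ssr
    where a.end is missing;
quit;
```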

Caveats Many of these advantages can be lost if more than two tables are joined at the same time. Internally, when joining three tables, SAS actually joins two tables, writes the output to WORK, and then joins that WORK table to the third. Unfortunately WORK files are not SPDE compressed, so the reads and writes to WORK are slow.
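One hedged work-around for the three-table case, with hypothetical names: rather than letting a single 3-way join stage its intermediate in an uncompressed WORK file, stage the intermediate yourself in an SPDE-compressed library and join in two explicit steps.

```sas
/* Hypothetical SPDE library for the intermediate result */
libname spd spde '/data/spde' compress=binary;

proc sql;
    /* Step 1: join the first two tables; the intermediate lands in an
       SPDE-compressed library instead of uncompressed WORK */
    create table spd.step1 as
    select a.ssr, a.bentype, b.cob
    from mylib.benhist as a
    join mylib.customer as b
        on a.ssr = b.ssr;

    /* Step 2: join the compressed intermediate to the third table */
    create table spd.final as
    select s.*, c.postcode
    from spd.step1 as s
    join mylib.address as c
        on s.ssr = c.ssr;
quit;
```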

Work-around: only do this with great care. Some SAS processes will fail if you follow this idea, e.g. SAS/GRAPH.

/* The following code gets SAS to utilise SPDE compression
   by default for work files - be careful!!! */
options obs=max compress=binary;
%let path=%sysfunc(pathname(work));
libname s spde "&path";
run;
options user=s;

Questions

As a SAS programmer what things can we do? #1: Learn SQL. Data steps describe the path to solving a problem; SQL semantically describes the answer and delivers it. When joining tables (datasets), SQL almost always outperforms SAS MERGE. Short SQL programs can often replace processes with 100s of lines of code, and can avoid complicated data steps containing RETAIN and SET ... BY statements.
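As one small illustration of that last point (table and variable names hypothetical): a per-key summary that would otherwise need a PROC SORT followed by a data step with BY, RETAIN and FIRST./LAST. processing collapses into a few lines of SQL.

```sas
/* Hypothetical example: per-customer totals without sort/retain/by logic */
proc sql;
    create table work.cust_totals as
    select ssr,
           sum(amount) as total_amount,
           count(*)    as n_payments
    from mylib.payments
    group by ssr;
quit;
```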

Results of testing file types. A join between a 10 million row table and a 30 million row table:

                     Regular SAS DS                                  SPDE
           No order,    Random order,   Index ordered   Random order,   Index ordered
           no index     indexed         & indexed       indexed         & indexed
Merge
  Elapsed  2m17s        14m34s          1m21s           1h15m54s        1m05s
  User CPU 16s          2m08s           23s             1h25m13s        31s
  Sys CPU  4s           2m47s           6s              5m13s           6s
SQL
  Elapsed  1m23s        50s             41s             31s             24s
  User CPU 14s          25s             16s             27s             20s
  Sys CPU  8s           6s              6s              2s              1s

This also shows the benefit of the SPDE format for SAS tables (datasets). Note that the worst performing SQL join was only just slower than the best merge.

Joining/merging data There are many ways to join or merge data and get the correct result, but all are not equal. When choosing a strategy, what matters is: speed of coding; speed of execution (run time / processor time); maintenance; size of data.

Joining/merging data What makes a difference: file sizes; sort order; indexing; technique (merge / SQL join / sort / other); whether the code will be run regularly or one-off; whether SPDE will help.

Join types Traditional join options: SQL join; MERGE.

Joining/merging data

15   proc sql;
16   create table CALD_Profile as
17   select a.ssr
18         ,start
19         ,a.bentype
20         ,b.COB
21         ,end
22   from REDlast.benhist as a join
23        redlast.customer as b
24   on a.ssr eq B.ssr
25   where a.end eq .
26
27   ;
NOTE: Table WORK.CALD_PROFILE created, with 4662787 rows and 5 columns.
28   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time          19.93 seconds
      user cpu time      20.26 seconds
      system cpu time     1.82 seconds

16   data CALD_Profile_MG (keep=ssr
17                              start
18                              bentype
19                              COB
20                              end);
21   Merge REDlast.benhist (in=a)
22         redlast.customer;
23   by ssr;
24   if end eq . and a;
25   run;
NOTE: There were 30752526 observations read from the data set REDLAST.BENHIST.
NOTE: There were 11543751 observations read from the data set REDLAST.CUSTOMER.
NOTE: The data set WORK.CALD_PROFILE_MG has 4662787 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time          50.01 seconds
      user cpu time      43.86 seconds
      system cpu time    24.24 seconds

As a SAS programmer what things can we do? #2: Look after the IO (when data volumes are high). Understand how you can make the cache work for you. The order of your data can matter a lot. Learn how SPDE format data can help.

Demo Joining/merging data: sort order matters!

SPDE makes a big difference to IO and footprint. High density data can actually increase in size by ~25% when regular SAS compression is applied; SPDE binary compression can cut the size of the same dataset in half: faster to write, faster to read! Remember: SPDE files CAN NOT BE MOVED outside SAS.

NOTE: Compressing data set JKN.RAND_ORDER_INDEXED_MG increased size by 27.59 percent. Compressed is 47549 pages; un-compressed would require 37266 pages.
NOTE: MODIFY was successful for JKS.BENHIST_RAND_ORDER_INDSPD.DATA.
NOTE: Compressing data set JKS.BENHIST_RAND_ORDER_INDSPD decreased size by 71.36 percent.
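A minimal sketch of how such a comparison could be run (library names and paths are hypothetical; the size-change NOTEs like those quoted above appear in the SAS log):

```sas
/* Regular SAS engine with binary compression:
   dense data may actually GROW, as the log NOTE above shows */
options compress=binary;

data mylib.big_copy_v9;
    set mylib.bigtable;
run;

/* Same table written through an SPDE library with binary compression:
   typically shrinks substantially, and is faster to read and write */
libname spd spde '/data/spde' compress=binary;

data spd.big_copy_spde;
    set mylib.bigtable;
run;
```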