Improving Performance and Ensuring Scalability of Large SAS Applications and Database Extracts

Michael Beckerle, Chief Technology Officer, Torrent Systems, Inc., Cambridge, MA

ABSTRACT

Many organizations that have invested heavily in data warehouses, data mining systems, and other large business intelligence systems are now discovering that the sheer volume of data being processed prevents large jobs from completing within the batch window. In some notable cases, long runtimes have been crippling, preventing the data warehouse from achieving its goals. In many cases, though, the long runtimes represent an inefficient use of the data warehouse that exacts other significant costs, for example, preventing the organization from running the job more frequently.

Many organizations are therefore looking to cut batch runtimes by using multiprocessor systems with parallel processing capabilities, for example, SMPs, MPPs, and clusters. Parallel databases like DB2 UDB make excellent use of these machines because they partition data across multiple nodes and process all database functions (including insert, update, and delete) in parallel. However, like many other applications, SAS jobs do not take advantage of the parallel processing capabilities of multiprocessor computers, even when working with a parallel database. Because a SAS process can connect to only one DB2 UDB node at a time, all data must be passed to that node and then funneled through a single cursor to the SAS application. This process creates significant bottlenecks, essentially serializing the parallel database.

Eliminating this sequential bottleneck would appear to be a straightforward matter of converting sequential SAS applications to run in parallel. Unfortunately, most users believe that the way to achieve parallel processing is through parallel programming, and they recognize that parallel programming is a complex, costly, and even arcane science. Few organizations regard it as a commercially viable undertaking.

New technology now makes it both technically and economically feasible to achieve parallel processing without parallel programming. Torrent Systems in Cambridge, MA, has developed an environment for building and running high-performance parallel applications which shields programmers from the complexities of parallel programming. Its SAS product, Orchestrate for SAS, lets developers easily specify how multistep, parallel operations should be set up; it readily feeds the output of one step as input to another; and it provides facilities for managing parallel execution.

THE TESTS

To prove the feasibility of using Orchestrate to cut runtime by converting a sequential SAS application to parallel mode, Torrent Systems and IBM developed and ran a set of representative workloads to test the performance impact of applying parallel processing to data extraction and SAS analysis. The test team ran two tests, both on an IBM RS/6000 SP massively parallel system. The first test extracted data from a parallel IBM DB2 UDB database and passed it to a SAS application. The second test used an actual SAS inventory forecasting application, which processed data from flat files. Each test was first run sequentially and then in parallel, incrementally increasing the number of nodes to measure scalability.

The tests demonstrated that SAS performance increased dramatically when the extracts and processing were run in parallel instead of sequentially. The tests also showed that parallel processing allowed the applications to process the workload faster as the number of processors increased. This phenomenon, called scalability, was exhibited consistently throughout the tests. In fact, results showed linear scalability through all 32 processors. (An example of linear scalability is 12 processors providing 12 times the performance of a single processor.)
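The notion of linear scalability used above can be made concrete with two small helpers. This is a minimal sketch, not part of the original study: the function names `speedup` and `efficiency` and the sample runtimes are illustrative assumptions.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup = sequential runtime / parallel runtime."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, n_processors: int) -> float:
    """Parallel efficiency = speedup / processor count (1.0 means linear)."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return speedup(t_sequential, t_parallel) / n_processors


# Hypothetical runtimes in seconds, chosen only for illustration:
# a job that takes 1200 s on one processor and 100 s on 12 processors.
print(speedup(1200.0, 100.0))         # 12.0
print(efficiency(1200.0, 100.0, 12))  # 1.0
```

An efficiency of 1.0 corresponds to linear scalability; a value above 1.0 would indicate super-linear behavior.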
In addition, because the Torrent software enabled and managed the conversion of the application to parallel mode, the developers achieved the full benefits of parallel processing (performance and scalability) without parallel programming.

Orchestrate for SAS: The Pathway to Parallel SAS

Torrent Systems' Orchestrate for the SAS System enables SAS applications to execute in parallel and to interface in parallel with the DB2 UDB parallel database system on both SMP and MPP architectures. Orchestrate increased SAS performance by allowing the system to:

- Extract data in parallel from a parallel RDBMS
- Load the results of SAS programs back into the database in parallel
- Process parallel data streams with parallel instances of a SAS DATA or PROC step for much higher throughput rates
- Store large data sets in parallel, providing faster access and eliminating storage restrictions
- Stream data between SAS steps without having to write intermediate results to disk

The bottom line is that, with Orchestrate, SAS can process much larger volumes of data. Orchestrate is a commercially available and supported product designed to solve the parallel interface issue with minimal user effort, allowing SAS users to insert simple parallel directives into SAS programs.

SAS algorithms that can be partitioned into discrete components can be made parallel using a Multiple Processor Independent Data (MPID) parallel paradigm. In fact, nearly any SAS DATA or PROC step with a BY clause could be run in parallel. SQL statements that use a GROUP BY clause illustrate a class of statements that could execute in parallel. Each SAS processor would work on a separate "GROUP" or "PARTITION" of the data, allowing data extracted from a database in parallel to be processed efficiently on the node that holds it. This requires no processor-to-processor communication to move data. Figure 1 shows a typical sequential SAS/Access database extraction vs. an Orchestrate-enabled SAS/Access parallel database extraction.

Figure 1: Sequential vs. Parallel Extraction into Sequential SAS Data Set

Testing Methodology

To ensure that the workloads and configuration were realistic and that the comparisons between sequential and parallel processing were fair and meaningful, testers used real customer applications. One test used an actual SAS inventory forecasting application in use at an IBM customer site. This customer was using an IBM RS/6000 SP system, but running SAS applications on a single node, not taking full advantage of the system's multiple processors. The tests described here were run at the IBM Teraplex Integration Center in Poughkeepsie, NY.

TEST MACHINE CONFIGURATION

Both uniprocessor and SMP nodes were used in the system configuration, as is typical in customer environments, which allowed measurement of the performance characteristics of each type of node. The software used on each node varied depending on the test being run. For example, in some tests, SAS applications ran on only one node, whereas in other tests they ran on all nodes.
The DB2 UDB database was configured on different numbers of nodes (4, 8, 16, or 32), depending on the test being run. Census data was used, and the database was partitioned by serial number for 4, 8, 16, and 32 thin nodes.

Test 1: Parallel Extraction from the IBM DB2 UDB Database

Extraction of data from a parallel database is typically the first step in setting up SAS processing of data that has been stored in an RDBMS. It is also a good example of a process that can benefit from parallelism. Because sequential extraction requires that data distributed across a parallel table of an RDBMS be streamed from the various table nodes to a coordinator node, and then streamed out of the database, the flow of data is bottlenecked by the rate at which the coordinator node within the database can read and then write the data.

In the test, a sequential extraction and a parallel extraction were performed on a table spread out over 4, 8, 16, and then 32 nodes of the database, with the volume of data extracted from the table held constant throughout all tests. The tests extracted an entire table composed of 5 million rows (records) of census data. Each record was 304 bytes long and composed of 126 fields. Total data volume was 1.52 Gbytes.

Results of Parallel Extraction Test

In the sequential extraction (represented by the left side of Figure 1), SAS/Access used a single stream to extract the data from the database. The "serialization of the distributed data" occurred when data had to be moved from node to node and then streamed into the single coordinator node of the database. The parallel extraction (represented by the right side of Figure 1) eliminated the need to move data from node to node and also streamed the data out of the database in parallel. A sequential step under the control of Orchestrate then merged the data into a single file outside the database. This single data set then serves as a data mart for further SAS analysis. A key feature of Orchestrate is its ability to land data to disk in parallel.
Figure 2 shows a test that ran as a variation of the parallel extraction test shown in Figure 1. In this test, developers used Orchestrate to parallelize the extraction of 1.52 gigabytes of data from the database. Orchestrate ran a SAS/Access process on every node, which allowed each process to extract only the data local to its node. In this way, Orchestrate enabled SAS to extract all of the data from the parallel table without a sequential bottleneck at the coordinator node.
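The difference between extracting into one data set (N-to-1) and into parallel data sets (N-to-N) can be sketched in a few lines. This is a toy simulation under stated assumptions, not the product's interface: the function names, the modulo partitioning on a `serial` field, and the in-memory rows are all hypothetical, and Python threads are used only to show the structure of the data flow.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, n_nodes):
    """Distribute rows across n_nodes by serial number (toy modulo partitioning)."""
    partitions = [[] for _ in range(n_nodes)]
    for row in rows:
        partitions[row["serial"] % n_nodes].append(row)
    return partitions


def extract_local(partition):
    """Stand-in for one extraction process reading only its own node's rows."""
    return list(partition)


def extract_n_to_n(partitions):
    """Run one extraction per node; the results stay partitioned (N-to-N)."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(extract_local, partitions))


def extract_n_to_1(partitions):
    """Extract in parallel, then merge sequentially into one data set (N-to-1)."""
    merged = []
    for part in extract_n_to_n(partitions):
        merged.extend(part)  # this merge is the remaining sequential step
    return merged


rows = [{"serial": i, "value": i * 10} for i in range(20)]
parts = partition_rows(rows, 4)
parallel_sets = extract_n_to_n(parts)
single_set = extract_n_to_1(parts)
print([len(p) for p in parallel_sets])  # [5, 5, 5, 5]
print(len(single_set))                  # 20
```

In the N-to-N case no step touches all of the data, which is why that variant avoids the single-node write bottleneck.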

Figure 2: Parallel Extraction into Parallel Data Sets

As shown in Figure 3, Orchestrate provided near-linear scalability of the data extraction process. (Perfect linear scalability would result if doubling the number of processors cut the runtime exactly in half, quadrupling the number of processors cut the runtime to one quarter, and so on.) The black curve represents the actual data points. The heavy gray curve shows the theoretical scalability limit that exists for applications with no sequential bottlenecks.

Figure 3: Scalability of Parallel Extraction to Sequential SAS Data Set

In the variation shown in Figure 2, the sequential step that merged the data into the single SAS data set is replaced with multiple parallel steps that move the data into parallel data sets. Table 1 shows the performance results of the parallel extraction from the parallel database, both to a single SAS data set (labeled N-to-1) and to parallel data sets (labeled N-to-N). The column headings "High Node" and "Thin Node" designate the data mart node type.

Table 1: Runtimes for Extraction of Data from DB2 UDB Database

                          High Node   Thin Node
Sequential Extraction
  32-way to 1             3:16:34     2:48:42
Parallel Extraction
  4-way to 1              30:12       30:10
  8-way to 1              16:01       16:01
  16-way to 1             08:35       08:48
  32-way to 1             06:39       05:13
  30-way to 30            N/A         04:07
  32-way to 32            04:17       N/A

Notice the times for the 32-way-to-1 sequential extraction of the data to a single SAS data set: 3:16:34 and 2:48:42, for the high and thin nodes, respectively. Compare these times to those for a 32-way-to-1 parallel extraction to a single SAS data set: 06:39 and 05:13. The difference in these run times is the time it takes to converge a 32-way table to a single file stream inside the database (sequential extraction) vs. outside the database (parallel extraction). These results demonstrate that Orchestrate is extremely efficient at streaming large volumes of data.
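The comparison above can be checked with a short calculation. The sketch below uses the 32-way-to-1 runtimes quoted in the text and the stated table size (5 million rows of 304 bytes); the helper name `to_seconds` and the throughput figure in decimal megabytes per second are illustrative choices, not part of the original study.

```python
def to_seconds(stamp: str) -> int:
    """Convert 'h:mm:ss' or 'mm:ss' to seconds."""
    total = 0
    for field in stamp.split(":"):
        total = total * 60 + int(field)
    return total


ROWS = 5_000_000
BYTES_PER_ROW = 304
volume_bytes = ROWS * BYTES_PER_ROW  # 1,520,000,000 bytes, i.e. 1.52 GB

# 32-way-to-1 runtimes: (sequential extraction, parallel extraction)
runs = {
    "high": ("3:16:34", "06:39"),
    "thin": ("2:48:42", "05:13"),
}

for node, (seq, par) in runs.items():
    seq_s, par_s = to_seconds(seq), to_seconds(par)
    ratio = seq_s / par_s
    mb_per_s = volume_bytes / par_s / 1e6
    print(f"{node}: {seq_s} s -> {par_s} s, {ratio:.1f}x faster, {mb_per_s:.2f} MB/s")
```

On these figures the parallel extraction is roughly 29.6 times faster than the sequential one on the high node and roughly 32.3 times faster on the thin node.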
Next, notice the drop in time in each of the columns as the extractions progress from the "4-way-to-1" extraction to the "32-way-to-1" extraction. In this progression, the same volume of data is extracted from a table that is spread out over 4, 8, 16, and then 32 nodes of the database. As the number of processors used to extract the data increases, performance begins to deviate from 1:1 scalability, and even more so after 16 nodes. This deviation results from the inevitable sequential bottleneck created when the extracted data merges into a single sequential SAS file. With many parallel clients, parallel extraction is so fast that regardless of whether data streams from 30, 32, or even 128 extraction nodes, a bottleneck occurs when writing data to disk on a single node.

In Table 1, the difference in runtime between the "32-way-to-1" extraction time (06:39) and the "32-way-to-32" extraction time (04:17) is basically the time that it takes to write the data to a single file. By writing to multiple parallel data sets instead of to a single sequential SAS data set, Orchestrate increased extraction performance to provide almost perfect scalability. Orchestrate allows you to avoid sequential bottlenecks in production-mode applications by loading data into parallel data sets. These parallel data sets can, in turn, be fed directly to parallel instances of the SAS application.

TEST TWO: PARALLEL SAS APPLICATION

This second test measured the effects of parallelizing a large commercial inventory-forecasting model, a SAS program with 526 lines of code. The code covered 20 DATA steps and 33 PROC steps, including linear regression, freq, sort, SQL, summary, transpose, and more. This application currently runs in sequential mode at an IBM customer site. The customer stores the data in a flat file. The volume of data processed was held constant during all phases of the test. The execution of a complex sequential SAS application against multiple gigabytes of data can be a very time-consuming operation.
Because every record must pass through the same CPU, the resulting bottleneck of sequential processing can far exceed even that observed in sequential database extraction. Speed is constrained not only by the rate at which data can stream through a single node, but also by the speed of the processor.

Segment 1: Sequential read

In the first segment of this test, Orchestrate took the sequential flat file and converted it to a SAS data set. Orchestrate then converted the SAS data set into parallel SAS data sets by partitioning it onto multiple nodes of the system. The difference in performance between writing the data sets to high nodes vs. writing them to thin nodes was then measured. This segment requires a very small fraction of the time needed to run the actual SAS application. The output from this segment was used as input to the next step, the execution of the SAS application.

Segment 2: Parallel execution

In the second segment of the test, Orchestrate executed the SAS model in parallel. The more nodes executing the SAS model, the shorter the run time. Because each CPU needed to process only the data local to its node, and the data was divided up equally among N nodes (where N = 1, 2, 4, 8, 16, or 32), each CPU had to process only 1/Nth of the total data. Therefore, the whole model executed in roughly 1/Nth of the time that had been required to run the SAS model sequentially. After processing, Orchestrate wrote the SAS data sets out to disk in parallel.

Segment 3: Sequential write

In the final segment of the test, the parallel data set created by the SAS model was read into SAS in parallel, and then merged into a single sequential SAS data set, the output most easily inspected by the data analyst.
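The three segments follow a partition, process, and merge pattern that can be sketched as below. This is an illustrative sketch only: the record layout (`item_id`, `demand`), the per-item total standing in for the forecasting model, and all function names are assumptions. Python threads are used to show the structure; they do not by themselves give CPU parallelism for this kind of work.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(records, n_nodes):
    """Segment 1: split records so every BY group lands on exactly one node."""
    parts = [[] for _ in range(n_nodes)]
    for rec in records:
        parts[rec["item_id"] % n_nodes].append(rec)
    return parts


def total_by_item(records):
    """Stand-in for the SAS model: sum demand per item (a BY-group computation)."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["item_id"]] += rec["demand"]
    return dict(totals)


def run_parallel(records, n_nodes):
    parts = partition_by_key(records, n_nodes)
    # Segment 2: the same step runs on every partition independently.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partial_results = list(pool.map(total_by_item, parts))
    # Segment 3: merge the per-node outputs into one result.
    merged = {}
    for result in partial_results:
        merged.update(result)  # safe: no item appears on two nodes
    return merged


records = [{"item_id": i % 6, "demand": i} for i in range(30)]
# The partitioned run gives the same answer as the sequential run.
assert run_parallel(records, 4) == total_by_item(records)
```

Because each BY group is confined to one partition, the workers never need to exchange data, which matches the no-communication property described for BY-group processing.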
Results of Parallel SAS Application Test

Table 2: Parallel Application Runtimes

  N          High node   Thin node
Segment 1: flat file to N-way
  1-way      12:02       10:17
  2-way      07:34       07:43
  4-way      06:41       07:07
  8-way      07:15       07:17
  16-way     N/A         07:26
  32-way     N/A         07:00
Segment 2: parallel SAS application
  1-way      3:19:02     1:42:29
  2-way      1:10:34     36:36
  4-way      31:23       15:56
  8-way      15:23       07:08
  16-way     N/A         04:19
  32-way     N/A         02:36
Segment 3: N-way to 1-way
  1-way      01:01       01:19
  2-way      01:00       01:01
  4-way      01:03       01:02
  8-way      01:06       01:04
  16-way     N/A         01:08
  32-way     N/A         01:15

Table 2 shows the results for the parallel SAS application test. The second segment runs the actual SAS application and is therefore the most time-consuming and CPU-intensive. This segment actually scales super-linearly: as the number of processors doubles, the run time is cut by more than a factor of two. Figure 4 shows a composite bar chart for the run time of the entire application as a function of the number of nodes executing the application.

Figure 4: SAS Application Runtime vs. Number of Processors

To capture data on the runtimes of each segment of the execution, data was written to disk. The parallel model was run on 1, 2, 4, 8, 16, and 32 thin nodes of the RS/6000 SP and on 1, 2, 3, 4, 5, 6, 7, and 8 CPUs of the SMP. Running the parallel model on one node is equivalent to running the sequential model.

Pipelined Parallelism

In a normal SAS application, each SAS step executes until completion, passing all its data at once to the next step, which executes to completion, and so on. This means that at any one time, most segments are "blocked" as they wait for the previous segment to complete. In contrast, applications containing multiple segments, such as the SAS application discussed here, can take advantage of "pipelined parallelism." With pipelined parallelism, data flows into and out of each segment continuously, so that each segment of the application is always processing data as long as input data is available.
A segment will be inactive only if there is no input data available to process.

Figure 4 shows that as processors were added, the runtime of segment 2 decreased dramatically, while the runtimes of the sequential segments (1 and 3) remained the same. Programs that have sequential bottlenecks, like those found in segments 1 and 3, cannot scale unless the sequential bottleneck from reading from and writing to sequential data sets is removed. Orchestrate can remove the sequential bottleneck by creating parallel data sets when extracting data from the database. In this way, Orchestrate eliminates all the run time associated with segment 1, making the application as a whole scale even better.
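The effect of the sequential segments on overall scaling can be worked out from the thin-node column of Table 2. The sketch below sums the three segments for each node count and compares the speedup of segment 2 alone with the speedup of the whole job; the helper name `to_seconds` and the layout of the `THIN` dictionary are illustrative assumptions.

```python
def to_seconds(stamp: str) -> int:
    """Convert 'h:mm:ss' or 'mm:ss' to seconds."""
    total = 0
    for field in stamp.split(":"):
        total = total * 60 + int(field)
    return total


NODES = [1, 2, 4, 8, 16, 32]

# Thin-node runtimes from Table 2, one entry per node count in NODES.
THIN = {
    "segment 1": ["10:17", "07:43", "07:07", "07:17", "07:26", "07:00"],
    "segment 2": ["1:42:29", "36:36", "15:56", "07:08", "04:19", "02:36"],
    "segment 3": ["01:19", "01:01", "01:02", "01:04", "01:08", "01:15"],
}

seconds = {name: [to_seconds(t) for t in times] for name, times in THIN.items()}
# Total runtime per node count: sum the three segments column by column.
totals = [sum(col) for col in zip(*seconds.values())]

for i, n in enumerate(NODES):
    seg2_speedup = seconds["segment 2"][0] / seconds["segment 2"][i]
    total_speedup = totals[0] / totals[i]
    print(f"{n:>2}-way: segment 2 {seg2_speedup:5.1f}x, whole job {total_speedup:5.1f}x")
```

On these figures segment 2 reaches about 14.4x on 8 nodes and about 39.4x on 32 nodes, while the whole job reaches only about 10.5x on 32 nodes, because segments 1 and 3 stay near 7 minutes and 1 minute regardless of node count.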

Figure 5 shows that, up to eight processors, segment 2 exhibited super-linear speedup.

Figure 5: Super-Linear Speedup of SAS Forecasting Model

Complex analysis applications run against large volumes of data can be very memory-intensive. When memory limits are exceeded, virtual memory paging and even thrashing can occur. By distributing data in parallel, Orchestrate can provide super-linear speedup because it reduces the memory requirements on any single node below the threshold at which these slow processes occur. Figure 5 provides a dramatic illustration of how distributing data over many nodes can lighten the load placed on the memory of each individual node and thereby reduce the overall work performed by the system.

CONCLUSION

The benchmark performance studies of Torrent Systems' Orchestrate undertaken at the IBM Teraplex Center have demonstrated that scalable performance can be achieved when executing database extracts and SAS applications in the Orchestrate parallel environment on IBM SMP and MPP hardware. Several significant conclusions can be drawn from these results:

- Extracts from a parallel database should be done in parallel: Because sequential extraction requires that data distributed across a parallel table be streamed from the various table nodes to a coordinator node, and then streamed out of the database, the flow of data is bottlenecked by the rate at which the coordinator node within the database can read and then write the data. Extracting in parallel eliminates the need to stream data within the database and allows almost linear scalability.

- Extracted data should be written to parallel data sets: Writing parallel data to a single data set creates a sequential bottleneck at the point of convergence. Orchestrate can write to multiple parallel data sets instead, increasing extraction performance to provide almost perfect scalability.

- Executing SAS applications in parallel yields linear and even super-linear scalability: Because Orchestrate distributes processing across all the available processors, it overcomes the memory and disk constraints that arise in complex sequential applications run against large volumes of data. Super-linear scalability can result when Orchestrate distributes memory-intensive applications across multiple nodes, thereby optimizing the memory processing.

- Re-partitioning a parallel database can be done simply and efficiently, but not with traditional parallel programming techniques: In any environment, it is necessary to periodically add nodes or reconfigure a system. Without a tool like Orchestrate, it can be a daunting task to re-partition a parallel database to take advantage of the additional nodes. Orchestrate's ability to stream large data sets in parallel into and out of a parallel RDBMS allows redistribution of tables within the database very simply and easily. For example, the database used for this test was re-partitioned repeatedly in order to measure processing scalability with different node configurations. Finally, as the needs of database users evolve, the ability to dynamically re-partition tables "on the fly" for the purpose of executing a particular operation will increase in value.

All Parallel, All The Time

Organizations performing data extract and refinement, data warehouse and data mart loading, models, analysis, data mining, and other such large-scale batch jobs now have a cost-effective way to handle growth in data volume and application complexity. By implementing end-to-end parallelism using parallel hardware, parallel databases, parallel applications, and the Orchestrate for SAS development and runtime environment from Torrent Systems, users can create an enterprise environment that will scale to many times its sequential capacity. And because Torrent's software shields programmers from the complexities of parallel programming, end-to-end parallelism, with the enormous benefits it brings, is available to all IT organizations without the need to invest in parallel programmers.
