Leveraging Mainframe Data in Hadoop
- Edward Ross
- 5 years ago
1 Leveraging Mainframe Data in Hadoop. Frank Koconis - Senior Solutions Consultant; Glenn McNairy - Account Executive
2 Agenda: Introductions; The Mainframe: The Original Big Data Platform; The Challenges of Ingesting and Using Mainframe Data on Hadoop; Mainframe-Hadoop Data Integration Goals; Mainframe-to-Hadoop Migration / Integration Options; Syncsort and DMX-h; Live DMX-h Demo; Q & A
3 The Mainframe: The Original Big Data Platform. Mainframes handle over 70% of all OLTP transactions. They have a long, proven track record - over 60 years! They are reliable, operating continuously with zero downtime for years. They are secure: access is tightly restricted and managed.
4 Mainframes Still Process Vast Amounts of Vital Data: Top 25 World Banks; 9 of the World's Top Insurers; 23 of the Top 25 US Retailers
5 But now our organization is implementing Hadoop. Hadoop is the new Big Data platform, and the goal is for the Hadoop cluster to be the single central location for ALL data (the "Data Lake"). According to Wikipedia, this should be "the single store of all data in the enterprise ranging from raw data to transformed data"*. So you need to bring in all of the organization's data sources - and that includes the mainframe. The mainframe has vital data that you cannot afford to ignore when building your data lake. *-
6 Enterprise Data Lake Without Mainframe Data = Missed Opportunity
7 The Challenges of Using Mainframe Data in Hadoop
Mainframe knowledge and skills are difficult to find: the mainframe workforce is aging rapidly, knowledge of existing designs and code may no longer be available, and young developers almost never learn mainframe skills.
Security and connectivity issues: mainframes have a highly controlled security environment, and installation of data-extraction utilities or programs may be forbidden. The mainframe is mission-critical, so no action can be taken that could cause downtime.
Mainframe data looks VERY different from data on Windows, Linux or UNIX. This is so important, it deserves its own slide.
8 The Biggest Challenge: Mainframe Data Formats
Mainframe files are not like files in Windows, Linux or UNIX. There is no such thing as a delimited text file on the mainframe; file types include fixed-record, variable-record, VSAM and others.
The mainframe uses EBCDIC rather than ASCII, but it's not that simple: text values are EBCDIC, but many numeric values are not, so simple EBCDIC-to-ASCII conversion WILL NOT WORK.
Mainframe files can have VERY complex record structures. Records may be very wide, containing hundreds or thousands of fields, and are usually not flat: they often have sub-records and arrays (COBOL OCCURS groups), which may be nested many levels deep. Often, a range of bytes in a record is used in several different ways (COBOL REDEFINES), which means that the data looks different between records in the same file(!).
Record layouts are defined by COBOL copybooks; here are examples.
9 COBOL Copybook Example #1
Simple example of a COBOL copybook which defines a record layout:

       ** SALES ORDERS FILE
       01 SLS-ORD-FILE.
          05 CUSTOMER-ACCOUNT-NUMBER  PIC S9(9) COMP-3.
          05 ORDER-NUMBER             PIC X(10).
          05 ORDER-DETAILS.
             10 ORDER-STATUS          PIC X(1).
             10 ORDER-DATE            PIC X(10).
             10 ORDER-PRIORITY        PIC X(15).
             10 CLERK                 PIC X(15).
             10 SHIPMENT-PRIORITY     PIC S9(4) COMP-3.
             10 TOTAL-PRICE           PIC 9(7)V99 COMP-3.
             10 COMMENT-COUNT         PIC 9(2).
             10 COMMENT               PIC X(80) OCCURS 0 TO 99 TIMES
                                         DEPENDING ON COMMENT-COUNT.

The COMP-3 fields are packed decimal (not EBCDIC!), so EBCDIC-to-ASCII conversion would corrupt them. COMMENT is a variable-length array: the number of elements depends on the value of COMMENT-COUNT, so the size of this array will vary from record to record.
10 COBOL Copybook Example #2 (more complex)

       01 LN-HST-REC-LHS.
          05 HST-REC-KEY-LHS.
             10 BK-NUM-LHS            PIC S9(5) COMP-3.
             10 APP-LHS               PIC S9(3) COMP-3.
             10 LN-NUM-LHS            PIC S9(18) COMP-3.
             10 LN-SRC-LHS            PIC X.
             10 LN-SRC-TIE-BRK-LHS    PIC S9(5) COMP-3.
             10 EFF-DAT-LHS           PIC S9(9) COMP-3.
             10 PST-DAT-LHS           PIC S9(9) COMP-3.
             10 PST-TIM-LHS           PIC S9(7) COMP-3.
             10 TRN-COD-LHS           PIC S9(5) COMP-3.
             10 SEQ-NUM-LHS           PIC S9(5) COMP-3.
          05 LN-HST-REC-DTL-LHS.
             10 VLI-LHS               PIC S9(4) COMP.
             10 HST-REC-DTA-LHS.
                15 INP-SRC-COD-LHS    PIC S9(3) COMP-3.
                15 TRN-TYP-IND-LHS    PIC X.
                15 BAT-NUM-LHS        PIC S9(7) COMP-3.
                15 BAT-TIE-BRK-LHS    PIC X(3).
                15 BAT-ITM-NUM-LHS    PIC X(9).
                15 TML-NUM-LHS        PIC X(9).
                15 OPR-ID-LHS         PIC X(8).
                15 HST-ADL-IND-LHS    PIC X(1).
                15 HST-REV-IND-LHS    PIC X(1).
                15 TRN-AMT-LHS        PIC S9(9)V99 COMP-3.
                15 HST-DES-LHS        PIC X(25).
                15 CUR-PRC-DAT-LHS    PIC S9(9) COMP-3.
                15 REF-NUM-LHS        PIC X(3).
                15 INT-FEE-FLG-LHS    PIC X(1).
                15 UDF-L01-LHS        PIC X.
                15 PMT-HLD-DAY-LHS    PIC S9(3) COMP-3.
                15 AUH-NUM-LHS        PIC S9(5) COMP-3.
                15 CUR-LN-BAL-LHS     PIC S9(9)V99 COMP-3.
                15 ITM-CNT-LHS        PIC S9(2) COMP-3.
                15 PYF-COF-REA-COD-LHS PIC X(3).
             10 HST-TRN-ADL-DTA-LHS   PIC X(240).

There are several different ways that data may be stored in this 240-byte area:

             10 HST-TRN-RDF-1-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
                15 HST-TRN-DTA-1-LHS OCCURS 20 TIMES.
                   20 SPR-TRN-COD-LHS PIC S9(5) COMP-3.
                   20 SPR-TRN-REF-LHS PIC X(3).
                   20 SPR-TRN-AMT-LHS PIC S9(9)V99 COMP-3.
             10 HST-TRN-RDF-2-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
                15 HST-TRN-DTA-2-LHS.
                   20 OLD-NMN-DTA-LHS PIC X(40).
                   20 NEW-NMN-DTA-LHS PIC X(40).
                   20 DAT-TO-DSB-LHS  PIC S9(9) COMP-3.
                   20 RPT-BK-NUM-LHS  PIC S9(5) COMP-3.
                   20 RPT-APP-LHS     PIC S9(3) COMP-3.
                   20 RPT-LN-NUM-LHS  PIC S9(18) COMP-3.
                   20 CMB-PMT-PTY-LHS PIC S9(3) COMP-3.
             10 HST-TRN-RDF-3-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
                15 HST-TRN-DTA-3-LHS.
                   20 HST-OLD-RT-LHS  PIC SV9(5) COMP-3.
                   20 HST-NEW-RT-LHS  PIC SV9(5) COMP-3.
             10 HST-TRN-RDF-4-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
                15 HST-TRN-DTA-4-LHS.
                   20 VSI-PMT-AMT-LHS PIC S9(7)V99 COMP-3.
                   20 VSI-INT-AMT-LHS PIC S9(7)V99 COMP-3.
                   20 VSI-TRM-LHS     PIC S9(3) COMP-3.
                   20 INS-REF-NUM-LHS PIC X(3).
                   20 STR-DAT-VSI-LHS PIC S9(9) COMP-3.
             10 HST-TRN-RDF-5-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
                15 HST-TRN-DTA-5-LHS.
                   20 NUM-MO-EXT-LHS  PIC S9(3) COMP-3.
                   20 CLC-EXT-FEE-AMT-LHS PIC S9(5)V99 COMP-3.
                   20 EXT-REA-LHS     PIC X(1).
             10 HST-TRN-RDF-6-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
                15 HST-TRN-DTA-6-LHS OCCURS 11 TIMES.
                   20 ASD-BK-NUM-LHS  PIC S9(5) COMP-3.
                   20 ASD-APP-LHS     PIC S9(3) COMP-3.
                   20 ASD-LN-NUM-LHS  PIC S9(18) COMP-3.
                   20 PMT-AMT-LHS     PIC S9(9)V99 COMP-3.
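The copybook examples above show why a blind byte-level EBCDIC-to-ASCII pass fails: COMP-3 fields are packed decimal, not text. A minimal Python sketch of the difference follows; the sample bytes and the helper name are illustrative assumptions, not anything from the slides.

```python
import codecs

# EBCDIC *text* converts cleanly with a code-page translation (cp037 here)
assert codecs.decode(bytes([0xC6, 0xD9, 0xC1, 0xD5, 0xD2]), "cp037") == "FRANK"

def unpack_comp3(raw: bytes) -> int:
    """Decode an IBM packed-decimal (COMP-3) field: two digits per byte,
    with the final nibble holding the sign (0xD = negative)."""
    digits = []
    for b in raw[:-1]:
        digits += [b >> 4, b & 0x0F]
    digits.append(raw[-1] >> 4)          # last byte: one digit + sign nibble
    sign = -1 if (raw[-1] & 0x0F) == 0x0D else 1
    value = 0
    for d in digits:
        value = value * 10 + d
    return sign * value

# A PIC S9(9) COMP-3 field such as CUSTOMER-ACCOUNT-NUMBER occupies 5 bytes
acct = bytes([0x00, 0x12, 0x34, 0x56, 0x7C])   # packed +001234567
assert unpack_comp3(acct) == 1234567
# Running a code-page translation over those same bytes would yield garbage,
# which is why EBCDIC-to-ASCII conversion corrupts packed-decimal fields.
```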
11 Mainframe-Hadoop Data Integration Goals
1) Making mainframe data available and usable on the cluster: interpretation and conversion of mainframe data formats; data validation and cleansing; integration of mainframe data with non-mainframe sources; use of mainframe data for data warehousing and BI
2) Reducing mainframe costs for storage and/or CPU: low-cost archival or backup of mainframe data in its native format; processing mainframe data on the cluster (yes, it can be done!)
12 Mainframe-to-Hadoop Migration / Integration Options
So, what tools can be used for mainframe-Hadoop integration? The free open-source tools that come with Hadoop; open-source conversion code generators (JRecord and LegStar); mainframe-based migration tools; legacy-ETL vendors; Syncsort DMX-h. Let's look at the capabilities of each of these.
13 Integration Option: Open-source Hadoop Tools
Standard Hadoop tools are used to convert mainframe data to ASCII delimited-text format and process it. This is often the obvious choice because these tools come with Hadoop.
Steps to integrate ONE mainframe data file:
1) Copy the file from the mainframe to the edge node (using FTPS or a similar tool)
2) Execute a custom program (usually Java) to decompose complex record structures, convert mainframe data types to delimited text file(s) and write to HDFS
3) Delete the copy of the mainframe file on the edge node
4) Execute a custom data-validation/cleansing process using MapReduce or Spark on the cluster (normally Java or Hive)
5) Execute a custom MapReduce or Spark process to integrate or load into the final target (Data Lake, RDBMS, NoSQL database, etc.)
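Step 2 above is where most of the hand-written code goes: a converter that slices each fixed-length record at copybook offsets and emits delimited text. Here is a hedged Python sketch of that pattern; the layout, offsets and delimiter are invented for illustration, and a real converter would also decode COMP-3 and other numeric types rather than only EBCDIC text.

```python
import codecs

# Hypothetical fixed-record layout: (field name, start offset, end offset)
LAYOUT = [("ORDER-NUMBER", 0, 10), ("ORDER-STATUS", 10, 11), ("CLERK", 11, 26)]
RECORD_SIZE = 26

def record_to_delimited(record: bytes, delimiter: str = "|") -> str:
    """Slice one EBCDIC record by the layout and emit a delimited-text row."""
    assert len(record) == RECORD_SIZE
    fields = [codecs.decode(record[s:e], "cp037").strip() for _, s, e in LAYOUT]
    return delimiter.join(fields)

# Demo record, built from ASCII then encoded to EBCDIC for the example
rec = codecs.encode("ORD0000001" + "A" + "SMITH".ljust(15), "cp037")
assert record_to_delimited(rec) == "ORD0000001|A|SMITH"
```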
14 Integration Option: JRecord and LegStar
These are open-source code generators for file-format conversion. JRecord uses the CopybookLoader class to interpret COBOL record layouts. With LegStar, the developer must use its Cobol Transformer Generator to create a COBOL-to-XML translator, then call that translator in his/her program.
Steps to integrate ONE mainframe data file:
1) Copy the file from the mainframe to the edge node (using FTPS or a similar tool)
2) Execute a custom Java program to convert the mainframe file by calling methods of the CopybookLoader class (JRecord) or calling the file-specific COBOL-to-XML translator (LegStar) and write to HDFS. LegStar only: convert the XML output to delimited text file(s)
3) Continue with step #3 on the previous slide
15 Open-source Options: Pros and Cons
The one advantage of these options is that they are free. Ironically, the primary disadvantage of these free tools is cost.
Development effort is very high: a very large amount of custom coding is required, and a custom program is needed for each source file which cannot be re-used.
Lack of support: it is difficult and expensive to find, hire and retain skilled developers.
Complex mainframe record types are a challenge. Standard Hadoop tools have no easy way to handle complex records; with JRecord, the Java method calls can get very tricky; with LegStar, the COBOL Transformer Generator has limits.
Not future-proof: a Java program is written for a specific execution framework such as MapReduce or Spark - what will you do when another one comes?
16 Integration Option: Mainframe-based Tools
These are migration tools that run in zLinux on the mainframe system. They are able to ingest and convert mainframe file formats from z/OS, and results are written to HDFS or a database.
Advantages: does not stage data on the edge node.
Disadvantages: data validation and data quality checks require custom code; integration with other data sources requires custom code; the conversion process runs on the mainframe, not on commodity hardware.
17 Integration Option: Legacy-ETL Vendors
Many legacy ETL vendors now offer Hadoop versions, able to read mainframe files and write to HDFS. The primary advantage is the existing skill set of ETL developers ("the devil you know").
Disadvantages: very high cost; may have difficulty with very complex mainframe record structures; require a dedicated metadata repository, which is a single point of failure and becomes a performance bottleneck; do not process natively on the cluster. Some work only on the edge node, and those that work on the cluster are code generators (Java or Hive), so performance and scalability are limited.
18 The Best Option: Syncsort's DMX-h
Create complete mainframe-Hadoop integration solutions, including data validation and integration with other sources. Easy-to-use development GUI with no coding and a very short learning curve. Supports very complex mainframe record structures. Native execution on the cluster (NO code generation!) with superior performance. Runs on all major Hadoop distributions. "Future-proof": run ETL jobs on MapReduce, Spark or a future framework with no changes. So let's find out more about the company Syncsort and DMX-h.
19 Who is Syncsort?
Syncsort is a leading Big Data company that has been in the high-volume data business for over 45 years. Syncsort has successfully transformed its business model from the mainframe era to the age of Hadoop. Syncsort developed DMX, which benefits from the algorithms and coding efficiencies developed from its mainframe heritage.
20 Syncsort Products
Mainframe Solutions - MFX: High-performance Sort for System z; zIIP Offload for Copy; Hadoop Connectivity for Mainframe. Gold-standard sort technology for over four decades, saving customers millions each year over competitive sort solutions.
Linux/UNIX & Windows - DMX: High-performance ETL; SQL Analysis & Migration; ETL for Business Intelligence; Mainframe Re-hosting. Full-featured data integration software that helps organizations extract, transform and load more data in less time, with fewer resources and less cost.
Hadoop Solutions - DMX-h: Hadoop ETL; ETL for Business Intelligence; Mainframe-Hadoop Integration. A smarter approach to Hadoop ETL: easier to develop, faster, lower-cost and future-proof.
21 DMX-h Installation Architecture
The development GUI (the DMX-h Job Editor and Task Editor) is installed on Windows workstations. To execute on the cluster, the GUI sends a request to the DMX-h agent on the edge node (Linux). The DMX engine is installed on the Windows workstation AND the edge node AND all cluster data nodes, allowing job execution anywhere.
22 DMX-h Mainframe-Hadoop Integration Features
Mainframe file conversion and processing: fixed-record, variable-record and VSAM files; mainframe DB2 tables; EBCDIC text and mainframe numeric types (COMP- types); complex record structures, nested to any depth (REDEFINES, OCCURS and OCCURS DEPENDING ON)
Secure transfer from the mainframe using FTPS and Connect:Direct
Support for mainframe file compression, saving storage and time
No need to stage data on the edge node
Ability to store and process mainframe data in HDFS in its native format, without conversion(!), when desired
Easy integration of mainframe data with other sources
23 How Easy is it to Interpret a Mainframe File?
I'll demonstrate using my laptop. The use case: we have been given a mainframe file and the COBOL copybook containing the record layout. The only 2 things that we have been told are that it is a fixed-record file and that the record size is
24 Use DMX-h to Easily Integrate Mainframe Data With
25 Mainframe-Hadoop Integration Use Cases
Getting and interpreting the data (with no staging!): reading from the mainframe; conversion from mainframe formats (when desired); data validation and cleansing; writing to the cluster target
Processing and data integration: joins and lookups to cluster and non-cluster sources; normalization & aggregation
Publishing and exporting: load external data warehouses (Oracle, Teradata, DB2, SQL Server, etc.); efficiently generate data extracts for BI users; generate native files for Tableau and QlikView
Storing and processing data in mainframe-native format: only DMX-h can do this! (more info later)
26 Use Case: Mainframe Data Ingestion
A DMX-h job running on the edge node* can connect to both HDFS and an external data source (such as the mainframe). This uses no disk space on the edge node, so there is no limit on file size! This also works for any external source or database, even if it is remote, and the source file can even be compressed. Format conversion and data validation can be done within the same job.
* Can also be done using ANY node on the cluster, if network connectivity allows
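The "no disk space on the edge node" claim boils down to streaming: each chunk read from the source connection is written straight to the HDFS output stream, so the full file is never staged locally. A rough, framework-neutral Python sketch of the idea (the function and the in-memory stand-ins are my illustration, not DMX-h's API):

```python
import io

def stream_copy(read_chunk, sink, chunk_size=64 * 1024):
    """Copy source to sink chunk-by-chunk; only one chunk is ever held in
    memory, and nothing is written to local disk."""
    total = 0
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total

source = io.BytesIO(b"x" * 200_000)   # stand-in for the mainframe FTPS stream
target = io.BytesIO()                 # stand-in for an HDFS output stream
assert stream_copy(source.read, target) == 200_000
assert target.getvalue() == b"x" * 200_000
```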
27 Processing in the Cluster Using DMX-h
Once data is in the cluster, additional DMX-h jobs can transform it. The developer defines the operations to be performed (join, lookup, aggregate, filter, reformat, etc.); there is no need to know the details of MapReduce or Spark. DMX-h Intelligent Execution (IX) automatically runs the jobs on the cluster. DMX-h jobs run natively on all cluster nodes with no code generation, because the DMX engine is installed on all nodes. This is more efficient than Hive and other ETL tools, which generate Java code. Cluster nodes work concurrently, making the process highly scalable.
28 DMX-h Intelligent Execution on Hadoop
DMX-h has a feature called Intelligent Execution (IX) which automatically runs ETL jobs on the Hadoop cluster. The DMX engine is installed on all nodes in the cluster, so the transformations run natively, with no extra code-generation step. IX works when the job runs, not at design time. It currently supports MapReduce and Spark, and it could support other execution frameworks in the future with no changes to your DMX-h jobs in production. So this means that the SAME DMX-h job can run on your Windows laptop (useful during development for unit testing), on an edge node or any single cluster node, on the cluster using MapReduce, or on the cluster using Spark.
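Conceptually, Intelligent Execution separates what a job does from where it runs: the transformation is defined once and the execution framework is chosen at run time. This toy Python sketch (my illustration, not DMX-h internals) shows the same job producing identical results under a local runner and a partitioned, cluster-style runner:

```python
def job(records):
    """Framework-neutral transformation: filter, then reformat."""
    return [r.upper() for r in records if r.startswith("a")]

def run_local(job, data):
    """Run the whole job in one process (e.g. a laptop, for unit testing)."""
    return job(data)

def run_partitioned(job, data, partitions=2):
    """Stand-in for a distributed runner: apply the same job per partition."""
    size = max(1, (len(data) + partitions - 1) // partitions)
    out = []
    for i in range(0, len(data), size):
        out.extend(job(data[i:i + size]))
    return out

data = ["alpha", "beta", "apple", "cedar"]
assert run_local(job, data) == run_partitioned(job, data) == ["ALPHA", "APPLE"]
```

The design point is that swapping `run_local` for `run_partitioned` (or, in DMX-h's case, MapReduce for Spark) requires no change to the job definition itself.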
29 Processing Native-Mainframe Data on Hadoop (!)
Using DMX-h, it is actually possible to store and process mainframe data on Hadoop in its original native-mainframe format. DMX-h can even write mainframe-format target files - no other tool can do this! Sometimes this is a great idea; for example, you can:
Use HDFS to archive mainframe datasets (MUCH cheaper than DASD). Because the data is 100% unchanged, it will pass any auditing requirement.
Quickly move mainframe datasets to Hadoop. Sometimes you do not have time or resources for a conversion project, or you may not immediately know which data fields will need to be used; the data can be moved, unchanged, and converted later.
Transform the native-mainframe data using MapReduce or Spark. The results can even be moved back to the mainframe and used there! This allows you to offload CPU from the mainframe, reducing MIPS cost.
The bottom line is that DMX-h can convert your mainframe data or work with it in its native form, whichever makes sense for you.
30 DMX-h Live Demo
So let's see it actually work using some mainframe data.
31 DMX-h: Superior Performance and Easy Development
Study by Principled Technologies for Dell: a development comparison using DMX-h and open-source Hadoop tools across three different ETL processes (see table below). The open-source jobs were built by an experienced Hadoop developer; the DMX-h jobs were built by an entry-level developer with a few days of DMX-h training, and beat the performance of the open-source jobs on the same cluster.

Job execution time (minutes):
ETL Process                          | Open-source | DMX-h | DMX-h Advantage
Fact Dimension Load with Type-2 SCD  | 36:39       | 30:11 | 18%
Data Validation                      | 15:45       | 6:15  | 60%
Mainframe File Integration           | 5:51        | 4:48  | 18%

And DMX-h development was much quicker: the open-source jobs took the experienced developer 8.4 days, while the DMX-h jobs took the entry-level developer 3.8 days (54% less!).
32 Resources
Syncsort: Frank Koconis - Senior Solutions Consultant; Glenn McNairy - Account Executive
Development Comparison by Dell and Principled Technologies: determined that DMX-h enables easier and faster development, lower development cost, and better performance. These are links to the actual reports from the study.
JRecord
LegStar
More informationC Exam Code: C Exam Name: IBM InfoSphere DataStage v9.1
C2090-303 Number: C2090-303 Passing Score: 800 Time Limit: 120 min File Version: 36.8 Exam Code: C2090-303 Exam Name: IBM InfoSphere DataStage v9.1 Actualtests QUESTION 1 In your ETL application design
More informationMetaSuite : Advanced Data Integration And Extraction Software
MetaSuite Technical White Paper March, 2000 A Minerva SoftCare White Paper MetaSuite : Advanced Data Integration And Extraction Software WP-FPA-101998 Content CAPITALIZE ON YOUR VALUABLE LEGACY DATA 3
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationEvolving To The Big Data Warehouse
Evolving To The Big Data Warehouse Kevin Lancaster 1 Copyright Director, 2012, Oracle and/or its Engineered affiliates. All rights Insert Systems, Information Protection Policy Oracle Classification from
More informationComparison of SmartData Fabric with Cloudera and Hortonworks Revision 2.1
Comparison of SmartData Fabric with Cloudera and Hortonworks Revision 2.1 Page 1 of 11 www.whamtech.com (972) 991-5700 info@whamtech.com August 2018 Page 2 of 11 www.whamtech.com (972) 991-5700 info@whamtech.com
More informationHyperconverged Fabric
Use Case - Remote and Branch Office IT HiveIO Inc. 2018 All rights reserved Empower your Remote and Edge IT Landscapes with Hive FabricTM Remote and Branch Offices (ROBO) present a big challenge in terms
More informationPODIUM DATA SOURCE OVERVIEW DIFFERENTIATORS ARCHITECTURE & COMPONENTS DATA SOURCES EXECUTION & BEHAVIOR APPENDIX DATA SOURCE MANAGEMENT MODULE
PODIUM DATA SOURCE OVERVIEW DATA SOURCE MANAGEMENT MODULE DIFFERENTIATORS PODIUM IS A PLATFORM ADVANCED DATA INGESTION FIELD- LEVEL VISIBILITY SECURITY DATA VALIDATION AND PROFILING ARCHITECTURE & COMPONENTS
More informationDesigning your BI Architecture
IBM Software Group Designing your BI Architecture Data Movement and Transformation David Cope EDW Architect Asia Pacific 2007 IBM Corporation DataStage and DWE SQW Complex Files SQL Scripts ERP ETL Engine
More informationEMC Celerra CNS with CLARiiON Storage
DATA SHEET EMC Celerra CNS with CLARiiON Storage Reach new heights of availability and scalability with EMC Celerra Clustered Network Server (CNS) and CLARiiON storage Consolidating and sharing information
More informationAlexander Klein. #SQLSatDenmark. ETL meets Azure
Alexander Klein ETL meets Azure BIG Thanks to SQLSat Denmark sponsors Save the date for exiting upcoming events PASS Camp 2017 Main Camp 05.12. 07.12.2017 (04.12. Kick-Off abends) Lufthansa Training &
More informationOracle Data Integrator 12c: Integration and Administration
Oracle University Contact Us: Local: 1800 103 4775 Intl: +91 80 67863102 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive
More informationProgress DataDirect For Business Intelligence And Analytics Vendors
Progress DataDirect For Business Intelligence And Analytics Vendors DATA SHEET FEATURES: Direction connection to a variety of SaaS and on-premises data sources via Progress DataDirect Hybrid Data Pipeline
More informationPrivate Cloud Database Consolidation Alessandro Bracchini Sales Consultant Oracle Italia
Private Cloud Database Consolidation Alessandro Bracchini Sales Consultant Oracle Italia Private Database Cloud Business Drivers Faster performance Resource management Higher availability Tighter security
More informationAzure Data Factory VS. SSIS. Reza Rad, Consultant, RADACAD
Azure Data Factory VS. SSIS Reza Rad, Consultant, RADACAD 2 Please silence cell phones Explore Everything PASS Has to Offer FREE ONLINE WEBINAR EVENTS FREE 1-DAY LOCAL TRAINING EVENTS VOLUNTEERING OPPORTUNITIES
More informationBuilding Next- GeneraAon Data IntegraAon Pla1orm. George Xiong ebay Data Pla1orm Architect April 21, 2013
Building Next- GeneraAon Data IntegraAon Pla1orm George Xiong ebay Data Pla1orm Architect April 21, 2013 ebay Analytics >50 TB/day new data 100+ Subject Areas >100 PB/day Processed >100 Trillion pairs
More informationNatCDC/ NatCDCSP The Change Data Capture Solution For ADABAS
NatCDC/ NatCDCSP The Change Data Capture Solution For ADABAS Overview...2 Processing Overview... 3 Features... 3 Benefits... 4 NatCDC for Data Warehousing...6 Integration with Extraction, Transformation
More informationOracle Big Data. A NA LYT ICS A ND MA NAG E MENT.
Oracle Big Data. A NALYTICS A ND MANAG E MENT. Oracle Big Data: Redundância. Compatível com ecossistema Hadoop, HIVE, HBASE, SPARK. Integração com Cloudera Manager. Possibilidade de Utilização da Linguagem
More informationHow Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,
How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS
More informationADABAS & NATURAL 2050+
ADABAS & NATURAL 2050+ Guido Falkenberg SVP Global Customer Innovation DIGITAL TRANSFORMATION #WITHOUTCOMPROMISE 2017 Software AG. All rights reserved. ADABAS & NATURAL 2050+ GLOBAL INITIATIVE INNOVATION
More informationA Review Paper on Big data & Hadoop
A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College
More information<Insert Picture Here> MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure
MySQL Web Reference Architectures Building Massively Scalable Web Infrastructure Mario Beck (mario.beck@oracle.com) Principal Sales Consultant MySQL Session Agenda Requirements for
More information5 Fundamental Strategies for Building a Data-centered Data Center
5 Fundamental Strategies for Building a Data-centered Data Center June 3, 2014 Ken Krupa, Chief Field Architect Gary Vidal, Solutions Specialist Last generation Reference Data Unstructured OLTP Warehouse
More informationDQpowersuite. Superior Architecture. A Complete Data Integration Package
DQpowersuite Superior Architecture Since its first release in 1995, DQpowersuite has made it easy to access and join distributed enterprise data. DQpowersuite provides an easy-toimplement architecture
More informationData sources. Gartner, The State of Data Warehousing in 2012
data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing. Gartner, The State of Data Warehousing
More informationTeradata Analyst Pack More Power to Analyze and Tune Your Data Warehouse for Optimal Performance
Data Warehousing > Tools & Utilities Teradata Analyst Pack More Power to Analyze and Tune Your Data Warehouse for Optimal Performance By: Rod Vandervort, Jeff Shelton, and Louis Burger Table of Contents
More informationData Virtualization for the Enterprise
Data Virtualization for the Enterprise New England Db2 Users Group Meeting Old Sturbridge Village, 1 Old Sturbridge Village Road, Sturbridge, MA 01566, USA September 27, 2018 Milan Babiak Client Technical
More informationOracle Data Integrator 12c: Integration and Administration
Oracle University Contact Us: +34916267792 Oracle Data Integrator 12c: Integration and Administration Duration: 5 Days What you will learn Oracle Data Integrator is a comprehensive data integration platform
More informationReducing Costs and Risk with Enterprise Archiving
Reducing Costs and Risk with Enterprise Archiving Erno Rorive Staff System Engineer Information Intelligence Group 1 Key Challenges Enterprise Archiving Solution Meeting the Challenges Case Study Summary
More informationAb Initio Training DATA WAREHOUSE TRAINING. Introduction:
Ab Initio Training Introduction: Ab Initio primarily works with the best server-client model. It is considered to be the fourth generation platform, when it comes to data manipulation, data analysis and
More informationData Management Glossary
Data Management Glossary A Access path: The route through a system by which data is found, accessed and retrieved Agile methodology: An approach to software development which takes incremental, iterative
More informationData integration made easy with Talend Open Studio for Data Integration. Dimitar Zahariev BI / DI Consultant
Data integration made easy with Talend Open Studio for Data Integration Dimitar Zahariev BI / DI Consultant dimitar@zahariev.pro @shekeriev Disclaimer Please keep in mind that: 2 I m not related in any
More informationTECHED USER CONFERENCE MAY 3-4, 2016
TECHED USER CONFERENCE MAY 3-4, 2016 Bruce Beaman, Senior Director Adabas and Natural Product Marketing Software AG Software AG s Future Directions for Adabas and Natural WHAT CUSTOMERS ARE TELLING US
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationStreaming Integration and Intelligence For Automating Time Sensitive Events
Streaming Integration and Intelligence For Automating Time Sensitive Events Ted Fish Director Sales, Midwest ted@striim.com 312-330-4929 Striim Executive Summary Delivering Data for Time Sensitive Processes
More informationCOBOL-IT Compiler Suite
COBOL-IT Compiler Suite Enterprise Edition COBOL-IT Compiler Suite Enterprise Edition is an Enterprise COBOL Compiler Suite that is highly adapted to the needs of Enterprises with Mission Critical COBOL
More informationAccessibility Features in the SAS Intelligence Platform Products
1 CHAPTER 1 Overview of Common Data Sources Overview 1 Accessibility Features in the SAS Intelligence Platform Products 1 SAS Data Sets 1 Shared Access to SAS Data Sets 2 External Files 3 XML Data 4 Relational
More informationNormalized Relational Database Implementation of VSAM Indexed Files
Normalized Relational Database Implementation of VSAM Indexed Files Note: this discussion applies to Microsoft SQL Server, Oracle Database and IBM DB2 LUW. Impediments to a Normalized VSAM Emulation Database
More informationActual4Test. Actual4test - actual test exam dumps-pass for IT exams
Actual4Test http://www.actual4test.com Actual4test - actual test exam dumps-pass for IT exams Exam : 000-N20 Title : IBM Rational Enterprise Modernization Technical Sales Mastery Test v1 Vendors : IBM
More informationData Synchronization Data Replication Data Migration Data Distribution
Data Synchronization Data Replication Data Migration Data Distribution The right data in the right place at the right time. tcvision...is a cross-system solution for the timely, bidirectional data synchronization
More informationWHITEPAPER. MemSQL Enterprise Feature List
WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure
More informationOracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data
Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous
More informationAchieving Horizontal Scalability. Alain Houf Sales Engineer
Achieving Horizontal Scalability Alain Houf Sales Engineer Scale Matters InterSystems IRIS Database Platform lets you: Scale up and scale out Scale users and scale data Mix and match a variety of approaches
More information