IBM InfoSphere Streams Telecommunications Event Data Analytics Customization & Configuration


Table of contents

Introduction
General setup process
Get started
  Pre-requisites
  TEDA application setup
    Creation of TEDA application by Streams Studio Wizard
    Command-line based creation of TEDA application: Interactive setup dialog; Application setup with definition file
    Import of application projects into Streams Studio
    Build of application using default setting
    Preparation of the first start: Domain and instance preparation; Submit TEDA jobs
  Customized application setup: Customized setup with Streams Studio; Customized setup with command-line; Customizing after setup
Architecture
  Framework Architecture: Framework Application Control; Application Start Up Control; Application Shutdown; Application Framework Restart
  ITE Architecture
    Filename Ingestion: Directory Scanner; File Type Determination; Get Sort Attribute; Sort Function; File Duplicate Check; File Move; Checkpointing and Housekeeping
    File Split: Overview; Round Robin Split; Acknowledged Level2-Split
    Chain processing: Checkpointing; File Processing; Recover with input file related checkpoint files after job restart; Graceful shutdown; Custom Context; Checkpointing Configuration
    File header descriptions of ITE applications: Rejection output; Chain deduplication statistics output file; Statistics output
  Lookup Manager Architecture: General process flow in the Lookup Manager; Lookup Manager hosttag definitions; Lookup Manager file header descriptions (Command input file, Lookup statistics output file)
Ingest Transform Enrich (ITE) Customizing
  Setup configuration: Configuration parameter; Application control directory
  ITE variants: Variant A - File processing without special context processing; Variant B - Context processing with split based on file content; Variant C - Context processing with split based on file name
  Typical use-cases: Mediation; Extract-Transform-Load; Campaign-Management
  Context configuration
  Adding additional toolkits for ITE application
  External trigger for initialization phase
Lookup Manager Customizing
  Customizing file
  Command file
  Setup configuration: Configuration parameter; Application control directory
  Typical use-cases
    Lookup repository using CSV files as input source: Description; Required configuration settings (Lookup customizing file, DB disabled parameter, Enable input file parameter); Optional configuration settings
    Lookup repository using database as input source: Description; Required configuration settings for database use-case (Lookup customizing file and connections.xml file, Database configuration, DB disabled parameter, DB vendor parameter, DB connection parameter, Enable input file parameter); Optional configuration settings for database use-case (DB name parameter, DB user parameter, DB password parameter)
    Lookup repository using database and CSV files as input source: Description; Required configuration settings (DB disabled parameter, DB vendor parameter, DB connection parameter, Enable input file parameter); Optional configuration settings
    Simulation mode: Description; Required configuration settings (DB disabled parameter, Enable input file parameter); Optional configuration settings
Developer hints
  Application namespace
  Setup definition file
  External builder Makefile
  ITE FileReader customizing: Custom File Parser
  ITE Multiple FileReaders in one job
  ITE Record to table stream conversion
  ITE Punctuation handling in Transformer Composite
  ITE Split streams in Custom Transformer
  ITE Parallel channels in Custom Transformer
Operations
  Application management
  Trouble shooting: Lookup Manager trouble shooting (Application setup does not create the application structures, Submission of job fails because of missing hosttags, Lookup repository initialization fails, Corrupt shared memory segments, Lookup reader does not read shared memory segments); Lookup Repository Reader; Lookup Repository Reader parameter
References
Configuration parameter
  Common configuration parameter: Application control directory; Enable Multi-host
  ITE application: Parallelism Feature Parallel Chains; Custom host pools; Disable Lookup; Exception handler; Context configuration file; Cleanup configuration; File ingestion level2 split; File ingestion time precision; File ingestion deduplication window; File ingestion file type customization; File ingestion file-date from file-name extraction; File ingestion context split; File ingestion file-name pattern; File ingestion scanner configuration; File ingestion sort configuration; Context startup control file; Reader custom file statistics; Reader custom parser statistics; Reader file parser list; Reader output schema; Reader pre-file reader composite; Reader CSV file encoding; Reader file compression; Tap composite for post context data processor bundle; Tap composite for Transformer bundle; Tap post context data processor bundle; Tap Transformer bundle; Transformer Composite; Transformer Level-1-Split; Transformer Lookup Schema; Transformer Output Type; Transformer postcontext composite; Enable custom code; Dedup; Dedup Bloom days to keep; Dedup Bloom probability; Context Custom context days to keep; Context custom composite; Chain Sink archive input files in date dir; Chain Sink Audit table file writer; Chain Sink Custom checkpoint files; Chain Sink Dedup Checkpoint files; Chain Sink File Writer custom composite; Chain Sink table files output; Chain Sink target tables; Chain Sink Type
  Lookup Manager: DB disabled; Enable input file; DB vendor; DB connection; DB name; DB user; DB password
Submission parameter
  Common submission parameter: Application control directory on submission
  ITE application: Input directory (on submission); Input directory list file (on submission); Output directory (on submission); Checkpoint directory (on submission); Statistics directory (on submission); Number of Chains (on submission); Number of Chains per Context (on submission); Disable lookup (on submission); Filename deduplication window (on submission); File ingestion tap directory (on submission); File ingestion tap option (on submission); Cleanup schedule Month day (on submission); Cleanup schedule Week day (on submission); Cleanup schedule Hour (on submission); Cleanup schedule Minute (on submission)
  Lookup Manager: Input directory on submission; Statistics directory on submission; Application control files on submission; CSV source directory on submission; Database name on submission; Database user on submission; Database password on submission; Number of hosts on submission

Figure Index

Figure 1: TEDA Jobs Overview
Figure 2: New Project Wizard Selection
Figure 3: ITE application dialog (variant A)
Figure 4: Configuration values 2nd page
Figure 5: Lookup Manager dialog
Figure 6: Streams Studio with ITE Application graph
Figure 7: ITE application project with remote context
Figure 8: Select import of existing project
Figure 9: Select TEDA applications to import
Figure 10: Location and copy of CSV sources
Figure 11: Location of initial command file
Figure 12: Location of archived command file and statistic file
Figure 13: Location of CSV sample data for ITE job
Figure 14: ITE output files
Figure 15: Example ITE output file
Figure 16: Example of rejection output file
Figure 17: ITE application dialog (variant B)
Figure 18: ITE application dialog (variant C)
Figure 19: Framework Architecture
Figure 20: Overall Application States
Figure 21: ITE Application Design
Figure 22: File Ingestion
Figure 23: Multilevel-Split at Filename Ingestion
Figure 24: Round-Robin Split
Figure 25: Acknowledged Level2-Split
Figure 26: Files created in processing chain
Figure 27: Checkpointing during file processing
Figure 28: Recover on job start
Figure 29: Graceful shutdown
Figure 30: Custom Context composite
Figure 31: Application design of Lookup Manager
Figure 32: Variant A of ITEMain job
Figure 33: Variant B of ITEMain job
Figure 34: Variant C of ITEMain job
Figure 35: Illustrated Mediation use case
Figure 36: Stream types in mediation use case
Figure 37: Illustrated ETL use case for three tables
Figure 38: Stream types in ETL use case convert to TableStreamType before Dedup
Figure 39: Stream types in ETL use case convert to TableStreamType after Dedup
Figure 40: Illustrated Campaign-Management use case
Figure 41: Stream types in campaign use case
Figure 42: Context configuration file (CHAIN_MAPPING_FILE)
Figure 43: Context startup control
Figure 44: Sample definition of ITE jobs
Figure 45: Example of command mapping
Figure 46: Example of segment customizing
Figure 47: Example of Command File and references
Figure 48: Architecture and information flow on case of CSV input
Figure 49: Example of CSV input in LookupMgrCustomizing.xml
Figure 50: Architecture and information flow on case of database input
Figure 51: Example of attribute mapping in LookupMgrCustomizing.xml and connections.xml files
Figure 52: Architecture and information flow for both source types of input
Figure 53: Namespace definition
Figure 54: Application namespace in TEDA framework
Figure 55: Example of setup definition file
Figure 56: FileReader composites overview
Figure 57: Multiple FileReaders
Figure 58: Transformer stream types
Figure 59: Transformer Composite
Figure 60: Split streams in Transformer composite
Figure 61: Channels in Transformer composite
Figure 62: Location of application control directory
Figure 63: Enable input files as source overview

Index of Tables

Table 1: Header of ITE rejection file
Table 2: Header of chain deduplication statistics output file
Table 3: Header of ITE statistics file
Table 4: Header of command input file
Table 5: Header of lookup statistics output file
Table 6: Overview environment setting for both database variants
Table 7: Common configuration parameters
Table 8: Common submission parameter
Table 9: Lookup Manager configuration parameters
Table 10: ITE configuration parameters
Table 11: Lookup Manager submission parameters
Table 12: ITE submission parameters

1 Introduction

The scope of this document is to guide you through the configuration and customizing of the Telecommunications Event Data Analytics (TEDA) framework. As a result, the user of the TEDA framework should be able to understand the features and processing of the framework and to develop an own Ingest Transform Enrich (ITE) application. An overview of the TEDA features and concepts is given in the 'IBM InfoSphere Streams Telecommunications Event Data Analytics Overview' document.

The TEDA toolkit provides a customizable framework and a set of toolkits used in the framework. The former version of TEDA was an accelerator. The new version of TEDA is a fully integrated part of the IBM InfoSphere Streams product, which makes development more flexible and configuration easier.

Figure 1: TEDA Jobs Overview

2 General setup process

Because TEDA is integrated into the product, no additional environment setup is required. The TEDA toolkit supports the user with additional tooling for setup, customizing and administration.

In the first phase, called 'setup', the user creates the framework structure in any favored Eclipse workspace using the supporting tooling. The Get started chapter describes this process.

In the next step, called customizing, the user has to customize the detailed configuration of the framework. The usage of optional customizing composites can be selected in this step, e.g. the kind of parser. The Ingest Transform Enrich (ITE) Customizing chapter describes this process. The last phase of the ITE application setup (coding) is the coding of the business logic in the required and optional composites. This is the usual IBM InfoSphere Streams coding process. The most important hints and tips are described in Developer hints.

The setup of the ITE application requires additional settings. The TEDA toolkit can provide a set of 'hosttags' in case the multi-host solution is defined by the ENABLE_MULTIHOST configuration parameter. They must be applied to the configuration of the Streams domain. This task is supported by an additional TEDA tool.

The major part of the TEDA framework next to the ITE jobs is the application management and the lookup manager. Both functionalities are implemented in the Lookup Manager job. This is a special Streams job that needs to be customized in the customizing phase by the definition of an XML file. An example is provided by the framework. This customizing is described in the Lookup Manager Customizing chapter. Both applications provide sample logic with the default configuration and are able to process the sample input files.

3 Get started

3.1 Pre-requisites

The TEDA framework does not require any other environment than InfoSphere Streams and the toolkits delivered by the project. Refer to 01.ibm.com/support/knowledgecenter/SSCRJU/SSCRJU_welcome.html.

3.2 TEDA application setup

The TEDA framework provides tools for the setup of TEDA applications. Streams Studio supports the user with the creation of a new TEDA application by a wizard. Another way to create new TEDA projects is the usage of a script-based tool delivered by the TEDA toolkit. In this case, the user has to import the newly created projects as existing Eclipse projects into Streams Studio. This chapter describes how the user can set up new applications using only the predefined settings. In this way, the user can learn the process step by step, can use the sample data that is prepared for this example and can learn to verify the processing of the running application. The steps for creating TEDA applications with one or more ITE applications, with or without a Lookup Manager job, customized by the user, are collected in chapter: Customized application setup. The major steps of setting up a project are:
- Setup of projects, choose one of:
  - default sample: Creation of TEDA application by Streams Studio Wizard, Command-line based creation of TEDA application
  - user defined: Customized setup with Streams Studio, Customized setup with command-line
- Setup post-customizing: Customizing after setup

3.2.1 Creation of TEDA application by Streams Studio Wizard

InfoSphere Streams Studio provides a wizard to create Ingest Transform Enrich (ITE) or Lookup Manager applications. After Streams Studio has been started, navigate to File > New > Project..., scroll down to InfoSphere Streams Telecommunications Event Data Analytics and open that entry. Streams Studio presents two different project options: ITE Application Project and Lookup Manager Application Project as shown in Figure 2.
Figure 2: New Project Wizard Selection
From here you choose the application type you want to create. Streams Studio opens an appropriate configuration dialog. For an ITE application the main configuration dialog is shown in Figure 3:

Figure 3: ITE application dialog (variant A)
On this page you first choose the name of your ITE application project and the path under which it is stored. Then you choose the variant of your application from the provided list. For each of these the wizard shows a schematic and a short explanation to remind you of the properties of each variant. The wizard automatically fills the namespace field with a lower-case version of the project name. As long as, and whenever, the namespace field contains the lower-case version of the project name, the wizard synchronizes changes in the project name field to the namespace field. You enter the path to the directory containing the control files into the control path field. The name of the main composite is given and non-modifiable; the field is contained in the dialog for informational purposes only. After you have filled in all fields on this page, you may push the Finish button to create the application project. Alternatively, you may push the Next button to provide more information for your project:

Figure 4: Configuration values 2nd page
On this page you may provide another toolkit name for the application, but the wizard fills this field with the project name already. You may define a version and a required Streams version. Finally, you may also provide a description for your application. From your experience you may expect to configure toolkit dependencies on this page, but the Telecommunications Event Data Analytics applications define those dependencies in an external file named toolkitslist.xml. The information is then read in by the external Makefile. For more information on this topic see: Adding additional toolkits for ITE application. If you chose to create a Lookup Manager application project in the New Project Wizard Selection dialog above, the wizard first presents the page shown in Figure 5 to you:

Figure 5: Lookup Manager dialog
Again, you first choose the name of the Lookup Manager application project and the path under which it is stored. The wizard automatically fills the namespace field with a lower-case version of the project name. As long as, and whenever, the namespace field contains the lower-case version of the project name, the wizard synchronizes changes in the project name field to the namespace field. Next you decide if you want to use a database and if you want to use a database input file. You may choose to use neither a database nor an input file, but this configuration is probably only useful during development. Finally, you enter the path to the directory containing the control files into the control path field. The name of the main composite is given and non-modifiable; the field is contained in the dialog for informational purposes only. As soon as you push the Finish button, the wizard creates the project's directory structure and copies the appropriate files from the template archive file into it. Several files are updated with the information you provided by filling in the fields of the wizard dialog pages. After the project's data has been copied into your workspace, Streams Studio starts to build your project. When this build finishes you may examine your newly created project, for example by opening the application graph as seen in Figure 6:

Figure 6: Streams Studio with ITE Application graph
When you are working with a remote environment, the first wizard page displays appropriate configuration options and sample settings for the remote context, shown in Figure 7:
Figure 7: ITE application project with remote context
If you are going to follow the default settings then use the settings defined in Figure 3: ITE application dialog (variant A). With the remote workspace field you can define with which remote directory your project shall be synchronized.

3.2.2 Command-line based creation of TEDA application

The script-based creation of a TEDA application supports two kinds of input: interactive and file-based. The tool is located in the bin folder of the TEDA toolkit path:
$STREAMS_INSTALL/toolkits/com.ibm.streams.teda/bin
The input parameters are described with the help option -h:

[streamsadmin@host]$ teda-create-project -h
usage: <$STREAMS_INSTALL>/toolkits/com.ibm.streams.teda/bin/teda-setup-projects [options]
options:
-x --xml    the xml-file collecting setup parameters for 'silent' setup - optional
-b --backup If this flag is defined then the existing application folder will be copied to an <appl-name>_<timestamp> folder as backup. The command stops processing if the application folder exists and neither --backup nor --force is set.
-f --force  Using this flag, the setup process removes the existing folder and creates the new application folder. The command stops processing if the application folder exists and neither --backup nor --force is set.

Interactive setup dialog

The interactive setup begins with a call of the tool without any parameter.

Call setup tool
[streamsadmin@host]$ teda-create-project

Select ITEMain application
entry: 1
Please select job (1-2):
1 - ITEMain
2 - LookupMgr

Select ITE application variant
entry: A
Please select variant (A-C):
Variant A - File processing without special context processing (default)
Variant B - Context processing with split based on file content in Transformer
Variant C - Context processing with split based on file name in File Ingestion
Description: There is no difference in the setup process between the variants, so variant A is described here as an example only. The details about the ITE application variants are described in chapter ITE variants.

Set APPLICATION_FOLDER
entry: Enter (default value selected).
Set parameter value (default: iteappl):
Description: This is the name of the directory that includes the implementation of the InfoSphere Streams project. It is not the base path to the location in the file system. The value presented as default is set automatically by pressing the Enter key. If you set up more than one ITE job then you must set a unique value for each ITE job's APPLICATION_FOLDER because it is not possible to create two or more Eclipse projects in the same folder.

Set APPLICATION_NAMESPACE
entry: Enter (default value selected).
Set parameter value (default: itejob.a):
Description: The parameter defines the core of the namespace of the ITE job. The details of this parameter are described in Application namespace.

Set DISABLE_LOOKUP
entry: Enter (default value selected).
Set parameter value (default: true):
Description: The parameter defines whether the lookup is not required by the ITE job. The details of this parameter are described in Disable Lookup.

Set APPL_CTRL_FILE_PATH
entry: Enter (default value selected).
Set parameter value (default: /home/streamsadmin/teda_applications/control):
Description: The parameter defines the path to the directory that forms the application control interface between the Application Control Master of the Lookup Manager job and the ITE jobs using the lookup repository. This is the common path definition for all TEDA jobs. If more than one host is configured, it must be located on a shared file system. The details of this parameter are described in Application control directory. The value is set per default to $HOME/teda_applications/control.

Set PROJECT_BASE
entry: Enter (default value selected).
Set parameter value (default: /home/streamsadmin/teda_applications):
Description: This parameter defines the absolute path to the folder that collects all TEDA job projects including the sources. This path points to the parent directory of the folder defined as value of the APPLICATION_FOLDER parameter. It supports environment variables using '%' before and after the environment variable name. The value is set per default to $HOME/teda_applications.

Summary
entry: Enter (Y(es) is set per default as capital letter).
SUMMARY of parameter setting
Parameter 'APPLICATION_FOLDER': iteappl
Parameter 'APPLICATION_NAMESPACE': itejob.a
Parameter 'APPL_CTRL_FILE_PATH': /home/streamsadmin/teda_applications/control
Parameter 'CONTEXT_DISABLED': 0
Parameter 'DB_DISABLED': 1
Parameter 'FILEINGESTION_LEVEL1_SPLIT_DISABLED': 0
Parameter 'PROJECT_BASE': /home/streamsadmin/teda_applications
Parameter 'TRANSFORMER_LEVEL1_SPLIT': 0
Continue...(Y/n)?
Description: This step gives an overview of all parameters that are set by the tooling. The following parameters are set by the selection of a job kind or ITE variant: CONTEXT_DISABLED, DB_DISABLED, FILEINGESTION_LEVEL1_SPLIT_DISABLED, TRANSFORMER_LEVEL1_SPLIT. They are described in chapter ITE application. The DB_DISABLED parameter is set to '1' per default for the ITE application. If the user is going to use a database in the custom composites then it must be configured as described in the com.ibm.streams.db toolkit.

Backup or remove existing application folder (defined by the APPLICATION_FOLDER parameter)
entry: Enter ('backup' per default).
Do you want to remove (R) or backup (B) the application folder iteappl in /home/streamsadmin/teda_applications (B/r)?
Description: This option is only visible if the application folder has been found in the file system. The user can define what should happen with the existing folder. In case of backup, the existing folder is renamed to a backup folder. The naming follows the schema <APPLICATION_FOLDER-value>_<timestamp>.

Next application request
entry: Enter (Y(es) must be set to create a Lookup Manager project).
Do you want setup next job? (Y/n)
Description: The creation of the ITE job is finished now. The user can decide to finish or to create a new job. In the latter case the common settings are stored and the user does not need to set the values again. The common parameter settings are: PROJECT_BASE, APPL_CTRL_FILE_PATH.

Select LookupMgr application
entry: 2
Please select job (1-2):
1 - ITEMain
2 - LookupMgr

Set APPLICATION_FOLDER
entry: Enter (default value selected).
Set parameter value (default: lookupmgr):
Description: This is the same parameter as for the ITE job. In this case the default value for the Lookup Manager project is lookupmgr.

Set APPLICATION_NAMESPACE
entry: Enter (default value selected).
Set parameter value (default: common.lookup):
Description: The parameter defines the core of the namespace of the Lookup Manager job. It must be different from the settings of the other jobs.

Set DB_DISABLED
entry: Enter (default value selected).
Set parameter value (default: true):
Description: The parameter defines whether a database is a source of the Lookup Manager job. More details are described in DB disabled. The value 'true' is defined per default and does not set the database as repository source.

Set LOOKUP_ENABLE_INPUT_FILE

entry: Enter (default value selected).
Set parameter value (default: true):
Description: The parameter defines whether CSV files are the source of the Lookup Manager job. More details are described in Enable input file. The value 'true' is defined per default and sets the CSV files as repository source.

Summary
entry: Enter (Y(es) is set per default as capital letter).
SUMMARY of parameter setting
Parameter 'APPLICATION_FOLDER': lookupmgr
Parameter 'APPLICATION_NAMESPACE': common.lookup
Parameter 'APPL_CTRL_FILE_PATH': /home/streamsadmin/teda_applications/control
Parameter 'DB_DISABLED': true
Parameter 'LOOKUP_ENABLE_INPUT_FILE': true
Parameter 'PROJECT_BASE': /home/streamsadmin/teda_applications
Continue...(Y/n)?
Description: This step gives an overview of all parameters that are set by the tooling. The list contains the common parameters and the parameters set by the user.

Backup or remove existing application folder (defined by the APPLICATION_FOLDER parameter)
entry: Enter ('backup' per default).
Do you want to remove (R) or backup (B) the application folder lookupmgr in /home/streamsadmin/teda_applications (B/r)?
Description: This option is only visible if the lookupmgr folder has been found in the file system. The user can again decide what should happen with the existing folder.

Next application request
entry: Enter (N(o) is set per default as capital letter).
Do you want setup next job? (Y/n)
Description: With 'No', the tool finishes.

Finally, one ITE project structure and one Lookup Manager project structure have been created in the /home/streamsadmin/teda_applications folder by the teda-create-project TEDA tool. The user can customize the TEDA applications by implementing own source code in the custom composites using any text editor, or use Streams Studio. In case of Streams Studio, the projects must be imported into the Eclipse environment in the next step.

Application setup with definition file

The so-called silent setup requires an XML setup definition file that describes the location and settings of the applications. The file is described in Setup definition file. An example of a TEDA application setup file is located in:
$STREAMS_INSTALL/toolkits/com.ibm.streams.teda/etc/templates/TelcoFrameworkSample.xml
The user must copy this file to another directory, customize it and finally use it as input for the command call:
teda-setup-projects -x /<path-to-custom-xml>/TelcoFrameworkSample.xml
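For illustration, a minimal command sequence for the silent setup might look as follows; the target directory /home/streamsadmin/setup is only an assumption for this sketch, and the template file name is kept unchanged:

cp $STREAMS_INSTALL/toolkits/com.ibm.streams.teda/etc/templates/TelcoFrameworkSample.xml /home/streamsadmin/setup/
# edit /home/streamsadmin/setup/TelcoFrameworkSample.xml and describe the ITE and Lookup Manager projects to create
teda-setup-projects -x /home/streamsadmin/setup/TelcoFrameworkSample.xml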

The structure of all defined applications in the setup definition file is created in the location defined by the project base path.

Import of application projects into Streams Studio

Streams Studio is the recommended developer environment for InfoSphere Streams. The applications created by the teda-create-project tool need to be imported as existing projects into Streams Studio:
In the menu: File > Import...
In the dialog select: General > Existing Projects into Workspace
Press the Next button. This dialog is shown in Figure 8.
Figure 8: Select import of existing project
Select: Select root directory: Browse...: look for the project base path, here: /home/streamsadmin/teda_applications
The applications are selected automatically. If they are not selected then press the Select All button, see Figure 9.

Figure 9: Select TEDA applications to import
Press the Finish button.

Build of application using default setting

The setup process provides all required makefiles, configuration settings and Streams Studio settings so that the application build can be tested immediately after the setup. The Streams Studio environment detects the external builder automatically. The supported builder targets are: make, make all, make clean. The typical build for a selected project is done in Streams Studio by: Right-click on the selected project: Build Project. The build process can also be started from the command line using the mentioned make targets, called from the project root directory:
ITE application:
cd /home/streamsadmin/teda_applications/iteappl
make all
Lookup Manager application:
cd /home/streamsadmin/teda_applications/lookupmgr
make all
The details of the make command are described in chapter External builder Makefile.

Preparation of the first start

After the build, the user should check the created application using the default settings and the provided sample input data. TEDA requires the sample data to be moved to the input directories.

Domain and instance preparation

The user can use the default domain and default instance or set up own settings. The domain and the instance must be set up as described in the InfoSphere Streams documentation. If the multi-host solution is configured then the build process creates the list of all hosttags for each application in the configuration directory: <project-base>/<application-folder>/config/hosttags.txt. They must be added to the Streams domain. There are two ways:
- Streams Studio: add hosttags to the existing domain in the Streams Explorer. Details are in the Streams Studio description.
- Command line: usage of streamtool commands.

Submit TEDA jobs

There are a few easy steps:
Move sample CSV files
Move the samples from the <project-path>/data/sample/ folder to the <project-path>/data/ folder. The location is shown in Figure 10.
Figure 10: Location and copy of CSV sources
Submit the Lookup Manager job
Submit the job in your preferred way without any submission parameter. Check that the job is submitted and all PEs are in Running state and Healthy. Using the TEDA Monitor sample application the user can see the collected resource data of the TEDA applications. It is possible to check the state of the Application Control Master located in the application control folder, per default in $HOME/teda_applications/control:
cat /home/streamsadmin/teda_applications/control/appl.ctl
initial
The state 'initial' is the correct result.
Submit the ITE job
Submit the job in your preferred way without any submission parameter. Check that the job is submitted and all PEs are in Running state and Healthy.

The TEDA Monitor sample application updates the view automatically. Similar to the Lookup Manager, it is possible to check the state of the Application Control Master and Application Control Slave. Once the ITE jobs are submitted, the state changes to 'stopped':
cat /home/streamsadmin/teda_applications/control/itejob.a
stopped
cat /home/streamsadmin/teda_applications/control/appl.ctl
stopped
Move the dummy initial_all.cmd file
Move the sample command file from the sample folder to the created default command input folder './in/cmd', which is relative to the <project-path>/data folder. See Figure 11. The Lookup Manager begins the initialization of the lookup repository.
Figure 11: Location of initial command file
Check the ./in/cmd/archive folder. The 'in' folder is relative to the <project-path>/data folder. The initial_all.cmd file is moved there after the initialization process has finished, and the Application Control changes the status to 'run'. See Figure 12.
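As a summary of this step, and assuming the Lookup Manager project was created under the default project base used in this walkthrough (the exact location of the sample command file below the data/sample folder may differ in your setup), the sequence on the command line might look like this:

cd /home/streamsadmin/teda_applications/lookupmgr/data
mv sample/initial_all.cmd in/cmd/
# after the initialization has finished, the command file is archived and the state changes to 'run'
ls in/cmd/archive/
cat /home/streamsadmin/teda_applications/control/appl.ctl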

Figure 12: Location of archived command file and statistic file
Check statistic data
The statistics are located in the ./out/statistics/<date>_lookupmanagerstatistics.txt file relative to the <project-path>/data folder. The first line of the statistics covers the internal initialization process. The second line contains the statistics for the processed command. The most interesting data is the list of segments, which can be used for verification:
[sgmnt:segmaster1- table:dimmaster1- reserved: free: free %:99- processed:2]
Values to be verified in the DimMaster1 repository segment: sgmnt:segmaster1, processed:2
[sgmnt:segmaster2- table:dimmaster2- reserved: free: free %:99- processed:3]
Values to be verified in the DimMaster2 repository segment: sgmnt:segmaster2, processed:3
Move the ITE input data
Move the sample input file from <project-path>/data/sample/csv into the default input folder of the ITE job: <project-path>/data/in. The default implementation only supports input files in CSV format. ASN.1 and binary parsing is not supported per default; other implementation samples delivered with the TEDA toolkit show how to customize both remaining formats. Figure 13 shows the location of the CSV sample file. The input file includes 2 incorrect data records: one duplicate and one data record with an incorrect format. The ITE job starts the file processing using the dummy business logic provided per default by the TEDA toolkit. The Application Control must be in 'run' state, else the processing does not start.
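A corresponding command-line sketch for this step, again assuming the default project layout created earlier (directory names are taken from this walkthrough; adjust them if you changed the configuration):

cd /home/streamsadmin/teda_applications/iteappl/data
mv sample/csv/*.csv in/
# processed input files are archived, valid records are written to out/load, rejected records to out/reject
ls in/archive/ out/load/ out/reject/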

Figure 13: Location of CSV sample data for ITE job
Check the ITE archive
The ITE process moves the processed input files from the <project-path>/data/in input folder to the <project-path>/data/in/archive folder. See Figure 14.
Figure 14: ITE output files
Check the ITE output files.
Correct input: The ITE job distributes the output files depending on business logic, duplication checks and validation checks. All input data processed correctly is provided in CSV format to the load directory located in the <project-path>/data/out/load folder. The format of the files is prepared for the database load process using the native database loader tools, and the format can be customized by the user. Figure 15 shows an example of the content.
Figure 15: Example ITE output file
Incorrect input: The incorrect data is moved to the reject directory located in the <project-path>/data/out/reject folder. The content of the rejected file is described in Figure 16. In the case of the default sample processing there must be two lines: a duplicate and a format fault.
Figure 16: Example of rejection output file
Check the ITE statistics
The statistics output is located in the ./out/statistics/<date>_<namespace>statistics.txt file relative to the <project-path>/data folder. The interesting outputs for verification:
sentrecords=401 - This is the number of parsed records. The input file included 402 records.
nlinesdroppedcsv=0 - There are no lines dropped by the CSV parser.
recordstatscsv={"sample":{"errors":1,"records":401}} - There is one erroneous line and 401 valid records.
duplicate=false,invalidfile=false - This file is valid and it is not a duplicate.
rejectedinvalids=1,recordduplicates=1 - There is one invalid record and one duplicated record among the valid records.
chainsequencenumber=401,level1id="02",level2id="0" - The chain defined for level1id="02",level2id="0" processed all 401 valid records.
Shutdown ITE Jobs and Lookup Manager
The TEDA framework provides the tool teda-shutdown-job to support graceful shutdown. Its input parameters and function are based on the streamtool canceljob command. The TEDA tool takes care of finalizing the currently processed files, writes checkpoints and cancels the jobs.

26 files, writes checkpoints and cancel the jobs. The advantage of the teda-shutdownjob usage is the controlled cancellation of ITE jobs and faster restart processing on job submits. teda-shutdown-job -h usage: teda-shutdown-job [-d,--domain-id <did>] [-i,--instance-id <instance>] [--collectlogs] [-h,--help] [-v,--verbose [<level>]] {[-f,--file <file-name>] [-j,--jobs <job-id,...>] [<jobid>... <jobid>,...]} Shutdown one or more jobs identified by their job id. Options and arguments: -d,--domain-id: The domain name. -h, --help: This output. -v,--verbose: Defines output level (0-3). Default: warning 1 - info 2 - debug 3 - trace -f,--file: File containing a list of job ids (one per line) -i,--instance-id: The instance name. -j,--jobs: List of job ids. --collectlogs: Collect the job (PE) logs.[streamsadmin@host]$ Example of the teda-shutdown-job call: [streamsadmin@host]$ teda-shutdown-job Customized application setup The customized application setup needs the same steps as described in TEDA application setup. The user has the choice between Streams Studio and command-line based setup tooling Customized setup with Streams Studio The general steps are described in chapter Creation of TEDA application by Streams Studio Wizard. The Figure 2: New Project Wizard Selection shows the user two kinds of TEDA applications. The user can define for either one or no one Lookup Manager project. If Lookup Manager job is defined then minimum one ITE job must be configured as controlled application. In case of ITE project, the user can define one or more ITE projects. Each one can define if the lookup feature is required for the business logic or it is not. The lookup feature is disable per default. If the user is going to use it then the checkbox Use Lookup Manager must be set as shown in Figure 3: ITE application dialog (variant A) else the ITE job will not synchronize the states with application control of Lookup Manager and it will run the files processing immediately. The user can define any value in each field in ITE application dialog (Figure 3). The one important setting is the Control path, that defines the synchronization and control folder of application control. The same value must be set during the setup dialog of Lookup Manager project. This Control path parameter is described in chapter Application control directory. Next to one project of variant A shown in Figure 3, the user can define additional or other ITE variants that are described in ITE variants. The following figures show the example of the wizard page for the ITE application variants B and C: Page 26 of 118

Figure 17: ITE application dialog (variant B)
Figure 18: ITE application dialog (variant C)

Customized setup with command-line

The user-defined setup uses the same tooling as for the default samples, described in Command-line based creation of TEDA application.

In case of the interactive mode, the user can fill in the values as required for the project. It is recommended to define all projects in one run of the teda-create-project command because the common parameters like APPL_CTRL_FILE_PATH are cached during the whole run and set automatically. Otherwise, the user must take care to set the same values during each run. In case a definition file is used, the user has to prepare an XML file as described in chapter 7.2 Setup definition file.

Customizing after setup

In case of Lookup Manager usage, the user must create the LookupMgrCustomizing.xml file. The major setting for the application control is the correct definition of all ITE applications that enable the Lookup Manager feature (Disable Lookup) in the customizing file. If this setting is missing then the ITE application will not start up. If an incorrect ITE application is set then neither the ITE applications nor the Lookup Manager will start the processing. The Lookup Manager customizing file is described in Customizing file. The next steps are similar to the default sample process described in Import of application projects into Streams Studio in case of command-line processing. The build of the applications needs to be done as described in Build of application using default setting; there is no difference. In case of a multi-host configuration the hosttags must be applied to the domain configuration in the next step. The user can view the hosttags file for each ITE application and for the Lookup Manager as defined in the <project-base>/<application-folder>/config/hosttags.txt file and apply them to hosts in Streams Studio. The single-host solution does not require any hosttag handling. See the configuration of Enable Multi-host. The start of the instance and the job submission are typical InfoSphere Streams processes. The data verification depends on the business logic implemented in the ITE jobs. The location of the input and output directories depends on the configuration settings and submission parameters used by the user.

4 Architecture

4.1 Framework Architecture

Figure 19: Framework Architecture
The TEDA framework can be used with or without the enrichment function. If it is used without enrichment, an ITE job can run as a standalone application. If the ITE application has enrichment enabled, a separate Lookup Manager is required for the preparation of the lookup data in shared memory. In those configurations a synchronization of the updates of the lookup data is required. One Lookup Manager application can control/manage the lookup data for several ITE applications. The general setting whether an ITE job runs with or without a Lookup Manager job is made with the parameter Disable Lookup. If an ITE job is configured to run with a Lookup Manager, a submission-time variant of this parameter (see: Disable lookup (on submission)) is available. With this parameter it is possible to disable the synchronization of the ITE job and to disable the lookup code temporarily, so you can run the ITE job standalone for test or development purposes.
The synchronization of the jobs and the overall job control is implemented by means of files in the shared file system. For this purpose the parameter Application control directory is used to point all synchronously running jobs to a common directory for the exchange of status information. Each set of TEDA jobs (Lookup Manager and all synchronized ITE jobs) needs a unique Application control directory. Even if the ITE job runs in a standalone configuration, the Application control directory parameter is necessary; it is used to store the general ITE job status. As long as the namespace of each job is unique, a common Application control directory for all standalone ITE jobs is sufficient, but you may choose to use separate Application control directories for different standalone ITE jobs. There is also a submission-time variant of this parameter: Application control directory on submission.
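For illustration, the control directory of the walkthrough configuration from the Get started chapter might then contain the following files; the names stem from that example: appl.ctl is maintained by the Lookup Manager, itejob.a is the response file of the ITE job, and appl.ctl.cmd is the action control file described in the following sections.

ls /home/streamsadmin/teda_applications/control/
appl.ctl  itejob.a  appl.ctl.cmd
cat /home/streamsadmin/teda_applications/control/appl.ctl
run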

4.1.1 Framework Application Control

If the ITE application has enrichment enabled, the access to the shared enrichment data (which is located in shared memory) must be synchronized to maintain mutually exclusive access. Per convention the Lookup Manager job is the only job with write access to the shared memory segments. All ITE jobs have read access to the shared memory segments. Thus the activities of the Lookup Manager and the assigned ITE jobs are synchronized with a master-slave architecture. During command execution of the Lookup Manager, the file processing of the ITE jobs is disabled. The control logic of the ITE job ensures that the processing is stopped at file boundaries. The Application Control logic may enter the following global states: startup, initial, stop, stopped, start, run, shutdown and termination. The terms stop, stopped, start and run indicate the status from the ITE job point of view. In the Lookup Manager and in all ITE jobs there is a dedicated application control logic implemented which controls this mutually exclusive access to the shared memory resources:
The Lookup Manager implements the master part of the application control logic: When a command request to update/initialize the shared resource is received, it is queued and a transition to the state STOPPED is issued. Once this state is reached, the command is dequeued and the shared resource is written. When the execution of all commands has finished, a state transition to state RUN is issued.
The ITE jobs implement the slave part of the application control logic: During state RUN the file processing is enabled. When the application control logic recognizes a request to change the state to STOPPED, it stops the file processing at a file boundary. When the state changes later to RUN, the file processing is continued. Additionally, a scheduled event to trigger the housekeeping in the de-duplication logic is generated (the refresh trigger). When this trigger arrives, the application controller stops the file processing and the context controller issues commands to refresh the context and de-duplication state.

Figure 20: Overall Application States (state diagram showing the transitions between the states STARTUP, INITIAL, STOP, STOPPED, START, RUN, SHUTDOWN and TERMINATION for master and slaves)

STARTUP - The state set at start-up.

INITIAL - This state is set at the beginning and in case of a restart of the application control logic.
STOP - This state is set on an incoming write request at the Lookup Manager job.
STOPPED - This state is set by a slave job after the stop request, once the file processing of the current file has finished in that slave. Once the state is reached by all slaves, the master job can start the update processing of the cache. Additionally this state is used in the ITE job to restore the context state after application initialization.
START - This state is set by the master job after the finalization of all updates, when the incoming command queue is empty. The START state is set in the REQUEST file of the master job. Before an ITE job enters the state START, all context ready conditions are checked. The slave remains in state STOPPED until all ready conditions are met.
RUN - This state is set by each slave job after the re-/start of file processing of the new file. The state is set in the response file of the ITE job.
SHUTDOWN - This state is entered in case of a graceful shutdown. Application control ensures that all chains have finished file processing. The jobs can now be cancelled.
TERMINATED - This state is set by each job at cancellation.
The communication of the master with the slaves is done via files which are located in a common directory in the shared file system (see: Application control directory and Application control directory on submission). There is one 'application state control file' (appl.ctl) which is maintained by the Lookup Manager (master). Each ITE job (slave) maintains an 'application state response file'. The name of the response file is taken from the namespace of the ITE job. The application state control file and the application state response files are located in the 'application control directory'. The control and response files contain one line with the current state of the master or slave. The slaves read the application state control file with the (new) target state. The master recognizes the states of all slaves by reading all state response files.
One Lookup Manager can control several ITE jobs. The list of all possible ITE jobs which can be controlled from one Lookup Manager is defined at compile time and is taken from the LookupMgrCustomizing.xml file. When the Lookup Manager job is submitted, a list of the active ITE jobs can be defined at submission time (see: Application control files on submission). This active ITE jobs list must be a subset of the compile-time possible ITE jobs list. If the submission parameter is omitted, the possible ITE jobs list becomes the active ITE jobs list.
The meaning of the states and the state change trigger events:
STARTUP: This state is entered when a job is submitted. In this state the file and command processing is disabled, no checkpoint files are loaded or changed, and no control files are changed except the startup control file. This state is left and the INITIAL state is entered when the startup check has been passed, which means that no other job with the same namespace is running. If the startup check fails, this state is never left.
INITIAL:
Lookup Manager: When this state is entered, the Lookup Manager deletes the status response files of all possible ITE jobs. When this state is entered, the list of the active ITE jobs is generated either from the submission-time list or from the restart action request.

The command processing is disabled during this state. The application control logic waits until a state response file with state INITIAL is seen from all active ITE jobs. If this condition occurs, the state changes to STOP.
ITE jobs: When this state is entered, the application control logic waits until a valid application state control file with state INITIAL is seen. If so, the logic writes a state response file with state INITIAL. When this state is entered, the application controller initializes the restoration of the file deduplication state and the context state from checkpoint files. The application control logic waits until a state control file with state STOP is seen. If this condition occurs, the state STOP is entered.
STOP:
Lookup Manager: Command processing is disabled. The application control logic waits until a state response file with state STOPPED is seen from all active ITE jobs. If this condition occurs, the state changes to STOPPED. A forced-restart action request may trigger a change to the INITIAL state.
ITE jobs: In this state the application control logic waits for the acknowledgements that the file processing in the chains has been stopped. If so, it enters the state STOPPED. (This condition is immediately set to true when this state was entered from INITIAL.)
STOPPED:
Lookup Manager: The command processing is enabled. The lookup cache is initialized/written as long as new commands are queued. This state is left and the state START is entered when no new command request is queued and at least one command was executed. This state is left and the state INITIAL is entered when a restart or a forced-restart action is encountered and no command request is queued.
ITE jobs: The file processing is disabled. Cleanup is enabled, which means cleanup requests from the scheduler are executed. This state is left when the application state control file flags the state START and the context cleanup logic is idle. If this condition is met, the ITE job enters the state START; that is, a state change to START is deferred until all context controllers have finished a scheduled or the initial cleanup request. The application controller logic sends a request to start the file processing in all chains. This state is left when the application state control file flags the state INITIAL. If this condition is met, the ITE job enters the state INITIAL; that is, a restart action is executed.
START:
Lookup Manager: The command processing is disabled.

The application control logic waits until a state response file with state RUN is seen from all active ITE jobs. If this condition occurs, the state changes to RUN. A forced-restart action request may trigger a change to the INITIAL state.
ITE jobs: The file processing is started and the control logic waits for the acknowledgement from all file processing chains. If so, it enters the state RUN.
RUN:
Lookup Manager: The command processing is disabled. This state is left and the state STOP is entered when a new command request is queued. This state is left and the state INITIAL is entered when a restart or a forced-restart action is encountered.
ITE jobs: The file processing is enabled. Cleanup is enabled, which means cleanup requests from the scheduler are executed. When the scheduler triggers a cleanup request, the file processing is stopped temporarily and the appropriate events are handled by the context cleanup logic. This state is left when the application state control file flags the state STOP and the context cleanup logic is idle. If this condition is met, the ITE job enters the state STOP; that is, a state change to STOP is deferred until all context controllers have finished the scheduled cleanup request. The application controller logic sends a request to stop the file processing in all chains. This state is left when the application state control file flags the state INITIAL and the context cleanup logic is idle. If this condition is met, the ITE job enters the state STOP; that is, a restart action is executed. The application controller logic sends a request to stop the file processing in all chains.
When a job is canceled, the application state file is set to TERMINATION.

4.1.2 Application Start Up Control

The TEDA framework provides a start-up control logic which ensures that a specific application is not running more than once in a system. When an application (ITE job or Lookup Manager job) is started, a check is performed whether the application is already running. If another active application with the same namespace is found, the application never enters one of the processing application states. In this case the application remains in the STARTUP state forever. The start-up control logic requires a common control directory, see parameter: Application control directory.
Note: Care must be taken if the application control directory is specified at submission time (Application control directory on submission). If the applications are started and pointed to different directories, the application control and start-up control logic cannot function properly.

4.1.3 Application Shutdown

In addition to canceling a Streams job, the TEDA application framework offers a job shutdown command. This is the recommended shutdown method: the teda-shutdown-job command should be used instead of streamtool canceljob. In case of the Lookup Manager job a shutdown means that all queued update commands are finished before the shutdown state is entered; thus the lookup cache is left in a consistent state after shutdown. In case of an ITE job a shutdown means that the processing is stopped at file boundaries and that state checkpoint files are written if possible. See: Graceful shutdown. The graceful shutdown is implemented by means of the action control file. The action control file is located in the application control directory and has the file name 'appl.ctl.cmd'. It contains a single line with the string 'shutdown'.

4.1.4 Application Framework Restart

The complete lookup cache initialization may be a time-consuming job. Therefore the TEDA framework provides the possibility to add or remove an ITE job from the run configuration during run-time of the Lookup Manager. Additionally, a single ITE job may be restarted while the Lookup Manager runs. This is achieved by means of the action control file and the teda-framework-restart command. When the Lookup Manager is started, the list of the actually running ITE jobs is provided with the parameter: Application control files on submission. The application control logic synchronizes the cache activities of the Lookup Manager and all ITE jobs of this list, so it is possible to run the Lookup Manager with only a subset of all compiled-in ITE jobs. This list can be changed during run-time by means of the action control file. The action control file is located in the application control directory and has the file name 'appl.ctl.cmd'. It contains a single line with the requested action 'restart' and a comma-separated list of the controlled ITE job namespaces. When the application control logic recognizes a restart action request, it checks the ITE job list against the compiled-in ITE job list. If all members of the list from the action control file are valid, the restart is processed. If the list from the action control file contains members which are not in the compiled-in ITE job list, the restart action request is not processed and an error message is written to the action control file. When the restart command is processed, a state change either from state RUN or from state STOPPED to state INITIAL is executed. When the restart action is performed from state STOPPED, which means cache update commands are being executed, the restart action is delayed until the execution of all commands has finished. If the application control logic is stuck in state START or state STOP because of a crash of the ITE jobs, a forced variant of the restart action is available. In such a situation the user may issue the forced-restart action. In this case the action control file must contain the string 'forced-restart' and a comma-separated list of the actual ITE jobs. If the operator wants to stop or start ITE jobs without a restart of the Lookup Manager job, the general sequence should be: shut down or cancel the ITE jobs which must be stopped, submit the ITE jobs which must be started, and issue the restart action. Use the forced-restart action if the state of the Lookup Manager is stuck in state START or STOP due to a crash of an ITE job.
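As an illustration only, the action control file could be written with a plain shell command; the separator between the action keyword and the namespace list is assumed here to be a comma, analogous to the list format described above, and the teda-shutdown-job and teda-framework-restart tools remain the recommended way to issue these actions:

echo "shutdown" > /home/streamsadmin/teda_applications/control/appl.ctl.cmd
echo "restart,itejob.a" > /home/streamsadmin/teda_applications/control/appl.ctl.cmd
echo "forced-restart,itejob.a" > /home/streamsadmin/teda_applications/control/appl.ctl.cmd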

4.2 ITE Architecture

The ITEMain job is the TEDA worker application for file to file processing and the workflow consists of:
1. Scanning and parsing the input files
2. Extracting the files and then enriching and transforming the data
3. Removing duplicate records
4. Writing the records to output files
5. Collecting and storing of file processing statistics

Figure 21: ITE Application Design

In Figure 21: ITE Application Design, the functional blocks in red boxes need to be customized by writing SPL code, the orange boxes can be customized by writing SPL code, and the blue boxes are the common code that can be configured or disabled via configuration parameters.

The ITE application has 6 main functional blocks:

IngestFiles
- Scanning of one or more input file directories
- file name deduplication
- file type validation
- context split and load distribution of file info tuples to processing chains

Context
- record deduplication (BloomFilter)
- custom context (e.g. custom aggregations)

Control
- ApplCtrl controls the job startup phase, checkpointing and chain processing, and exchanges the job status information with the Lookup Manager
- Scheduler can be configured to refresh stateful logic and checkpoint files

Statistics
- Creates log entries for each processed file
- Log files are written to a configurable statistics directory and are archived on date switch
- Creates and updates file processing summary metrics for the TEDA Monitoring GUI

Taps
- Merged output streams from all chains for custom tap logic (e.g. single FileSink or MetricsSink)

Chain
- ChainProcessorReader
  - ChainControl ensures that one file after another is processed. The file info tuple is forwarded to the FileReader when the file processing is completed at ChainSink. Chain processing can be paused and unpaused.
  - PreFileReader can be used to extract data from the filename only once. This information is forwarded to downstream operators.
  - The FileReader block can be configured for one or more file parsers. A FileReader contains the FileSource, Parser and Converter operators. If multiple FileReaders are configured, the FileTypeValidator needs to set the file type attribute to address the right file type specific FileReader.
  - Validator code needs to be customized for specific validation checks on input data. Invalid records need to be rejected here.
- ChainProcessorTransformer
  - This functional block contains the business logic of your application
  - Records can be enriched with lookup data
  - Record transformation takes place here, e.g. transform from a record stream to a table stream if you need to write to several output tables.

  - Context split at variant B to send the records to the right context composite based on tuple data.
- ChainSink
  - PostContextProcessor rejects the duplicate records and can be used to transform your tuple for the FileWriter
  - RejectFileWriter writes the information about rejected records
  - ContextRestoreWriter writes checkpoint files, e.g. hash codes for restoring BloomFilter states or files to restore a custom context.
  - ChainFinalizer sends the acknowledgement to chain control to proceed with the next file and moves the input and output files to the target directories

Filename Ingestion

The function complex 'Filename Ingestion' scans for files and feeds the worker chains with the file names to process. It has the following essential features:
- directory scanner
- file type determination
- get sort attribute
- file sort
- file duplicate check
- file move for invalid and duplicate files
- level1 split
- level2 split
- checkpointing and housekeeping

These functions are configurable and some functions may be switched off completely.

Figure 22: File Ingestion (inputs: start signal from control, clean signal from control, acknowledged files from worker chains with filename, success, reprocess, filetime; outputs: dropped files to the statistics writer, process files to the worker chains with filename, reprocess, urgent, filetime, filetype, level1id, level2id)

Directory Scanner

The directory scanner function scans one or several directories for files matching the process file pattern and for files matching the re-process file pattern (File ingestion file-name pattern). All file names with no pattern match and all hidden files (dot files) are ignored. The file name is only generated the first time the file is seen during a directory scan, until it is re-created. The modification time (m-time) is used to detect whether a file has been re-created or changed. The scan operation is switched on by the application controller when all function units report a ready signal after start up. This means the Lookup Manager has completed the execution of the 'initial command' and all context functions have completed the loading of their state. The directory scanner function does not perform any file sort.

Note: Because the modification time of the file is used to detect if a file has been re-created, it is possible that very large files are still being written when a directory is being scanned. In this case, the same file name may be generated multiple times. In order to avoid this, the file should be written under a hidden file name (or into a different directory on the same file system) and then renamed to the target name when complete. If a regular expression pattern is being used to match only certain files, creating the new files under a name that fails to match the pattern, and then renaming, will also work.

The scan operation in a single directory must be configured with a single configuration parameter: Input directory (on submission). This parameter also points to the directories for the file move function. The scan operation in multiple directories requires the configuration parameter Input directory (on submission), which points to the directories for the file move function, and Input directory list file (on submission), which points to a file with the list of the input directories. If this feature is used, the files detected in the first directory of the list are marked urgent. Urgent files are queued in a separate file queue which has precedence over the normal file queue.

Note: The directories of both parameters should be located in the same file system.

Additionally, the file-time attribute is generated from the m-time of the file. The precision of this time can be configured with the parameter: File ingestion time precision. The file time attribute is used later to decide when the file name has to be removed from the file deduplication storage. Between scan cycles the directory scan function sleeps for a guaranteed sleep time (see: File ingestion scanner configuration).

File Type Determination

File ingestion provides an attribute 'file type'. This attribute can be used later by the parser function. This function can be customized, see: File ingestion file type customization. The custom operator may choose to consider some filenames as invalid and emit them to the second output port. These files will be moved to the invalid directory by the move function.

Get Sort Attribute

The file ingestion function provides some built-in operators to extract the sort attribute. The sort attribute may be used later by the sort function. The user may also provide a custom operator to extract the sort attribute. See: File ingestion sort configuration. This function may detect invalid files. These files will be moved to the invalid directory by the move function. If the parameter FILEINGESTION_SORT_ATTRIBUTE equals 4, the File ingestion file-date from file-name extraction configuration is also necessary. Note: In this case the file time attribute is overwritten with the file time obtained from the file-name. Per default this function is disabled.

Sort Function

The file ingestion function provides some built-in operators to sort the input files by the sort attribute. The built-in sort operation is activated after each scan cycle (which is marked by a window punctuation mark). The user may also provide a custom operator to perform the sort function. See: File ingestion file-date from file-name extraction. Per default this function is disabled.
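Relating to the directory scanner note above about avoiding partially written files: a minimal sketch of how an external file producer could follow that advice is shown below; the file and directory names are placeholders only.

# write the file under a hidden (dot) name first, so the scanner ignores it
cp /staging/cdr_20150101.bin /data/in/.cdr_20150101.bin
# rename to the final name once the copy is complete
mv /data/in/.cdr_20150101.bin /data/in/cdr_20150101.bin

The same effect can be achieved by writing into a different directory on the same file system and moving the finished file into the scanned input directory.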

File Duplicate Check

Every processed filename is stored in a transient storage and a duplicate check is performed for each new file from the directory scanner. Duplicate filenames are moved to the duplicate directory. The duplicate check is performed with the base name of the file; thus the same filename in different directories will be recognized as a duplicate. After file processing the filenames are returned to this operator and the success flag indicates the status of the processing. If a file was not successfully processed, the name is removed from the transient filename storage, so failed files can be re-processed immediately after processing. Successfully processed files are marked as acknowledged in the transient store and are kept there at least until the next clean up. Additionally, the chain sink function writes these files into a persistent file name storage which is located in the checkpoint directory and moves the file to the archive directory. Files which are recognized by the directory scanner and are not yet acknowledged produce an error trace and are neither moved to the duplicate directory nor fed to the level1 split. These files are most probably large files still being copied to the input directory. Filenames that match the reprocess file pattern (File ingestion file-name pattern) bypass the duplicate check and are marked with the reprocess flag. These files can be processed more than once at any time. If the reprocess file pattern equals the process file pattern, this function is switched off completely; in particular, no transient and no persistent filename storage is used.

File Move

The file move function moves invalid files and duplicate files into the invalid and duplicate directories. These directories are subdirectories of the input directory, see: Input directory (on submission).

Checkpointing and Housekeeping

The processed files are stored in a transient and in a persistent store. After a crash the transient store can be recovered from the persistent one. During start up, the file ingestion function searches for checkpoint files from all chain sinks and stores the content of these files in the transient checkpoint store and additionally in a new checkpoint file. In parallel the file time is supervised, and files which are older than the parameter File ingestion deduplication window are dropped. Housekeeping of the deduplication stores must be performed during start-up, and subsequent clean-ups may be triggered. The schedule of these trigger events can be controlled with parameters, see: Cleanup configuration. When a clean-up trigger arrives, the file processing is stopped by the application control function and a trigger event is submitted. When the trigger event arrives at the file ingestion function, the clean up is performed like in the start up case, except that files which are not yet acknowledged are never removed from the transient store.

File Split Overview

Figure 23: Multilevel-Split at Filename Ingestion

The level1id represents the context ID and the level2id is the channel ID. As a kind of address information, level1id and level2id are passed as composite parameters to each ChainProcessor composite and its child composites. In addition, these two IDs are present as tuple attributes in several stream schemas in order to route the tuples to the corresponding receivers.

Round Robin Split

Figure 24: Round-Robin Split

Filename tuples are distributed with a round robin algorithm. Depending on the interval at which files land in the input directory and on the file sizes, it can occur that chains are not equally loaded.

Recommendation: Choose this configuration for small files or for files landing continuously in the input directory.

Configuration parameter: FILEINGESTION_ACKNOWLEDGED_LEVEL2_SPLIT=0

Acknowledged Level2-Split

Figure 25: Acknowledged Level2-Split

Filename tuples are sent one by one to the chain.

Advantage: Perfectly load balanced chains
Disadvantage: Overhead in control tuples and an increased number of threads processing the acknowledge tuple in the Level2Split PE
Recommendation: Choose this configuration for files of different sizes and for files landing in chunks in the input directory

Configuration parameter: FILEINGESTION_ACKNOWLEDGED_LEVEL2_SPLIT=1

Chain processing

The working chain is split into two functional blocks, the ChainProcessor and the ChainSink. Depending on the job configuration a context block is placed between them. If a context is configured, the PostContextProcessor is added to the ChainSink. In the PostContextProcessor the detected duplicates are rejected if deduplication is enabled. It is possible to replace the PostContextProcessor with custom code using the Transformer postcontext composite configuration parameter. Thus you can plug in your transformer code before (Transformer Composite) and/or after the context. Reading of input files takes place in the FileReader composite. It is possible to include multiple file readers, e.g. one parser per file type (Reader file parser list). At the end of each file a punctuation is sent on the Data Stream to indicate the end of file. In addition, the FileReader sends one statistic tuple at the end of each file containing information about the parsed file (e.g. number of read/sent records, bytes, errors etc.) on the Statistic Stream.

Figure 26: Files created in processing chain

Each chain produces output files belonging to an input file and processes one file after another. Therefore the output files contain data from a single input file only and have the input filename as part of the output filename. At the end of file processing the Chain Finalizer operator moves the processed input file to the archive directory, or, in case of detected errors (non-empty list of rstring attribute in the statistic tuple), the input file is moved to the failed directory. The output files written in the processing chain are moved by the Chain Finalizer operator to the output/load directory if the file is processed successfully without any errors. Invalid tuples can be rejected at the Validator, Transformer or PostContextProcessor operator if required by the business logic.

Checkpointing

The ITEMain job ensures data integrity on a per-file basis. Successfully processed files result in committed output files and input files moved to the archive. After a job restart, ITEMain does not read already processed files since they are moved to the archive directory. Incomplete file processing (e.g. the job is canceled or a PE crashed) causes output files to remain in the working directory (not moved to the commit directory) and input files to remain in the input directory (not moved to the archive). For data consistency we need to ensure that data held in memory by operators in the processing context is saved as a checkpoint file to disk. In addition, to detect duplicate filenames, the successfully processed files are written to a checkpoint file in order to recover this list of filenames after a job restart. For each input file a corresponding checkpoint file is written:
- Record Deduplication: the BloomFilter state can be recovered by reading the hashcode files (checkpoint files containing the values of the hashcode attribute created by the business logic code to detect duplicates).

- Custom Context: a custom operator in the processing context (e.g. for aggregation) needs to store the tuple attributes used by this operator.

File Processing

Figure 27: Checkpointing during file processing

For each context one input file related checkpoint file is written. These files are moved to the checkpoint-commit directory if the input file is processed successfully.

Recover with input file related checkpoint files after job restart

Figure 28: Recover on job start

For each processed input file a file related context checkpoint file exists. All these files are stored in context specific directories and are read during job startup to recover the context.

Graceful shutdown

Figure 29: Graceful shutdown

In addition to canceling a Streams job, the TEDA application framework offers a job shutdown command. This is the recommended shutdown method. The teda-shutdown-job command should be used instead of streamtool canceljob.

Advantages of graceful shutdown:
1. The job waits until each chain has completed the processing of the file in use.
2. The job creates context checkpoints for a faster job startup. The next time the job is started it will use state checkpoint files for each context instead of reading the huge number of file based context checkpoint files.

Restrictions of graceful shutdown:
1. In case of an unhealthy job state (e.g. a PE is down) the job is only canceled by the teda-shutdown-job script.
2. Writing of state checkpoint files is omitted if any file processing failed, e.g. incomplete file processing.

In the cases above, the input file related checkpoint files are used on the next job start to recover the contexts.

Recover with state checkpoint files on job startup:
1. Clean-up of hash code files and custom context checkpoint files. Based on the file attribute mtime, files that are older than the days to keep are deleted.
2. Reading of the BloomFilter checkpoint file to restore the BloomFilter state in each context
3. If implemented, the custom context can read its state checkpoint file

Custom Context

For file related checkpointing of the custom context, the developer does not need to take care of writing and reading the input file related checkpoint files. This is handled by the common ITEMain code. The tuple attributes used in the custom context need to be specified in the type definition of ContextCheckpointStreamType in the ContextDataProcessor.spl file.

Figure 30: Custom Context composite

On graceful shutdown the ContextCustom composite receives one tuple on the Command stream. It is required to respond to this command tuple on the Command Response Stream. The command tuple has a command attribute of type rstring with the value write in the shutdown phase. If the custom logic writes one or more state checkpoint files, then the attribute checkpointfiles needs to be filled. Otherwise keep it empty and set the success attribute to true. On startup a read command is received to restore the custom context state from the checkpoint file. If the state checkpoint file does not exist, the ContextCustom operator is trained on the Recovery stream with the data stored via the Checkpoint Stream during file processing, and no read command is triggered on the command stream. At the end of the training phase a punctuation is received on the Recovery Stream and it is required to forward this punctuation to the Recovery Response Stream. The state is not only cleared and reloaded on job startup: the ITEMain Scheduler also triggers the refresh of context resources. During refresh a clear command is received on the Command Stream. After clearing your custom context, you need to send a clear response on the Command Response Stream. Finally the data tuples on the Recovery Stream shall be used to build up the context state again. A window punctuation is used to signal the end of this refresh phase and must be forwarded to the Recovery Response Stream.

Checkpointing Configuration

The user can configure the checkpoint directory where all checkpoint files are located. There is no need to create this directory because it will be created at job startup if it does not exist. The name of the submission parameter to specify the checkpoint base directory is: checkpointdir

With the following configuration parameters in the configuration file config.cfg, the user can disable writing of checkpoint files:
CHAIN_SINK_DEDUP_CHECKPOINT_FILES_DISABLED=1
CHAIN_SINK_CUSTOM_CONTEXT_CHECKPOINT_FILES_DISABLED=1

By default checkpoint file writing is enabled. If it is disabled by the user, the context state cannot be recovered after a job restart.

The number of days the context resources are configured for can be set with the following configuration parameters:
DEDUP_BLOOM_DAYS_TO_KEEP
CONTEXT_CUSTOM_DAYS_TO_KEEP

File header descriptions of ITE applications

Rejection output

Column number | Name | Remarks
1 | Error code | Error code defined by parser
2 | Error text | Optional error description
3 | Invalid line | Invalid line or record in the input file

Table 1: Header of ITE rejection file

Chain deduplication statistics output file

Column number | Name | Sub-element | Remarks
1 | starttime | | checkpoint start time
2 | endtime | | checkpoint end time
3 | duration | | checkpoint duration
4 | filesprocessed | | number of processed files
5 | entries | | number of processed entries
6 | command | | checkpoint command
7 | checkpointfile | | checkpoint file pattern
8 | success | | is command achieved? true/false
9.1 | id | job Id | job Id of involved PE
9.2 | | PE number | PE Id
9.3 | | instance Id | instance Id
10 | logfiledate | | date of logfile

Table 2: Header of chain deduplication statistics output file

Statistics output

Column number | Name | Sub-element | Remarks
1 | filename | | name of processed file
2 | filetype | | type of file: CSV, BIN or ASN
3 | filesize | | size of file
4 | urgent | | is urgent? true/false
5 | reprocess | | in reprocess? true/false
6 | filetime | | timestamp of file
7 | processingstartedat | | begin of file processing
8 | processingstoppedat | | end of file processing
9 | duration | | duration time in seconds
10 | starttimestamp | |
11 | sentrecords | | number of sent valid records
12 | nrecordsdecodedasn1 | | number of ASN decoded records
13 | nbytesdecodedasn1 | | number of ASN decoded bytes
14 | nbytesreceivedasn1 | | number of ASN received bytes
15 | nbytesdroppedasn1 | | number of ASN dropped bytes
16 | recordstatsasn1 | | ASN statistics name
17 | nbytesreceivedbin | | number of BIN decoded records
18 | nbytesdroppedbin | | number of BIN decoded bytes
19 | recordcountsbin | | number of BIN received bytes
20 | bytecountsbin | | number of BIN dropped bytes
21 | recordstatsbin | | BIN statistics name
22 | nlinesdroppedcsv | | number of dropped lines in CSV
23.1 | recordstatscsv | name | BIN statistics name
23.2 | | errors | BIN statistics number of errors
23.3 | | records | BIN statistics number of records
24 | warnings | | number of file warnings
25 | errors | | number of file errors
26 | duplicate | | number of file duplicates
27 | invalidfile | | is invalid? true/false
28 | rejectedinvalids | | number of rejected invalid records
29 | recordduplicates | | number of record duplicates
30 | tablestats | |
31 | tablefiles | |
32 | chainsequencenumber | | sequence of processed records in chain
33 | level1id | | chain level 1 Id

51 Column number Name Sub-element Remarks 34 level2id chain level 2 Id 35 logfiledate date of the logfile Table 3: Header of ITE statistics file 4.3 Lookup Manager Architecture The main functionality of Lookup Manager is the preparation of lookup data in the repository used by ITE applications. The TEDA framework is designed for the setup of one or more ITE jobs and only one common Lookup Manager acting as single writer job for the lookup repository. This design warrants integrity of data in the repository. The application control composite is responsible for coordination of the write process and the file processing in the ITE application. This warrants that the implemented business logic use the same lookup data from repository in ITE job during the processing of the whole ITE input file. The lookup repository is distributed to all hosts. It is implemented in the design using separate hosttag definition for the host cache processor and hostexlocation placement feature. The Figure 31 shows all possible process flows in the application. Figure 31: Application design of Lookup Manager General process flow in the Lookup Manager The entry point of Lookup Manager process is the command file reader block that scans the input directory and reads the input command file. The content will be forwarded to command handler that checks the number of commands in the file and sends a file end trigger with number of commands to the result collector on punctuation after last read command. In the next block, the command request is queued by application control till all Page 51 of 118

52 processes in ITE jobs are stopped. The processing and run states of the application control are described in Application management. If CSV files are defined as single kind of source with lookup data then the command is only forwarded to command splitter composite directly else it will pass the way to the ByPassCheck operator. It checks the kind of the command. In case the command type includes the pre-/postfix identifying CSV files as the source, the command will be forwarded to the command splitter directly. In other case it will be forwarded to the DB check Switch operator that forward the command request to the command splitter in dependency on results provided by DBStatusChecker operator. If database is defined as the single kind of source for the lookup repository then the application control composite sends the command request to the DB check Switch operator directly. It forwards the command request in dependency on database status as usual. The DBStatusChecker operator run scheduled checks of the availability of the database, of database connections and user settings and the availability of required database tables periodically. Passing the tests it creates result metrics and it sends the results to the switch operator that pass the command to command splitter on positive result status. In this way, the possibility of database user lock will be reduced. The command splitter prepares the content of command request. It selects applications and segments that needs to be processed. This information including command type is sent to the host cache processor as a trigger. The source splitter determines the sources and prepares the read request to the source. It addresses the ODBC segment reader with ODBCRun operator in case of the database query for exact one corresponding repository segment or it addresses the CSV segment reader using CSVParser operator via FileSource operator in case of CSV file as the input source for one exact corresponding repository segment. The ODBC and CSV segment reader exist for only one repository segment. If more segments are defined then each segment will define own segment reader in parallel. In case of error processing in CSV segment reader, the error log will be traced. In case of error processing in ODBC segment reader, the error log will be traced and the processing element will be stopped. The results of segment reader is provided to host cache processor via command multiplier for each repository segment in parallel. The command multiplier multiplies the command request times number of hosts defined by hostsnum. The host cache processor is responsible for write processing on one host, for collection of all results and for reporting of write process status to result collector. Using the trigger information received at command trigger handler form command splitter composite, each host cache processor collects this information and it can verify the status of each affected repository segment in the write host control operator. The command splitter sends the read data including the command type information and finishes the completed request with punctuation after submission of last data tuple. These data are received by segment writer in segment processor of host cache processor in parallel for each repository segment. The segment writer is responsible for deletion and creation of a shared memory segment and shared memory map stores for one repository segment. The definition of shared memory segments is described in LookupMgrCustomizing.xml file. 
The segment writer operator removes the shared memory segment if it exists and creates a new shared memory segment on the initial command type. It opens the memory segment on the update command type. During the initial write process, the operator only creates new elements in the defined map stores. The update process can append new map elements in the stores or replace existing elements by key definition. Statistic metrics are created and sent to the metric writer on each data tuple. After the last data tuple, on the punctuation, the metrics are sent to the metric writer and the status report is sent to the write host control operator by the segment writer. The write host control operator is present once per host. It collects the reports from all segment writers and sends the host completion report to the result collector after verification against the list of repository segments delivered in the trigger information from the command splitter composite. The result collector is responsible for the collection of the host results, the statistics creation for the statistics writer and the reporting of the command process completion to the application control composite. The result collector verifies the number of processed commands against the information delivered by the command handler. If all commands are processed, then the result collector moves the command file to the archive directory. The statistics writer generates the statistic files described in Table 5 into the folder defined by the statisticsdir parameter, see chapter Statistics directory on submission. If any command is queued in the application control, then the next command is processed; otherwise a start request is sent to the ITE jobs to enable the file processing in the ITE jobs.

Lookup Manager hosttag definitions

This description is only relevant in case of a multi-host solution. It is defined by setting the configuration parameter ENABLE_MULTIHOST. The main part of the Lookup Manager including the application control uses the hosttag defined by <namespace>_lookup_writer. The host cache processor uses the hosttag defined by <namespace>_lookup_host_writer. The hosttag <namespace>_lookup_host_writer must be placed on the hosts using the lookup repository defined by the ITE hosttag <namespace>_chain_<id>. The number of hosts is variable and must correlate with the number of <namespace>_lookup_host_writer hosttags. The default number of hosts is set to 1. It can be adjusted with the submission parameter hostsnum.

Lookup Manager file header descriptions

Command input file

Column number | Name | Remarks
1 | command | name of command type: initial, update or delete
2 | repository segment | optional: single repository segment name
3 | supported ITE jobs | optional: comma-separated list of ITE jobs defined by namespace

Table 4: Header of command input file

Lookup statistics output file

Column number | Name | Sub-element | Remarks
1 | filename | | name of processed file
2 | command | | command type: initial/update
3 | commandstartedat | | begin of file processing
4 | commandprocessedat | | end of file processing
5.1 | hoststatistics | hostname | list of host statistics by name of host including the list of repository segments
5.2.a | | segments.sgmnt | shared memory segment name
5.2.b | | segments.table | repository segment name
5.2.c | | segments.reserved | reserved but not allocated memory for shared memory in bytes
5.2.d | | segments.free | free memory in bytes
5.2.e | | segments.free % | free memory in %
5.2.f | | segments.processed | processed data at segment writer
6 | logfiledate | | date of the logfile

Table 5: Header of lookup statistics output file

5 Ingest Transform Enrich (ITE) Customizing

The main goal of the TEDA framework setup is the creation of a file-to-file ITE application. Typical telecommunication applications process files. The file naming schema usually includes origin information about the source of the file, like the network element, region or timestamp. This information is sometimes included in the content of the file. Sometimes a project does not require the de-duplication feature. Therefore different variants are provided in the TEDA framework depending on this information. The three provided ITE variants differ in context usage, de-duplication and the information included in the file naming. The variant selection is part of the tooling supporting the setup of the application. It is possible to change the variant selection during the customizing and configuration of the ITE jobs. During the coding phase of the business logic, a change of the ITE variant is still possible, but the effort increases because additional custom composites must be implemented or no longer required custom composites must be deleted.

5.1 Setup configuration

The setup process of TEDA requires the definition of some parameters that are used during the setup of the applications. The setup process is described in Customized application setup.

Configuration parameter Application control directory

This parameter is defined during the setup process but it can be changed on submission or as a configuration parameter of the application.
Name: APPL_CTRL_FILE_PATH
See chapter: Application control directory

5.2 ITE variants

Variant A File processing without special context processing

Figure 32: Variant A of ITEMain job

The application is free from resources which need to define a processing context. Choose variant A if neither record de-duplication nor custom context processing is required.

Configuration parameters to select Variant A:
CONTEXT_DISABLED=1
FILEINGESTION_LEVEL1_SPLIT_DISABLED=1
TRANSFORMER_LEVEL1_SPLIT=0

5.2.2 Variant B Context processing with split based on file content

Figure 33: Variant B of ITEMain job

Variant B is designed for projects processing files without any context information as part of the filename. The context split is done based on the file content, and every single data entity (tuple) determines the context.

Configuration parameters to select Variant B:
CONTEXT_DISABLED=0
FILEINGESTION_LEVEL1_SPLIT_DISABLED=1
TRANSFORMER_LEVEL1_SPLIT=1

5.2.3 Variant C Context processing with split based on file name

Figure 34: Variant C of ITEMain job

The filename determines the context. Select variant C if filenames contain context information. Scanned filenames are sent to the context specific processing chains.

Configuration parameters to select Variant C:
CONTEXT_DISABLED=0
FILEINGESTION_LEVEL1_SPLIT_DISABLED=0
TRANSFORMER_LEVEL1_SPLIT=0

5.3 Typical use-cases

There are three major variants (A-C). The variant needs to be defined during the setup phase. The use-cases are defined by adapting the configuration parameters in the config/config.cfg file that is located in each TEDA job structure.

Mediation

The mediation scope is the simplest use case the ITE application can be configured for. In the mediation scope each input file is converted to a single output file. For example, binary input files are transformed to ASCII output files.

Figure 35: Illustrated Mediation use case

Typical configuration parameters for the mediation use case:
- RecordStream type at the transformer output, no stream type conversion at the Transformer; the Transformer code sets or updates the existing schema only: TRANSFORMER_OUTPUT_TYPE=2
- RecordFileWriter: one FileSink for the common record stream: CHAIN_SINK_TYPE=1

Figure 36: Stream types in mediation use case

You may want to convert the stream types at the Transformer. For this you can define the TypesCustom.TransformedRecord type which is used in TypesCommon.RecordStreamType. For best performance try to avoid changing the schema and keep the TypesCommon.ReaderOutStreamType and TypesCommon.RecordStreamType the same. The TypesCommon.ChainSinkStreamType can be used to reduce the schema if you don't need all attributes to be written to the file. This can be customized with the TypesCustom.ChainSinkType schema.

5.3.2 Extract-Transform-Load

The main use case of the ITE application is in the Extract-Transform-Load (ETL) scope. Usually input files must be parsed and converted to a flat structure, and depending on the requirements the records must be transformed for several target tables.

Figure 37: Illustrated ETL use case for three tables

Typical configuration parameters for the ETL use case:
- Table stream type at the transformer output port: TRANSFORMER_OUTPUT_TYPE=0
- TableFileWriter (one FileSink per table): CHAIN_SINK_TYPE=0, CHAIN_SINK_TARGET_TABLES=TABLE.A,TABLE.B,TABLE.C

Figure 38: Stream types in ETL use case - convert to TableStreamType before Dedup

If ITE is configured for writing table files then there needs to be a conversion from the record type to the table stream type (see chapter ITE Record to table stream conversion). This can be done in the custom DataProcessor composite or after Dedup like in the figure below.

Figure 39: Stream types in ETL use case - convert to TableStreamType after Dedup

In case you want to convert to the table stream type after Dedup, then you need to enable the custom Transformer postcontext composite and the conversion must be coded in this composite. To achieve the configuration of the figure above you need to set the following parameters in the config.cfg file:
TRANSFORMER_POSTCONTEXT_COMPOSITE=PostContextDataProcessor
TRANSFORMER_OUTPUT_TYPE=2
CHAIN_SINK_TYPE=0
CHAIN_SINK_TARGET_TABLES=TABLE.A,TABLE.B,TABLE.C

5.3.3 Campaign-Management

The ITE application can be used for campaign management. For this use case the ITE needs to create output files that contain data aggregated across several input files and not only output files per input file. A combination of both output file types is still supported:
- output files related to a single input file
- output files based on several processed input files

Figure 40: Illustrated Campaign-Management use case

Figure 41: Stream types in campaign use case

To achieve the configuration of the figure above you need to set the following parameters in the config.cfg file:
CONTEXT_DISABLED=0
CONTEXT_CUSTOM_COMPOSITE_ENABLED=1
TRANSFORMER_POSTCONTEXT_COMPOSITE=PostContextDataProcessor
TRANSFORMER_OUTPUT_TYPE=2
CHAIN_SINK_TYPE=0
CHAIN_SINK_TARGET_TABLES=TABLE.A,TABLE.B,TABLE.C

The campaign management business logic needs to be implemented in the ContextDataProcessor.spl composite in the <APPLICATION_NAMESPACE>.context.custom namespace. For the context related output files this composite needs to have its own FileSinks. The FileWriters of the ChainSink cannot be used for this since they are input file related only and do not write custom context output. If you do not need the ChainSink FileWriter to write table files or record files, then you should not forward the data tuples received from the Context composite in the custom PostContextDataProcessor composite.

5.4 Context configuration

Besides the config.cfg file there is another important configuration file for the following context settings:
- Number of required contexts and context identifiers
- Number of processing chains per context
- Maximum BloomFilter entries per context

The name of this file can be changed with the configuration parameter CHAIN_MAPPING_FILE.

Figure 42: Context configuration file (CHAIN_MAPPING_FILE)

5.5 Adding additional toolkits for ITE application

The ITE application depends on the standard SPL toolkit and the teda toolkit only. Since the ITE application is built using an external builder from Streams Studio, the dependency settings in info.xml are not sufficient. Additional toolkit locations must be specified in the toolkitslist.xml file. The makefile checks if the toolkitslist.xml exists. If present, it is added with the -t option to the sc command. This ensures that the project is built with the same settings from the command line and from Streams Studio. Find a sample below for adding the rules toolkit dependency in the toolkitslist.xml file in the project directory:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<toolkitlist xmlns=" xmlns:xsi=" xsi:schemalocation=" toolkitlistmodel.xsd">
  <toolkit directory="/absolute/path/to/streams_install/toolkits/com.ibm.streams.rules"/>
</toolkitlist>

5.6 External trigger for initialization phase

The initialization phase can be controlled via a startup control file. Use this if the ITE application should not start the initialization of context resources during startup before external activities are completed. For example, the ITE application is configured as variant B or C, the BloomFilters are used for table row de-duplication, and we need to prepare the hashcode files as a preceding activity:
1. The hashcode files are built from table row tuples by a CustomTableReader job, which reads the table rows from a database.
2. In the CustomTableReader job the hashcode files are created with the same column attributes as in the ITEMain Transformer business logic.
3. The CustomTableReader job writes the CONTEXT_STARTUP_CTRL_FILE.
4. The BloomFilter training waits until the CONTEXT_STARTUP_CTRL_FILE contains done.
5. The ITEMain job reads the hashcode files as checkpoint files.

Figure 43: Context startup control
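As a minimal sketch of steps 3 and 4 above: the external job signals completion by writing the word done into the file configured as CONTEXT_STARTUP_CTRL_FILE. The path below is only a placeholder; use whatever value your configuration defines for this parameter.

# signal that the hashcode file preparation is finished (path is a placeholder)
echo "done" > /path/to/appl/control/context_startup.ctl

Until this file contains done, the ITE job defers the BloomFilter training; afterwards it continues with step 5 and reads the prepared hashcode files.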

6 Lookup Manager Customizing

The Lookup Manager job is responsible for the preparation of the lookup data in the lookup repository. The lookup repository is the shared memory storing the extracted data that are required for the implementation of the business logic of the ITE application. The source of the data can be a DB2 or Oracle database, or files in CSV format. The use-cases are described in Typical use-cases. The preparation of the lookup repository is triggered at run-time by a special file in the input folder of the Lookup Manager job. The use-cases are defined by adapting the configuration parameters in the config/config.cfg file that is located in each TEDA job structure. The user specific data like segments and store structures or streams schemas and data assignments are defined in the LookupMgrCustomizing.xml file that is located in the root directory of the Lookup Manager project. The general tuple flow is described in chapter General process flow in the Lookup Manager.

All configuration parameters are involved in the make process of the TEDA applications. Some of them can be changed on submission of the job. In this case the configuration parameter provides the default setting for the corresponding submission time parameter.

6.1 Customizing file

The Lookup Manager job does not require any coding. The user defines the required settings in a customizing file LookupMgrCustomizing.xml. There are 3 main parts that must be defined for each application that should be supported by the Lookup Manager:
- Command mapping: the definition of the supported command types and supported segments. An example of a command mapping definition is shown in Figure 45.
- Segment customizing: the segment definition including the shared memory segment name and size. It includes stores, keys and values and the assignment to the streams schema definition. An example of a segment customizing definition is shown in Figure 46.
- Streams schemas: the definition of the streams schemas assigned to the lookup segments, including all required attributes and types for the segment customizing. Examples are described in Lookup customizing file, see Figure 49, and in Lookup customizing file and connections.xml file, see Figure 51.

The definition of these parts depends on the use-cases described in Typical use-cases. The application names are defined by their namespace definition.

Figure 44 presents an example of the definition of application namespaces in the LookupMgrCustomizing.xml file.

Figure 44: Sample definition of ITE jobs

In the first part of the ITE application definition, the user must define the commands that need to be supported by the ITE application. The LookupCommand attribute of the CommandMapping element defines the type of the command supported by the specific application. The elements assigned to a specific command type are defined by SegmentName elements. The value must match a repository segment name defined in the segment customizing. Figure 45 shows an example of the command mapping.

Figure 45: Example of command mapping

Figure 46 shows an example of the repository segment customizing. The repository segment is defined by Name. It correlates to one shared memory segment defined by MemSegmentName. The reserved size of the repository segment must be defined by the user. It is not the allocated size, but the maximum possible allocation size of the memory. The segment can define one or more stores. They are defined by Name in the StoreDefinitions element. A store is defined by SPLKeyAssigment and SPLValueAssigment elements. The SPLValueAssigment block defines the tuple attributes that form the map value. It defines the SPL types, the attribute names and the assignment of the value attributes to the attributes of the incoming stream.

Each value is defined by a key. The key is defined by one SPLKeyAssigment element. It defines the SPL type of the key and the assignment to an attribute or combined attributes of the incoming streams.

Figure 46: Example of segment customizing

The content of this file is used for code generation at compilation of the Lookup Manager application.

6.2 Command file

The command file must have the extension *.cmd. Figure 47 presents an example and the relationships. The Lookup Manager supports the following command types in the command file:
- initial: this command creates new shared memory segments
- update: this command updates the content of existing shared memory segments
- delete: this command removes the content of shared memory map elements defined by key attributes. This is supported with CSV input files only.

The file can define one or more command lines, each including one command type. The lines are separated into 3 columns by semicolons. The header of the command file is described in the chapter Command input file. The first column <command> is the mandatory one. The user can decide at run-time which source must be used for the preparation of the repository. Of course, the sources must be available. Only in case both sources, CSV and DB, are enabled, the user selects the file source processing by adding a prefix file_ or postfix _file to the <command-type> in the command file, e.g. file_update. The delete command sets the file request automatically. The database commands must not define any pre-/postfix at the <command-type>, e.g. initial. Only one shared memory segment as <segment> can be addressed by one command line. For more than one segment, the user must define additional lines in the command file.
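To make the column layout concrete, the following is a minimal sketch of a command file (the file name my_commands.cmd is a placeholder; the segment name DimMaster1 and the job namespaces itejob.a and itejob.b are taken from the example in Figure 47). An empty second or third column addresses all known segments or applications, as described below.

initial;DimMaster1;itejob.a,itejob.b
update_file;DimMaster1;
update;;itejob.a

The first line runs the initial command for the repository segment DimMaster1 for the two listed ITE jobs, the second line updates DimMaster1 for all known applications from the CSV file source, and the last line updates all known segments for the job itejob.a only.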

The Lookup Manager supports the command processing for one or more <applications> defined by the unique application name or a comma separated list of unique application names, e.g. job.a,job.b. An empty segment or application column addresses all segments and/or applications determined from the LookupMgrCustomizing.xml file that defines the customizing settings. The location of a command file is the sub-directory 'cmd' of the input folder of the Lookup Manager job. The path to the input folder can be defined with the submission parameter 'inputdir'.

Figure 47: Example of Command File and references

This figure shows an example of a command file, here called initial_all.cmd, that is located in the default folder ./data/in/cmd. In the first line, the initial command is processed on the lookup segment DimMaster1. The jobs itejob.a and itejob.b are explicitly addressed to be supported in the command processing. The second line processes an update command on the DimMaster1 lookup segment for all known applications using a CSV file as source. In this case the database and CSV files are defined in parallel as sources of data for the lookup repository, so the _file postfix is required to trigger the read processing from the CSV file. The last line starts the update command on all known lookup segments for the itejob.a application.

6.3 Setup configuration

The setup process of TEDA requires the definition of some parameters that are used during the setup of the applications. The setup process is described in Customized application setup.

Configuration parameter Application control directory

This parameter is defined during the setup process but it can be changed on submission or as a configuration parameter of the application.

Name: APPL_CTRL_FILE_PATH
See chapter: Application control directory

6.4 Typical use-cases

Lookup repository using CSV files as input source

Description

The easiest lookup configuration is the delivery of comma-separated files (CSV) including the lookup data. This can be any generated content, e.g. a CSV file created by Open Office Calc or a database table export in form of a CSV file, e.g. a DEL file. It is useful to store the content of whole CSV files, or only selected columns, in the lookup repository because of the faster access to the data during file processing. In this case there is no database code included in the generated SPL sources. The general process is described in chapter General process flow in the Lookup Manager. In this use-case, the application control provides the command requests directly to the command splitter composite. Figure 48 describes the process flow in this specific use-case.

Figure 48: Architecture and information flow in case of CSV input

Required configuration settings

Lookup customizing file

The LookupMgrCustomizing.xml file includes definitions that are used for code generation during compilation of the job implementation. Only the StreamsSchema element is involved in the implementation of the use-case. The SegmentName attribute defines the repository segment. The same name is required for the CSV input file. The naming schema of this file is defined by <SegmentName>.csv, e.g. DimMaster1.csv. The deletion command requires an additional extension: <SegmentName>.del.csv, e.g. DimMaster1.del.csv. If a command is applied but the input CSV files <SegmentName>.csv or <SegmentName>.del.csv are not provided, the job will stop because of the missing input data. In this case the Lookup Manager job must be canceled and submitted again. The initial command must be processed, otherwise an inconsistent content of the repository could exist. The SegmentName attribute value must be unique for each content. If the same content is needed by another ITE application, then it must be defined once for each application as a copy. If the content of the input files differs, then separate files must be created and they must be identified by different file names and SegmentName attributes. The indexing of columns must be defined in the same LookupMgrCustomizing.xml file; it is required for the assignment of the data in the columns of the CSV file to SPL tuple attributes. The indexing begins with 0 and the CSV must define a constant number of comma separated columns. The attribute type mapping is done by the CSV parser automatically. Figure 49 shows the cross-references between the customizing file, the configuration settings and the input source.

Figure 49: Example of CSV input in LookupMgrCustomizing.xml

DB disabled parameter
Name: DBDISABLED
Value to set: '1'
See chapter: DB disabled

Enable input file parameter
Name: LOOKUP_ENABLE_INPUT_FILE
Value to set: '1'
See chapter: Enable input file

Optional configuration settings

There are no further optional configuration parameters to be set.
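As an illustration only (the column values below are made up, and the real column-to-attribute assignment comes from the index definitions in LookupMgrCustomizing.xml), a lookup input file DimMaster1.csv with three comma separated columns, indexed 0 to 2, could look like this:

4915201234567,DE,PREPAID
4915207654321,DE,POSTPAID
4366412345678,AT,PREPAID

Every row must provide the same constant number of columns; the CSV parser maps the values to the SPL attribute types automatically.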

6.4.2 Lookup repository using database as input source

Description

Most telecommunication projects store their history data and master data in a database. The required business logic only uses a subset of the data. The streams processing requests the lookup information for every tuple passing the business logic composites. Sending database queries continuously during the file processing is a huge challenge for the performance of the whole system and of the database. It is useful to create the lookup repository for this subset of data in shared memory because this kind of data is seldom updated in the database by the system, e.g. once per day. The general process is described in chapter General process flow in the Lookup Manager. In this use-case, the application control composite sends the command request directly to the DBCheck Switch operator, which forwards the request to the command splitter composite depending on the results of the DBStatusChecker operator. Figure 50 describes the process flow in this specific use-case.

Figure 50: Architecture and information flow in case of database input

Required configuration settings for database use-case

The usage of a database as source for the lookup repository requires the installation and configuration of the database, which is not in the scope of this document. Only the DB toolkit and the Lookup Manager specific settings that are required for the job configuration are described in this document.

Lookup customizing file and connections.xml file

The entries in the LookupMgrCustomizing.xml file must correlate with the settings in the connections.xml file required by the com.ibm.streams.db toolkit. Each Lookup Manager job requires the definition of the repository segment schemas for each application in the LookupMgrCustomizing.xml file. The SegmentName attribute must match the name of the access_specification defined in the connections.xml file. This file also holds the connection_specifications defined by the name attribute, e.g. SAMPLE, and the definition of all database queries including the user_connection and native_schema settings. The name of the access_specification must carry the prefix 'lookup_' for the lookup processing on the repository segment. This prefix followed by the name of the segment forms the full required name of the access_specification, e.g. lookup_dimmaster1. The Name attributes of the SchemaValueDefinition elements in the LookupMgrCustomizing.xml file must match the name attributes of the column element definitions used in the native_schema element of the connections.xml file. The SPLType attributes of the SchemaValueDefinition elements must follow the type mapping described by the com.ibm.streams.db toolkit for the corresponding settings in the SchemaValueDefinition elements and column elements. Figure 51 shows the cross-references between the customizing file, the configuration settings and the input source.

Figure 51: Example of attribute mapping in LookupMgrCustomizing.xml and connections.xml files

Database configuration

The Lookup Manager supports 2 kinds of databases: DB2 and Oracle. The environment settings for the implementation and driver support are required. These settings are described in detail in the com.ibm.streams.db toolkit documentation. If a DB2 database is selected as the repository source, then the DB2 profiles must be sourced. The used DBStatusChecker operator requires a UnixODBC installation next to the DB2 or Oracle installation. All information about the installation and configuration of UnixODBC is available in the com.ibm.streams.db toolkit documentation and on the internet.

75 The Table 6 gives short overview about the required environment variables mentioned in the database descriptions. DB2 value ORACLE value UNIXODBC_HOME <path-to-unixodbc> <path-to-unixodbc> ODBCSYSINI <path-to-odbc.ini> <path-to-odbc.ini> STREAMS_ADAPTERS_ODBC_DB2 yes or 1 unset STREAMS_ADAPTERS_ODBC_ORACLE unset yes or 1 STREAMS_ADAPTERS_ODBC_INCPATH <db2-path>/include $UNIXODBC_HOME/include STREAMS_ADAPTERS_ODBC_LIBPATH <db2-path>/lib64 $UNIXODBC_HOME/lib ORACLE_HOME unset <path-to-oracle-home> PATH $UNIXODBC_HOME/bin: <path-to-db2>:$path $UNIXODBC_HOME/bin: $ORACLE_HOME: $ORACLE_HOME/bin:$PATH LD_LIBRARY_PATH $UNIXODBC_HOME/lib: $LD_LIBRARY_PATH Table 6: Overview environment setting for both database variants DB disabled parameter Name: DBDISABLED Value to set: '0' See chapter: DB disabled DB vendor parameter Name: DBVENDOR Value to set: 'DB2' or 'ORACLE' See chapter: Error: Reference source not found DB connection parameter Name: DBCONNECTION Value to set: user defined string See chapter: DB connection Enable input file parameter Name: LOOKUP_ENABLE_INPUT_FILE Value to set: '0' See chapter: Enable input file Optional configuration settings for database use-case DB name parameter Name: DBNAME Value to set: '' See chapter: DB name DB user parameter Name: DBUSER $UNIXODBC_HOME/lib: $ORACLE_HOME/lib: $LD_LIBRARY_PATH Page 75 of 118

76 Value to set: '' See chapter: DB user DB password parameter Name: DBPWD Value to set: '' See chapter: DB password Lookup repository using database and CSV files as input source Description The most telecommunication projects stores the history data and master data in database. The required business logic only uses the subset of the data. The streams processing requests the lookup information for every tuple passing the business logic composites. This is huge challenge for the performance of whole system and database to send the database queries continuously during the file processing. It is useful to create the lookup repository for this subset of data in shared memory because this kind of data is seldom updated in the database by the system, e.g. once per day. The general process is described in chapter General process flow in the Lookup Manager. In this use-case, the application control composite sends the command request to the ByPassCheck operator that checks the kind of source defined in the command type attribute. In case of file source, it forwards the tuples to the command splitter composite. In other case, the DB status decides about the forwarding of the command request via Switch operator that is triggered by DbStatuschecker operator. The Figure 52 describes the process flow in this specific use-case. Figure 52: Architecture and information flow for both source types of input Page 76 of 118

Required configuration settings
The settings are required for the database as well as for the CSV files. The configuration for the database is set as described in chapter Required configuration settings for database use-case; the settings for the CSV files are described in chapter Required configuration settings.

DB disabled parameter
Name: DBDISABLED
Value to set: '0'
See chapter: DB disabled

DB vendor parameter
Name: DBVENDOR
Value to set: 'DB2' or 'ORACLE'
See chapter: DB vendor

DB connection parameter
Name: DBCONNECTION
Value to set: user defined string
See chapter: DB connection

Enable input file parameter
Name: LOOKUP_ENABLE_INPUT_FILE
Value to set: '1'
See chapter: Enable input file

Optional configuration settings
The optional settings are described in the database solution in chapter Optional configuration settings for database use-case.

Simulation mode

Description
The simulation mode does not use any source for the lookup repository; it simulates the output of a source by sending empty data. This use-case is useful to simulate the command processing in the Lookup Manager and the setting of the application control master states. The command request flow is processed on all levels.

Required configuration settings

DB disabled parameter
Name: DBDISABLED
Value to set: '1'
See chapter: DB disabled

Enable input file parameter
Name: LOOKUP_ENABLE_INPUT_FILE
Value to set: '0'
See chapter: Enable input file
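For the simulation mode the config/config.cfg therefore only needs these two entries (illustrative notation, quoting as generated for your project):

DBDISABLED=1
LOOKUP_ENABLE_INPUT_FILE=0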

Optional configuration settings
There is no optional configuration setting.

7 Developer hints

7.1 Application namespace
The Namespace setting is the core of the namespace used in the implementation of the SPL applications and the major definition of the application structure. The namespace defined by this parameter is the location of all common composites of the TEDA framework in the TEDA application. All composites that have to be customized by the TEDA user must be located in the customer namespace, which extends the defined namespace with '.custom'. The namespace must be unique for each TEDA application. Figure 53 presents an example of the namespace definition for framework composites and custom composites.

Figure 53: Namespace definition

The application namespace is used by:
- Application control as a job identifier
- Output file naming
- Hosttag definition

Figure 54 presents the relationships between the configuration parameter and the generated output.

Figure 54: Application namespace in TEDA framework

7.2 Setup definition file
The setup definition file is an XML based description of the structure and the configuration of all TEDA application projects. The structure of the default setup definition file is shown in Figure 55. It defines three major parts:

Common settings
There are two common parameters in the common settings:
- Project base path describes the path to the base of all TEDA projects. Defined as PROJECT_BASE in the Interactive setup dialog.

- Application control path describes the path to the folder that shares the application control master states of the Lookup Manager and the client states of the ITE applications. This configuration parameter is described in chapter Application control directory.

ITE applications
The setup definition supports the setup of none, one or more ITE applications. There are only three mandatory attributes:
- Namespace defines the structure of the application. The namespace must be unique for each application. The corresponding parameter in the Interactive setup dialog is APPLICATION_NAMESPACE.
- Application folder is the name of the application project folder in the workspace. The corresponding parameter in the Interactive setup dialog is APPLICATION_FOLDER.
- Variant defines the variant of the ITE application: A, B or C. The details about the ITE application variants are described in chapter ITE variants.
- Lookup disabled defines whether the ITE job is controlled by the application control of the Lookup Manager. The detailed information is described in Disable Lookup.

Lookup Manager application
The setup definition supports the setup of none or one Lookup Manager application. There are two mandatory attributes and two mandatory sub-elements. The attributes are defined like in the ITE application:
- Namespace like in the ITE application.
- Application folder like in the ITE application.
The elements are:
- DbDisabled enables or disables the database as repository source. The corresponding parameter in the Interactive setup dialog is DB_DISABLED. The configuration parameter is described in DB disabled.
- EnableInputFile enables or disables the usage of CSV input files as repository source. The corresponding parameter in the Interactive setup dialog is LOOKUP_ENABLE_INPUT_FILE. The configuration parameter is described in Enable input file.
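The default setup definition file ships as etc/templates/TelcoFrameworkSample.xml and is shown in Figure 55. As a rough orientation only, such a file could look like the sketch below; every element name, attribute spelling and path in this sketch is illustrative and must be taken from the provided template:

<TedaSetup>
  <Common>
    <ProjectBasePath>/home/streamsadmin/teda_applications</ProjectBasePath>
    <ApplicationControlPath>/home/streamsadmin/shared/applControl</ApplicationControlPath>
  </Common>
  <ITEApplication Namespace="itejob.a" ApplicationFolder="ITEAppl" Variant="A" LookupDisabled="0"/>
  <LookupManagerApplication Namespace="common.lookup" ApplicationFolder="LookupMgr">
    <DbDisabled>1</DbDisabled>
    <EnableInputFile>1</EnableInputFile>
  </LookupManagerApplication>
</TedaSetup>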

Figure 55: Example of setup definition file

The following example shows the creation of the TEDA applications using the provided TelcoFrameworkSample.xml file:

teda-setup-projects -x /home/streamsadmin/TelcoFrameworkSample.xml
/home/streamsadmin/TelcoFrameworkSample.xml validates
Selected job type: LookupMgr
Selected job type: ITEMain
Selected variant: Context processing with split based on file name in File Ingestion (default)
Creating project folder: /home/streamsadmin/teda_applications/lookupmgr
Job (LookupMgr - common.lookup) finished.
Creating project folder: /home/streamsadmin/teda_applications/iteappl
Job (ITEMain - itejob.a) finished.
[streamsadmin@host]$

7.3 External builder Makefile
The TEDA framework provides its own Makefile that is the source of the external builder in Streams Studio. The Makefile syntax is described in the GNU make documentation. This TEDA specific Makefile defines a few input attributes; they are described by the help target:

[streamsadmin@host]$ cd /home/streamsadmin/teda_applications/iteappl
[streamsadmin@host]$ make help
make <all clean help> [SPLC_CMD_ARGS=<value> TOOLKITLIST_PATH=<value> SPL_MAIN_COMPOSITE_LIST=<value> SPL_MAIN_COMPOSITE=<value> SPLC_FLAGS=<value>]
Input definitions:
SPLC_CMD_ARGS            - spl-compiler command arguments e.g. "appldbg=g" (default: empty)
TOOLKITLIST_PATH         - path to toolkitslist.xml
ADDITIONAL_TOOLKITS      - the ':'-separated list of single toolkit paths. default: .
SPL_MAIN_COMPOSITE_LIST  - optional file listing the main composite spl/splmm files (current: /home/streamsadmin/teda_applications/iteappl/main_composites.lst)
SPL_MAIN_COMPOSITE       - optional main composite name.
SPLC_FLAGS               - flags for the spl-compiler. default: (empty)

An example of the make command using an optional attribute:

[streamsadmin@host]$ make all SPLC_FLAGS='-j 4'

In this case the make process starts 4 C/C++ compiler processes in parallel.

7.4 ITE FileReader customizing
The ITE job has three kinds of built-in FileReaders with different parser operators:
1. FileReaderASN1 using the ASN1Parse operator of the TEDA toolkit
2. FileReaderCSV using the CSVParse operator of the TEDA toolkit
3. FileReaderStructure using the StructureParse operator of the TEDA toolkit
One of these three composite names needs to be selected in the FileReaderCustom composite (<APPLICATION_NAMESPACE>.chainprocessor.reader.custom). As the next step you need to customize the FileReader composite. Depending on the selected FileReader composite a different set of parameters can be set.

Figure 56: FileReader composites overview

The operator can be customized by setting the parameters for the FileReader composite in FileReaderCustom, or it can be replaced there entirely.

Custom File Parser
Furthermore you can choose to integrate your own custom file parser if your file type is not supported by the integrated FileReaders. Use the file CustomParserTemplate.spl in the <APPLICATION_NAMESPACE>.chainprocessor.reader.custom namespace as a coding template when integrating your own file parser. You need to meet the following requirements when developing your custom FileReader composite:
- Forward the file information of the input stream to the Statistic Stream.
- Optional: Forward the file information of the input stream to the Data Stream if attributes are required by your custom business logic.
- Send a punctuation at end of file.
- Send the statistic tuple after sending the punctuation.

7.5 ITE Multiple FileReaders in one job
Add multiple FileReader composite names to the Reader file parser list to support file type specific parsers. At build time the FileReaderCore composite adds the FileTypeSplit operators and distributes the filename tuples according to the filetype attribute to the corresponding FileReader composite. It is required that all FileReaders have the same output schema (Reader output schema).

Figure 57: Multiple FileReaders

a) Create a copy of the FileReaderCustom.spl file in the <APPLICATION_NAMESPACE>.chainprocessor.reader.custom namespace
b) Rename the file name and the composite name

The corresponding sample configuration parameters for the figure above are:

READER_FILE_PARSER_LIST=TYPE1 FileReaderCustom1,TYPE2 FileReaderCustom2,TYPE3 FileReaderCustom3
FILEINGESTION_FILE_TYPE_VALIDATOR_COMPOSITE_ENABLED=1

The file type attribute must be set based on the filename in the <APPLICATION_NAMESPACE>.fileIngestion.custom::FileTypeValidator composite: to TYPE1 for files to be processed by FileReaderCustom1, to TYPE2 for files to be processed by FileReaderCustom2 and to TYPE3 for files to be processed by FileReaderCustom3.

7.6 ITE Record to table stream conversion
If you have configured the Transformer Output Type to the TableStream type (TRANSFORMER_OUTPUT_TYPE is set to 0 or 1) then you need to add the

TableRowGenerator to convert your stream type to the required Transformer output stream type.

Figure 58: Transformer stream types

The TableRowGenerator is already included with
use <APPLICATION_NAMESPACE>.chainprocessor.transformer::TableRowGenerator;
in the <APPLICATION_NAMESPACE>.chainprocessor.transformer.custom::DataProcessor.spl composite. Add the operator at the end of your transformer flow. It supports one or more input streams and converts the TypesCustom.TableStreamTypes to the required output format having tablename and tablerow attributes. The tablerow attribute contains the prepared string to be loaded into the database.

7.7 ITE Punctuation handling in Transformer Composite
When developing custom code in the Transformer composite you need to take care that the punctuations are forwarded. You may update the statistic tuple; if you do not, the template code takes care that it is forwarded.

Figure 59: Transformer Composite

Furthermore you should not increase the number of punctuations inside the Transformer composite: one punctuation received on the Transformer input port should result in exactly one punctuation at the Transformer output port. The Statistic stream is punctuation free.

7.8 ITE Split streams in Custom Transformer
If you add a Split operator in your custom Transformer composite to split your business logic (e.g. per file type), then you need to reduce the punctuations created by the Split operator. The punctuation received on the input port of the Split operator is sent to all output ports, and sending too many punctuations at the Transformer output needs to be avoided.

Figure 60: Split streams in Transformer composite

7.9 ITE Parallel channels in Custom Transformer
For performance reasons you might want to use the user-defined parallelism feature in the custom Transformer composite to create multiple threads that process the tuples in multiple channels. As a precondition you need to disable the Parallelism Feature that would otherwise be used in the ITEMain composites for the processing chains: set PARALLELISM_FEATURE_ENABLED=0 in the config.cfg file. A sketch of such a parallel region is shown below.
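The following SPL fragment is only a rough sketch of this pattern; it assumes that the custom DataProcessor produces a stream called Transformed, that the output type is named TableStreamType and that the PunctReduce composite has exactly one input and one output port. The real type names and composite signatures are defined by the generated code of your project, and PunctReduce is imported with use <APPLICATION_NAMESPACE>.utility::PunctReduce; at the top of the file.

// Hypothetical excerpt from DataProcessor.spl; the width, stream names and types are illustrative.
@parallel(width = 4)
stream<TableStreamType> Rows = TableRowGenerator(Transformed) { }

// Reduce the punctuations of the parallel channels before they reach the Transformer output.
stream<TableStreamType> RowsOut = PunctReduce(Rows) { }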

Figure 61: Channels in Transformer composite

In Figure 61 the TableRowGenerator is running in parallel channels. Since all operators of the ChainProcessor are fused into one PE, each TableRowGenerator operator runs in its own thread. Add the PunctReduce operator behind the parallel region to receive the output streams of the channels, in order to avoid that too many punctuations are sent to the Transformer output. The PunctReduce operator can be found in your project directory: <APPLICATION_NAMESPACE>.utility::PunctReduce

8 Operations

8.1 Application management

8.2 Trouble shooting

Lookup Manager trouble shooting

Application setup does not create the application structures
ERROR:

teda-create-projects -x /home/streamsadmin/etc/templates/TelcoFrameworkSample.xml
etc/templates/TelcoFrameworkSample.xml validates
Selected job type: LookupMgr
Selected job type: ITEMain
Selected variant: Context processing with split based on file name in File Ingestion (default)
EXIT: Error: /home/streamsadmin/teda_applications/lookupmgr already exists. To continue use '-b' for backup or '-f' force remove

Solution:
The path to the TEDA application, here /home/streamsadmin/teda_applications/lookupmgr, defined in the setup definition file, here /home/streamsadmin/etc/templates/TelcoFrameworkSample.xml, exists, but the user has not defined how to handle the existing application path. One of the parameters '-b' for backup or '-f' for force remove must be set when calling the setup command.

Submission of job fails because of missing hosttags
The TEDA framework uses framework specific hosttags for the multi-host configuration. They must be set in the configuration of the domain.

ERROR:
st submitjob output/lookupmanagermain/common.lookup.lookupmanagermain.adl
...
CDISR3027E The job submission, or processing element restart, failed. A placement that meets the constraints of all the processing elements was not found. For more information, see the messages that follow this one.
CDISR3041E Job ID 0 cannot be submitted because the job placement constraints cannot be met. For more information, see the messages that follow this one.
...
CDISR3070I None of the hosts can be found in the LookupManagerPool host pool.
...
ensure that the hosts with the following host pool tags are configured and started. The host pool tags are: common_lookup_lookup_writer.
...
[streamsadmin@host]$

Solution:
Add the hosttags to the domain as described in Domain and instance preparation.

Lookup repository initialization fails
ERROR:
The initial command file of the Lookup Manager is still present in the input directory and the initialization fails. The ITE application state files are not present or do not change their status in the control directory.

Solution:
The path to the control directory is not set correctly for the Lookup Manager and ITE jobs. The paths must not deviate from each other; every job must point to the same folder. Check the parameter APPL_CTRL_FILE_PATH in the config/config.cfg file of each ITE application and of the Lookup Manager application. Check the submission parameter applcontroldir in the Streams console or the configured submission parameters in Streams Studio for the correct setting. Finally, recompile the jobs in case of configuration changes, or cancel and submit the jobs again using the correct parameters.

Corrupt shared memory segments
ERROR:
The shared memory segments are corrupt and any access to the segments is impossible; the corrupt memory segments must be removed.

Solution:
There are two ways to remove the shared memory segments:
- host restart
- segment deletion from the file system
The easiest one is the restart of the host. The business needs sometimes require the host to keep running, so that a restart is not allowed. The shared memory segments are stored as devices in the Linux file system; deleting them requires root permission. The user must delete the segments from the file system using root permissions on each involved host. Finally, the user must start the Lookup Repository initialization using the initial command described in chapter Command file. An example of the deletion of segment segmaster1:

[streamsadmin@host]$ ls /dev/shm/
segmaster1  segmaster2
[streamsadmin@host]$ sudo rm /dev/shm/segmaster1
[sudo] password for streamsadmin:
[streamsadmin@host]$ ls /dev/shm/
segmaster2
[streamsadmin@host]$ mv data/in/cmd/archive/initial_all.cmd data/in/cmd/
[streamsadmin@host]$ ls /dev/shm/
segmaster1  segmaster2

Lookup reader does not read shared memory segments
It is sometimes required to check the content of the shared memory segments. Each build of the Lookup Manager application creates a standalone lookup repository reader that can read the shared memory segments created by the Lookup Manager job.

ERROR:
[streamsadmin@host]$ cd <project-root>/output/lookuprepositoryreadermain
[streamsadmin@host]$ ./bin/standalone
Processing started on 'r901h2kc' started at Thu Jun 26 15:03
26 Jun 15:03 [31391] ERROR #splapptrc,j[0],p[0],readsegement_segmaster1_mapname1,shmstore M[shmSegment.cpp:shmSegmentOpen:116] - shmsegmentopen can not open segment with name="segmaster1" reason=no such file or directory
26 Jun 15:03 [31391] ERROR #splapptrc,j[0],p[0],readsegement_segmaster2_mapname2,shmstore M[shmSegment.cpp:shmSegmentOpen:116] - shmsegmentopen can not open segment with name="segmaster2" reason=no such file or directory
Processing finished successful.
[streamsadmin@host]$

The output 'No such file or directory' coming from the Linux operating system is typical for a missing shared memory segment.

Solution:
- The Lookup Manager job has not been submitted or the write processing failed. Submit the Lookup Manager job again, if it is not submitted, and process the initial command with content: initial;;
- The host has been rebooted and the shared memory segments are not present. Start the initialization command on the Lookup Manager.
- If the initial command was processed successfully and the shared memory segments are still not present, then check the Lookup Manager statistics for processed records in the writer. If the number of processed entries is 0, then:
  - check the configuration of your use-case, to make sure the simulation mode is not configured

  - in case of a CSV source, check that the input CSV files are not empty and that the records are valid
  - in case of a database, check whether the DB connection works; the DBStatusChecker operator writes an error log in case of a missing connection
  - in case of a database, check the syntax and the content of the database query and verify with database tooling that the query returns any output

Lookup Repository Reader
The Lookup Repository Reader is an additional tool that supports the user in the investigation of lookup faults. This tool is able to read the content of the shared memory segments and stores. Per default, the tool dumps the data to CSV files in the <project-root>/data/dump folder. This tool can bring running lookup processes into trouble and create inconsistent repository output; any job accessing the shared memory segments must not run while the Lookup Repository Reader is running. The handling is very easy; the --help output describes all parameters:

[streamsadmin@host]$ cd <project-root>/output/lookuprepositoryreadermain
[streamsadmin@host]$ ./bin/standalone --help
InfoSphere Streams - Standalone SPL Application (InfoSphere Streams)
-h, --help                Display this help and exit.
-d                        (Deprecated) Log/Trace level: 0 - ERROR, 1 - INFO.
-c, --log-to-stderr       Output trace messages to standard error. The default is standard output.
-k, --kill-after=value    Shut down the application after N seconds where N is a positive floating point number.
-l, --log-level=int       Log level: 0 - ERROR, 1 - WARN, 2 - INFO.
-t, --trace-level=int     Trace level: 0 - OFF, 1 - ERROR, 2 - WARN, 3 - INFO, 4 - DEBUG, 5 - TRACE.
Additional command line arguments can be specified by using <name>=<string-literal>. The optional arguments are:
[common.lookup.lookupreader::lookuprepositoryreadermain.]outdir ("dump"),
[common.lookup.lookupreader::lookuprepositoryreadermain.]table (""),
[common.lookup.lookupreader::lookuprepositoryreadermain.]count ("0"),
[common.lookup.lookupreader::lookuprepositoryreadermain.]index ("0"),
[common.lookup.lookupreader::lookuprepositoryreadermain.]segment (""),
[common.lookup.lookupreader::lookuprepositoryreadermain.]store (""),
[common.lookup.lookupreader::lookuprepositoryreadermain.]print ("false"),
[common.lookup.lookupreader::lookuprepositoryreadermain.]append ("false").

The following example shows how to use the tooling:

[streamsadmin@host]$ cd <project-root>/output/lookuprepositoryreadermain
[streamsadmin@host]$ ./bin/standalone print=true
Processing started on 'r901h2kc' started at Thu Jun 26 16:17
Segment=segMaster1;Store=MapName1;Type=MAP;index=0;Key=NE1;Value={CODE="NEXX01",NE_ADDRESS=" ",MCCMNC="4930",NAME="NE1"}
Segment=segMaster1;Store=MapName1;Type=MAP;index=1;Key=NE2;Value={CODE="NEYY03",NE_ADDRESS=" ",MCCMNC="4940",NAME="NE2"}
Segment=segMaster2;Store=MapName2;Type=MAP;index=0;Key=0102;Value={SERVICE_CODE="0102",TIME_START=" :00:00.0",TIME_END=" :03:15.4"}
Segment=segMaster2;Store=MapName2;Type=MAP;index=1;Key=SMS0304;Value={SERVICE_CODE="SMS0304",TIME_START=" :01:55.8",TIME_END=" :04:00.0"}
Segment=segMaster2;Store=MapName2;Type=MAP;index=2;Key=ABC0102;Value={SERVICE_CODE="ABC0102",TIME_START=" :01:55.8",TIME_END=" :04:00.0"}
Processing finished successful.
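To dump only one specific store instead of the complete repository, the named arguments can be combined; the values below are illustrative:

[streamsadmin@host]$ ./bin/standalone segment=segMaster1 store=MapName1 count=10 print=true

This restricts the output to the first 10 entries of the store MapName1 in the shared memory segment segMaster1 and prints them to the terminal.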

Lookup Repository Reader parameter
There are the following parameters:
- outdir: Output path for the output dumps. If relative, then relative to the data directory. Default: "dump"
- table: name of one specific repository segment. If empty then all segments. Default: ""
- count: number of entries to be dumped. If "0" then all entries. Default: "0"
- index: first entry that should begin the dump output. It begins with "0". Default: "0"
- segment: name of one specific shared memory segment. If empty then all segments. Default: ""
- store: name of one specific store in the shared memory segment. If empty then all stores. Default: ""
- print: Possible values: true or false. If true then print to standard output (terminal). Default: "false"
- append: Possible values: true or false. If true then append to an existing dump file. Default: "false"

9 References

Parameter            | Chapter
APPL_CTRL_FILE_PATH  | Application control directory
ENABLE_MULTIHOST     | Enable Multi-host
Table 7: Common configuration parameters

Parameter       | Chapter
applcontroldir  | Application control directory on submission
Table 8: Common submission parameter

Parameter                 | Chapter
DBCONNECTION              | DB connection
DBDISABLED                | DB disabled
DBNAME                    | DB name
DBPWD                     | DB password
DBUSER                    | DB user
DBVENDOR                  | DB vendor
LOOKUP_ENABLE_INPUT_FILE  | Enable input file
Table 9: Lookup Manager configuration parameters

Parameter                                             | Chapter
CHAIN_EXCEPTION_HANDLER_ENABLED                       | Exception handler
CHAIN_MAPPING_FILE                                    | Context configuration file
CHAIN_SINK_ARCHIVE_INPUT_FILES_IN_DATE_DIR            | Chain Sink archive input files in date dir
CHAIN_SINK_AUDIT_TABLE_WRITER_ENABLED                 | Chain Sink Audit table file writer
CHAIN_SINK_CUSTOM_CONTEXT_CHECKPOINT_FILES_DISABLED   | Chain Sink Custom checkpoint files
CHAIN_SINK_DEDUP_CHECKPOINT_FILES_DISABLED            | Chain Sink Dedup Checkpoint files
CHAIN_SINK_FILE_WRITER_CUSTOM_COMPOSITE               | Chain Sink File Writer custom composite
CHAIN_SINK_TABLE_FILES_OUTPUT                         | Chain Sink table files output
CHAIN_SINK_TARGET_TABLES                              | Chain Sink target tables
CHAIN_SINK_TYPE                                       | Chain Sink Type
CLEANUP_SCHEDULE_HOUR                                 | Cleanup configuration
CLEANUP_SCHEDULE_MDAY                                 | Cleanup configuration
CLEANUP_SCHEDULE_MINUTE                               | Cleanup configuration
CLEANUP_SCHEDULE_WDAY                                 | Cleanup configuration
CONTEXT_CUSTOM_COMPOSITE_ENABLED                      | Context custom composite
CONTEXT_CUSTOM_DAYS_TO_KEEP                           | Custom context days to keep
CONTEXT_DISABLED                                      | Context
CONTEXT_STARTUP_CTRL_FILE                             | Context startup control file
DEDUP_BLOOM_DAYS_TO_KEEP                              | Dedup Bloom days to keep
DEDUP_BLOOM_P                                         | Dedup Bloom probability
DEDUP_DISABLED                                        | Dedup
DISABLE_LOOKUP                                        | Disable Lookup
ENABLE_CUSTOM_CODE                                    | Enable custom code
FILEINGESTION_ACKNOWLEDGED_LEVEL2_SPLIT               | File ingestion level2 split
FILEINGESTION_DIR_SCAN_GET_NANOSECONDS                | File ingestion time precision
FILEINGESTION_FILE_DEDUP_DAYS_TO_KEEP                 | File ingestion deduplication window
FILEINGESTION_CUSTOM_FILE_TYPE_VALIDATOR_ENABLED      | File ingestion file type customization
FILEINGESTION_FILEDATE_IN_FILENAME_PAT_COUNT          | File ingestion file-date from file-name extraction
FILEINGESTION_FILEDATE_IN_FILENAME_PAT_<index>        | File ingestion file-date from file-name extraction
FILEINGESTION_FILEDATE_ORDER_<index>                  | File ingestion file-date from file-name extraction
FILEINGESTION_LEVEL1_ID_EXTRACTION_PATT               | File ingestion context split
FILEINGESTION_LEVEL1_SPLIT_DISABLED                   | File ingestion context split
FILEINGESTION_PROCESS_FILE_PAT                        | File ingestion file-name pattern
FILEINGESTION_REPROCESS_FILE_PAT                      | File ingestion file-name pattern
FILEINGESTION_SCAN_SLEEP_TIME                         | File ingestion scanner configuration
FILEINGESTION_SORT_ATTRIBUTE                          | File ingestion sort configuration
FILEINGESTION_SORT_ORDER                              | File ingestion sort configuration
HOSTPOOL_CUSTOM                                       | Custom host pools
PARALLEL_CHAINS                                       | Parallel Chains
PARALLELISM_FEATURE_ENABLED                           | Parallelism Feature
READER_CUSTOM_FILE_STATISTICS                         | Reader custom file statistics
READER_CUSTOM_PARSER_STATISTICS                       | Reader custom parser statistics
READER_FILE_PARSER_LIST                               | Reader file parser list
READER_OUTPUT_SCHEMA                                  | Reader output schema
READER_PRE_FILE_READER_COMPOSITE_ENABLED              | Reader pre-file reader composite
READER_ENCODING_ENABLED                               | Reader CSV file encoding
READER_FILE_COMPRESSION                               | Reader file compression
TAP_COMPOSITE_FOR_POST_CONTEXT_DATA_PROCESSOR_BUNDLE  | Tap composite for post context data processor bundle
TAP_COMPOSITE_FOR_TRANSFORMER_BUNDLE                  | Tap composite for Transformer bundle
TAP_POST_CONTEXT_DATA_PROCESSOR_OUTPUT_FOR_BUNDLE     | Tap post context data processor bundle
TAP_TRANSFORMER_OUTPUT_FOR_BUNDLE                     | Tap Transformer bundle
TRANSFORMER_COMPOSITE                                 | Transformer Composite
TRANSFORMER_LEVEL1_SPLIT                              | Transformer Level-1-Split
TRANSFORMER_LOOKUP_SCHEMA                             | Transformer Lookup Schema
TRANSFORMER_OUTPUT_TYPE                               | Transformer Output Type
TRANSFORMER_POSTCONTEXT_COMPOSITE                     | Transformer postcontext composite
Table 10: ITE configuration parameters

Parameter         | Chapter
applcontrolfiles  | Application control files on submission
csvfilesdir       | CSV source directory on submission
dbname            | Database name on submission
dbpass            | Database password on submission
dbuser            | Database user on submission
hostsnum          | Number of hosts on submission
inputdir          | Input directory on submission
statisticsdir     | Statistics directory on submission
Table 11: Lookup Manager submission parameters

Parameter              | Chapter
chains                 | Number of Chains (on submission)
chains[00..99]         | Number of Chains per Context (on submission)
checkpointdir          | Checkpoint directory (on submission)
cleanupschedulehour    | Cleanup schedule Hour (on submission)
cleanupschedulemday    | Cleanup schedule Month day (on submission)
cleanupscheduleminute  | Cleanup schedule Minute (on submission)
cleanupschedulewday    | Cleanup schedule Week day (on submission)
daystokeep             | Filename deduplication window (on submission)
disablelookup          | Disable lookup (on submission)
inputdir               | Input directory (on submission)
inputdirlist           | Input directory list file (on submission)
outputdir              | Output directory (on submission)
rawfiletap             | File ingestion tap option (on submission)
statisticsdir          | Statistics directory (on submission)
tapdir                 | File ingestion tap directory (on submission)
Table 12: ITE submission parameters

9.1 Configuration parameter

Common configuration parameter

Application control directory
Name: APPL_CTRL_FILE_PATH
Description: The application control master of the Lookup Manager and the application control clients of the ITE jobs have to exchange and synchronize their status information. This directory is the main synchronization point. If the TEDA applications run on more than one host, then this folder must be located in a shared file system. The default setting for this parameter must be set in the setup phase, see chapter Get started. Figure 62 presents the relationships between the configuration parameter and the file system. The directory can be changed by the submission time parameter applcontroldir.

Figure 62: Location of application control directory

o Range: String defining a directory path
o Default: no default value

Enable Multi-host
Name: ENABLE_MULTIHOST
Description: The multi-host solution generates hostpools and hosttags for the distribution of the repository cache to different hosts. If the multi-host feature is disabled, then neither hosttags nor host-pools are generated; the standard InfoSphere Streams hosttags are used for this single host solution. If the multi-host solution is enabled, then the generated SPL code requires the setting of hosttags on the configured hosts.
o Range: '1' or '0'
  '1' The multi-host configuration is enabled. The generated SPL code requires hosttag definitions. They are described in Domain and instance preparation, Lookup Manager hosttag definitions or in Customizing after setup.
  '0' The multi-host configuration is disabled and all PEs must run on the same host.
o Default: '0'
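In the config/config.cfg of an ITE application or of the Lookup Manager application these two common parameters could, for example, be set as follows; the path is a placeholder and must point to a directory on a shared file system in a multi-host setup:

APPL_CTRL_FILE_PATH=/shared/teda/applControl
ENABLE_MULTIHOST=0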

ITE application

Parallelism Feature
Name: PARALLELISM_FEATURE_ENABLED
Description: Parameter to enable the user-defined parallelism feature. If enabled, the number of parallel chains can be increased at job submission time with the submission parameter chains. Otherwise the number of chains is generated at compile time only and cannot be changed at submit time anymore. You need to disable this feature if you want to use the user-defined parallelism feature in your custom code, since nested parallel regions are not supported.
o Range: 0 or 1
  0: Fixed number of chains
  1: Number of chains can be changed at submit time
o Default: '0'

Parallel Chains
Name: PARALLEL_CHAINS
Description: Parameter to define the number of parallel worker chains for variant A and B. For variant C this parameter is ignored, since the number of chains per context is configured in the CHAIN_MAPPING_FILE file. The number of parallel chains can be increased at job submission time with the submission parameter chains if the Parallelism Feature is enabled.
o Range: 1..n

Custom host pools
Name: HOSTPOOL_CUSTOM
Description: Optional parameter to add custom host pools. If custom code requires operators to be placed with custom host tags, then define your custom host tags with this parameter. Multiple host tags can be separated by comma.
o Range: One or more host tags separated by a comma
o Example: 'hosttag1,hosttag2'

Disable Lookup
Name: DISABLE_LOOKUP
Description: Parameter to disable the lookup code. The ITE application runs independently from the Lookup Manager if the lookup code is disabled.
o Range: 0 or 1
  0: ITE is controlled by the Lookup Manager
  1: ITE lookup code is disabled
o Default: '0'

Exception handler
Name: CHAIN_EXCEPTION_HANDLER_ENABLED
Description: Optional parameter to add an operator into the processing chain that catches all exceptions. Since all operators are fused in the ChainProcessor, the exceptions raised in the reader or transformer composites are caught, and this prevents a PE crash. A file statistic tuple is generated and the input file is moved to the failed directory.

Recommendation: Enable this parameter in variant A only, because in the other variants the logical processing chain is spread across multiple PEs.
o Range: 0 or 1
  0: disabled
  1: enabled
o Default: '0'

Context configuration file
Name: CHAIN_MAPPING_FILE
Description: Optional parameter to change the name of the context configuration file. The parameter is obsolete for variant A. If not specified, the default filename is used. See the Context configuration chapter.
o Range: String defining the filename
o Default: './config/chainspercontextmapping.txt'

Cleanup configuration
The framework logic performs a cyclical clean-up of the file de-duplication store, the record deduplication storage and the checkpoint file directory. These parameters define the schedule of this operation. A start-up of the framework performs one clean-up independent of the schedule. These parameters are obsolete if file-name deduplication (see File ingestion file-name pattern) and record deduplication (see Dedup) are disabled.
A schedule is determined with four components: mday, wday, hour and minute. Each schedule component is either '*', which means all, or a comma separated list of ranges. A range may be a single digit or two digits separated with a hyphen. The digits must be within the value margins. The value margins are:
mday:   1..31
wday:   0..6 with 0=Sunday .. 6=Saturday
hour:   0..23
minute: 0..59
An empty component never produces a match.
Hint: The ranges must not contain any additional spaces or separators.
To determine the point in time when a trigger tuple has to be generated, all components are combined with a logical AND conjunction; mday and wday are combined with a logical OR conjunction. The decision to generate a trigger is made with the following logical operation:
cleanup = ((mday==now) || (wday==now)) && (hour==now) && (minute==now)
Example: Produce a trigger tuple every day, every six hours:
mday: ''; wday: '*'; hour: '0,6,12,18'; minute: '0';

Example: Produce a trigger tuple every day, in the hours 7am..10am and 2pm..7pm, at minute 10:
mday: ''; wday: '*'; hour: '7-10,14-19'; minute: '10';

CLEANUP_SCHEDULE_HOUR
Description: This optional parameter defines the hour part of the schedule.
Range: a schedule component with values '0'..'23'
Default: '0,6,12,18'

CLEANUP_SCHEDULE_MDAY
Description: This optional parameter defines the mday part of the schedule.
Range: a schedule component with values '1'..'31'
Default: ''

CLEANUP_SCHEDULE_MINUTE
Description: This optional parameter defines the minute part of the schedule.
Range: a schedule component with values '0'..'59'
Default: '15'

CLEANUP_SCHEDULE_WDAY
Description: This optional parameter defines the wday part of the schedule.
Range: a schedule component with values '0' - Sunday, '1' - Monday, '2' - Tuesday, '3' - Wednesday, '4' - Thursday, '5' - Friday, '6' - Saturday
Default: '*'
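As an illustration, a schedule that runs the clean-up every day at 02:15 and 14:15 could be configured like this (quoting as generated in your config.cfg):

CLEANUP_SCHEDULE_MDAY=''
CLEANUP_SCHEDULE_WDAY='*'
CLEANUP_SCHEDULE_HOUR='2,14'
CLEANUP_SCHEDULE_MINUTE='15'

The empty mday component never matches, while the wday component '*' matches every day, so the combined mday/wday decision is true for every day.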

File ingestion level2 split
Name: FILEINGESTION_ACKNOWLEDGED_LEVEL2_SPLIT
Description: This parameter selects the type of the level2-split.
Range: '0' or '1'
  '0' Round Robin Level2-Split
  '1' Acknowledged Level2-Split
Default: '1'

File ingestion time precision
Name: FILEINGESTION_DIR_SCAN_GET_NANOSECONDS
Description: If this optional parameter is enabled, the nanoseconds part of the scanned file time is evaluated. If this parameter is disabled, all nanoseconds fields are set to zero by the directory scanner. You might disable this option if your filesystem does not support nanoseconds precision.
Range: '0' or '1'
  '0' disabled
  '1' enabled
Default: '1'

File ingestion deduplication window
Name: FILEINGESTION_FILE_DEDUP_DAYS_TO_KEEP
Description: This parameter defines the number of days the filenames are held in the list to detect duplicate files. After this period the entries are cleaned up from the persistent and transient filename deduplication storage.
Range: 1..n
Default: '7'

File ingestion file type customization
Name: FILEINGESTION_CUSTOM_FILE_TYPE_VALIDATOR_ENABLED
Description: This parameter is used to control the customization of the file type attribute. If disabled, the default file type attribute is delivered. If enabled, the custom operator '<APPLICATION_NAMESPACE>.fileIngestion.custom::FileTypeValidator' is used to generate the file type attribute. The file type attribute is used to enable different file readers for different file types. The parameter Enable custom code disables this function globally.
Range: '0' or '1'
  '0' disabled
  '1' enabled
Default: '0'

File ingestion file-date from file-name extraction
This optional set of parameters defines a list of patterns and orders to extract the file-time from the file-name. The parameters are evaluated if the parameter FILEINGESTION_SORT_ATTRIBUTE equals 4 (which is GetTimeSortAttributeFromName); in all other cases these parameters are ignored.

Name: FILEINGESTION_FILEDATE_IN_FILENAME_PAT_COUNT
Description: This parameter defines the count of the following pattern and order lists.
Range: '0'..n
Default: '1'

Name: FILEINGESTION_FILEDATE_IN_FILENAME_PAT_<index>
Description: This indexed parameter list defines the list of search patterns to extract the file-date string from the file-name. <index> must be replaced with the list index starting from zero. Each regular expression in the list must have exactly one part in brackets, which isolates the date from the rest of the file-name. If the file-name does not match at least one list entry, the file is considered an invalid file and moved to the invalid files folder.
Range: valid regular expressions
Default: '_([0-9]{8})'
Note: These parameters are SPL string literals. Thus a backslash character must be escaped with a backslash.

Name: FILEINGESTION_FILEDATE_ORDER_<index>
Description: This indexed parameter list defines the file-date string format. This list must correspond to the pattern list. Valid orders (of year, month, day, hour, minute and second) are: YYYYMMDDhhmmss, YYYYMMDD, MMDDYYYY, MMDDYYYYhhmmss, DDMMYYYY, DDMMYYYYhhmmss, YYYY_MM_DD_hh_mm_ss, MM_DD_YYYY_hh_mm_ss, DD_MM_YYYY_hh_mm_ss, YYYY_MM_DD_hh_mm_ss_mmm, MM_DD_YYYY_hh_mm_ss_mmm, DD_MM_YYYY_hh_mm_ss_mmm
Default: YYYYMMDD
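For example, to take the file date from file names like CDR_20140626_0001.DAT (an illustrative name), the eight-digit date string after the first underscore can be extracted and interpreted as YYYYMMDD with settings along these lines; FILEINGESTION_SORT_ATTRIBUTE=4 activates the extraction (see File ingestion sort configuration):

FILEINGESTION_SORT_ATTRIBUTE='4'
FILEINGESTION_FILEDATE_IN_FILENAME_PAT_COUNT='1'
FILEINGESTION_FILEDATE_IN_FILENAME_PAT_0='_([0-9]{8})'
FILEINGESTION_FILEDATE_ORDER_0='YYYYMMDD'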

File ingestion context split
This set of parameters is used for the customization of the region split.

Name: FILEINGESTION_LEVEL1_SPLIT_DISABLED
Description: This optional parameter disables the file ingestion context split.
Range: '0' or '1'
  '0' disabled (means the split is enabled)
  '1' enabled (means the split is disabled)
Default: '0'

Name: FILEINGESTION_LEVEL1_ID_EXTRACTION_PATT
Description: This parameter defines a regular expression which is used to extract the level 1 id (the context) from the file-name. The expression must have exactly one part in brackets, which isolates the level 1 id from the rest of the file-name. If the file-name does not produce a pattern match, it is assigned to the default context. The context configuration is defined in Context configuration file. The parameter is required when the file ingestion context split is enabled.
Range: regular expression
Default: none

File ingestion file-name pattern
Name: FILEINGESTION_PROCESS_FILE_PAT
Description: This optional parameter defines the process file pattern. Matching file-names are recognized by the file ingestion logic and checked whether they were already processed in the past; if so, the files are moved to the duplicate files folder. If this pattern equals FILEINGESTION_REPROCESS_FILE_PAT, the file name de-duplication is disabled completely.
Range: regular expressions
Default: '.*\\.DAT$'

Name: FILEINGESTION_REPROCESS_FILE_PAT
Description: This optional parameter defines the file name pattern for the re-process files. Matching file-names are recognized by the file ingestion logic and bypass the duplicate check. If this pattern equals FILEINGESTION_PROCESS_FILE_PAT, the file name de-duplication is disabled completely.
Range: regular expressions
Default: ''
Note: These parameters are SPL string literals. Thus a backslash character must be escaped with a backslash.

File ingestion scanner configuration
Name: FILEINGESTION_SCAN_SLEEP_TIME
Description: This optional parameter defines the sleep time between the scan cycles of the file ingestion logic in seconds. This time is never reduced in high load situations.
Range: a floating point value > 0
Default: '5.0'

File ingestion sort configuration
With these parameters it is possible to configure the sort function of the file ingestion logic. By default the file-names are not sorted; in contrast to the DirectoryScanner operator of the SPL toolkit, the TEDA directory scanner delivers the file-names unsorted. Additionally, the file-time attribute can be overwritten here; by default the file-time attribute is taken from the m-time of the file. There are some built-in sort functions available, and the user can provide a special custom sort algorithm. If the parameter Enable custom code is disabled, these parameters are forced to their defaults.

Name: FILEINGESTION_SORT_ATTRIBUTE
Description: This optional parameter defines the extraction of the sort attribute. The sort attribute can be used in the downstream sort operator.

Range:
  '0' GetNoSortAttribute: provides no sort attribute; use the default m-time as file-time
  '1' GetTimeSortAttribute: the sort attribute is taken from the file m-time; use the default m-time as file-time
  '2' GetNameSortAttribute: the sort attribute is taken from the file-name; use the default m-time as file-time
  '3' GetSizeSortAttribute: the sort attribute is taken from the file size; use the default m-time as file-time
  '4' GetTimeSortAttributeFromName: the file-time and the sort attribute are taken from the file name and converted into a time. If this conversion fails, the file is marked as invalid and moved into the invalid files folder. The parameters File ingestion file-date from file-name extraction are used to define the patterns and orders.
  '5' use the custom get sort attribute operator: <APPLICATION_NAMESPACE>.fileIngestion.custom::GetSortAttribute
Default: '0'

Name: FILEINGESTION_SORT_ORDER
Description: This optional parameter is used to define the sort order.
Range:
  '0' no sort
  '1' sort ascending by sort attribute. The sort window is one scan cycle.
  '2' sort descending by sort attribute. The sort window is one scan cycle.
  '3' enable the custom sort operator: <APPLICATION_NAMESPACE>.fileIngestion.custom::FileSort
Default: '0'

Context startup control file
Name: CONTEXT_STARTUP_CONTROL_FILE
Description: Optional parameter to trigger the initialization of context resources (e.g. BloomFilters). The file is checked on startup only. The context resource training waits until CONTROL_CONTEXT_STARTUP_CTRL_FILE contains done. The file must be located in the control directory. See the 'External trigger for initialization phase' chapter.
o Range: String defining the filename
o Default: 'tablereader.ctl'

Reader custom file statistics
Name: READER_CUSTOM_FILE_STATISTICS
Description: Optional parameter to enable custom file statistics. Use the TypesCustom::CustomFileStatisticsStreamType to add attributes to the statistic schema. It should be used if CHAIN_SINK_TYPE != 0.
o Range: '0' or '1'
  '0' disabled
  '1' enabled

o Default: '0'

Reader custom parser statistics
Name: READER_CUSTOM_PARSER_STATISTICS
Description: Optional parameter to enable custom parser statistics. Use the TypesCustom::CustomParserStatisticsStreamType to define the parser statistic output stream type. It should be used to integrate your own parser.
o Range: '0' or '1'
  '0' disabled
  '1' enabled
o Default: '0'

Reader file parser list
Name: READER_FILE_PARSER_LIST
Description: This parameter is used in the FileReaderCore composite to integrate one or more parser operators. Add multiple composites to the list if you have multiple parsers.
o The parameter value contains a list of file types and parser composites: <FILE_TYPE> <COMPOSITE>,...
o Example for 3 parsers: 'TYPE1 FileReaderCustom1,TYPE2 FileReaderCustom2,TYPE3 FileReaderCustom3'
o Default: 'TYPE0 FileReaderCustom'

Reader output schema
Name: READER_OUTPUT_SCHEMA
Description: Output schema of the FileReader composite.
o Range: Any custom stream type
o Default: 'ReaderRecordType'

Reader pre-file reader composite
Name: READER_PRE_FILE_READER_COMPOSITE_ENABLED
Description: This parameter enables the integration of the pre-file reader composite.
o Range: '0' or '1'
  '0' disabled
  '1' enabled (<APPLICATION_NAMESPACE>.chainprocessor.reader.custom::PreFileReader is integrated)
o Default: '0'

Reader CSV file encoding
Name: READER_ENCODING_ENABLED
Description: This parameter enables the encoding parameter in FileReaderCSV. If enabled, the encoding parameter is passed from FileReaderCustom to the FileSource in FileReaderCSV.
o Range: '0' or '1'
  '0' disabled
  '1' enabled
o Default: '0'

Reader file compression
Name: READER_FILE_COMPRESSION
Description: This parameter is used in the composites FileReaderASN1, FileReaderCSV and FileReaderStructure to enable the compression parameter for the FileSource operator. The default compression mode is gzip and can be changed in FileReaderCustom by setting the compression parameter. Enable this parameter only if your input files are compressed. The configuration parameter value supports a comma separated list of FileReader composites.
o Range: 'FileReaderCSV,FileReaderASN1,FileReaderStructure'

Tap composite for post context data processor bundle
Name: TAP_COMPOSITE_FOR_POST_CONTEXT_DATA_PROCESSOR_BUNDLE
Description: Optional parameter to set the name of the custom composite that receives the tap streams from all PostContextDataProcessor composites.
o Range: name of the custom composite
o Default: 'PostContextDataProcessorTap'

Tap composite for Transformer bundle
Name: TAP_COMPOSITE_FOR_TRANSFORMER_BUNDLE
Description: Optional parameter to set the name of the custom composite that receives the tap streams from all TRANSFORMER_COMPOSITE composites.
o Range: name of the custom composite
o Default: 'TransformerTap'

Tap post context data processor bundle
Name: TAP_POST_CONTEXT_DATA_PROCESSOR_OUTPUT_FOR_BUNDLE
Description: This parameter enables the post context data processor tap. If enabled, the composite configured with TAP_COMPOSITE_FOR_POST_CONTEXT_DATA_PROCESSOR_BUNDLE is used.
o Range: '0' or '1'
  '0' disabled
  '1' enabled
o Default: '0'

Tap Transformer bundle
Name: TAP_TRANSFORMER_OUTPUT_FOR_BUNDLE
Description: This parameter enables the transformer tap. If enabled, the composite configured with TAP_COMPOSITE_FOR_TRANSFORMER_BUNDLE is used.
o Range: '0' or '1'
  '0' disabled
  '1' enabled
o Default: '0'

Transformer Composite
Name: TRANSFORMER_COMPOSITE
Description: Composite for the custom transformation/lookup/mapping of records. This composite implements the main business logic. The composite must be located in the <APPLICATION_NAMESPACE>.chainprocessor.transformer.custom namespace.
o Range: Name of the custom composite
o Default: 'DataProcessor'

Transformer Level-1-Split
Name: TRANSFORMER_LEVEL1_SPLIT
Description: Parameter to enable the level 1 split at the output of the Transformer. If enabled, the level1id attribute must be set based on the business logic to address the right context.
o Range: '0' or '1'
  '0' disabled (Variant A or C)
  '1' enabled

Transformer Lookup Schema
Name: TRANSFORMER_LOOKUP_SCHEMA
Description: Schema definition for the attributes filled by the Lookup operator. This type is added to READER_OUTPUT_SCHEMA.
o Range: custom stream type
o Default: 'TypesCustom.LookupType'

Transformer Output Type
Name: TRANSFORMER_OUTPUT_TYPE
Description: Parameter to select the output schema of the custom Transformer composite.
o Range: '0' or '1' or '2'
  '0' TableStream: one tuple contains a single table row and one hash code for dedup. If a record results in multiple table rows or in different tables, then several tuples must be sent by the Transformer.
  '1' TableStream with custom schema extension: TypesCustom.ExtendedTableStream is used and extends the table schema, for instance if lookup data is evaluated in a custom PostDedupProcessor or in the CustomContext.
  '2' RecordStream: TypesCustom.TransformedRecord is used; a) CHAIN_SINK_TYPE=1 or CHAIN_SINK_TYPE=2, b) a (custom) PostDedupProcessor is creating the table row tuples.
o Default: '2'

Transformer postcontext composite
Name: TRANSFORMER_POSTCONTEXT_COMPOSITE
Description: Optional custom post context data processor in the namespace <APPLICATION_NAMESPACE>.chainsink.custom.
o Range: composite name

Enable custom code
Name: ENABLE_CUSTOM_CODE
Description: Before starting to code in the custom namespace composites, this parameter needs to be set to '1'. By default the custom code is disabled and replaced by built-in sample code, for quick start purposes, to show the job functionality. This parameter is a global setting: if this parameter is disabled, all other custom code parameters are disabled (e.g. File ingestion file type customization).

o Range: '0' or '1'
  '0' custom code not used
  '1' custom code enabled
o Default: '0'

Dedup
Name: DEDUP_DISABLED
Description: This parameter can be used to disable the record deduplication. If disabled, no BloomFilter operator is integrated in the context composite.
o Range: '0' or '1'
  '0' deduplication enabled
  '1' deduplication disabled
o Default: '0'

Dedup Bloom days to keep
Name: DEDUP_BLOOM_DAYS_TO_KEEP
Description: This parameter defines the period in days for which the number of entries is configured. Older entries are removed at the cleanup schedule.
o Range: 1..n
o Default: '7'

Dedup Bloom probability
Name: DEDUP_BLOOM_P
Description: This parameter configures the BloomFilter probability option and sets the rate of acceptable false positives.
o Range: '0.1' or less
o Default: '0.001'

Context
Name: CONTEXT_DISABLED
Description: This parameter disables the processing context.
o Range: '0' or '1'
  '0' context enabled - Variant B or Variant C
  '1' context disabled - Variant A
o Default: '1'

Custom context days to keep
Name: CONTEXT_CUSTOM_DAYS_TO_KEEP
Description: This parameter defines the period in days for which the custom context is configured. Older entries are removed at the cleanup schedule.
o Range: 1..n
o Default: '1'

Context custom composite
Name: CONTEXT_CUSTOM_COMPOSITE_ENABLED
Description: This parameter enables the custom composite for the custom context logic.
o Range: '0' or '1'
  '0' custom context disabled (default)
  '1' custom context enabled
o Default: '0'

Chain Sink archive input files in date dir
Name: CHAIN_SINK_ARCHIVE_INPUT_FILES_IN_DATE_DIR
Description: If this option is enabled, the input files are moved to the archive/<date>/ directory.
o Range: '0' or '1'
  '0' disabled
  '1' enabled
o Default: '0'

Chain Sink Audit table file writer
Name: CHAIN_SINK_AUDIT_TABLE_WRITER_ENABLED
Description: Optional parameter to integrate a custom composite that writes audit table files. The custom audit table writer gets the statistic stream as input and can write additional table files based on the statistics.
o Range: '0' or '1'
  '0' disabled
  '1' enabled
o Default: '0'

Chain Sink Custom checkpoint files
Name: CHAIN_SINK_CUSTOM_CONTEXT_CHECKPOINT_FILES_DISABLED

Description: This optional parameter can be used to disable the writing of custom context checkpoint files. Attention: No custom context recovery after a crash is supported when the writing of these checkpoint files is disabled.
o Range: '0' or '1'
  '0' enables writing of file related checkpoint files
  '1' disables writing of file related checkpoint files
o Default: '0'

Chain Sink Dedup Checkpoint files
Name: CHAIN_SINK_DEDUP_CHECKPOINT_FILES_DISABLED
Description: This optional parameter can be used to disable the writing of dedup checkpoint files. Hashcode files are written to restore the BloomFilter state on job restart. Attention: No BloomFilter recovery after a crash is supported when the writing of these files is disabled.
o Range: '0' or '1'
  '0' enables writing of hash code files
  '1' disables writing of hash code files
o Default: '0'

Chain Sink File Writer custom composite
Name: CHAIN_SINK_FILE_WRITER_CUSTOM_COMPOSITE
Description: Custom file writer composite name in the namespace <APPLICATION_NAMESPACE>.chainsink.custom. The parameter is valid only if the parameter CHAIN_SINK_TYPE=2.
o Range: custom composite name
o Default: 'FileWriterCustom'

Chain Sink table files output
Name: CHAIN_SINK_TABLE_FILES_OUTPUT
Description: The table files are moved to the configured output directory or directories.
o Range: '0' or '1' or '2'
  '0' load/ (contains all table files)
  '1' load/<inputfilename>/ (each sub directory contains all table files belonging to the same input file)
  '2' load/<date>/ (each sub directory contains all table files processed on the same day)
o Default: '0'

Chain Sink target tables
Name: CHAIN_SINK_TARGET_TABLES
Description: This parameter configures the table names used in the TableFileWriter. For each table name one dedicated FileSink is generated. The table names must be set if CHAIN_SINK_TYPE=0.
o Range: table names, e.g. FCT.FCT_SAMPLE,FCT.FCT_OTHER

Chain Sink Type
Name: CHAIN_SINK_TYPE
Description: This parameter selects the chain sink file writer composite.
o Range: '0' or '1' or '2'
  '0' TableFileWriterCore (writes 1..n table files)
  '1' RecordFileWriterCore (writes 1 record output file)
  '2' CHAIN_SINK_FILE_WRITER_CUSTOM_COMPOSITE
o Default: '1'

Lookup Manager

DB disabled
Name: DBDISABLED
Description: The source of the data to be stored in the lookup repository can be a database or a set of CSV files. This parameter disables the DB requests in the Lookup Manager jobs. It is defined during the setup process of the application structure, see the Get started chapter, and it is used at compilation time.
o Range: '1' or '0'
  '1' This is the default value. The DB requests are disabled and no further parameters are required in the configuration file.
  '0' The DB requests are enabled and further DB parameter configurations are required. The Lookup Manager uses the ODBCRun operator from the Database Toolkit to send ODBC requests. All required Database Toolkit settings must be set as described in the IBM Knowledge Center - Database Toolkit. If DB requests are enabled in the setup phase of the Lookup Manager, then this parameter is set to 'false' and an example of the connections.xml file required by the Database Toolkit is delivered to the root of the project structure.
o Default: '1'
o Value to set: '1'

Enable input file
Name: LOOKUP_ENABLE_INPUT_FILE

Description: This parameter enables the CSV files as source for the creation of the lookup repository content. It is set during the setup process of the application structure, see the Get started chapter. The source CSV files are located in the data directory of the Lookup Manager job; the directory can be changed by the submission time parameter csvfilesdir. Figure 63 presents the configuration settings and the relationships to the input sources.

Figure 63: Enable input files as source overview

o Range: '1' or '0'
  '1' The CSV requests are enabled and no further parameters are required in the configuration file.
  '0' The CSV requests are disabled and no further parameters are required in the configuration file.
o Default: '1'

DB vendor
Name: DBVENDOR
Description: This parameter defines which database vendor is supported by the com.ibm.streams.db toolkit. The Lookup Manager supports two kinds of databases: DB2 and Oracle. The environment settings for the implementation and the driver support are required. Only one DB vendor can be selected, and the settings of the other vendor must not be set. This value is set as the value of the corresponding input parameter in the call of the ODBCRun operator.
o Range: 'DB2' or 'ORACLE'
  'DB2' This is the default value. The DB2 database is selected as the repository source. The STREAMS_ADAPTERS_ODBC_DB2 environment variable must be set.
  'ORACLE' The ORACLE database is selected as the repository source. The STREAMS_ADAPTERS_ODBC_ORACLE environment variable must be set.
o Default: 'DB2'
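For the DB2 case the environment of the user who builds and runs the Lookup Manager might be prepared roughly as follows; all paths are placeholders, and Table 6 together with the Database Toolkit documentation remains the authoritative reference:

export UNIXODBC_HOME=<path-to-unixodbc>
export ODBCSYSINI=<path-to-odbc.ini>
export STREAMS_ADAPTERS_ODBC_DB2=1
export STREAMS_ADAPTERS_ODBC_INCPATH=<db2-path>/include
export STREAMS_ADAPTERS_ODBC_LIBPATH=<db2-path>/lib64
export PATH=$UNIXODBC_HOME/bin:<path-to-db2>:$PATH
export LD_LIBRARY_PATH=$UNIXODBC_HOME/lib:$LD_LIBRARY_PATH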


More information

One Identity Manager Target System Synchronization Reference Guide

One Identity Manager Target System Synchronization Reference Guide One Identity Manager 8.0.1 Target System Synchronization Reference Copyright 2018 One Identity LLC. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The software

More information

Contents. Notes Mail Link Page 1

Contents. Notes Mail Link Page 1 Contents Contents... 1 About Notes Mail Link 6... 2 System requirements... 2 Installing Notes Mail Link - Overview... 3 Install Notes Mail Link... 3 Upgrade to newer version of Notes Mail Link... 3 New

More information

Release Notes ClearSQL (build 181)

Release Notes ClearSQL (build 181) August 14, 2018 Release Notes ClearSQL 7.1.2 (build 181) NEW FEATURES NEW: Exclusion of code lines from Flowcharts. It is now possible to exclude lines of code from a Flowchart diagram by defining exclusion

More information

Data Express 4.0. Data Subset Extraction

Data Express 4.0. Data Subset Extraction Data Express 4.0 Data Subset Extraction Micro Focus The Lawn 22-30 Old Bath Road Newbury, Berkshire RG14 1QN UK http://www.microfocus.com Copyright Micro Focus 2009-2014. All rights reserved. MICRO FOCUS,

More information

Adding Support For a New Resource Manager

Adding Support For a New Resource Manager Greg Watson PTP User/Developer Meeting, Chicago, September 2012 Adding Support For a New Resource Manager Introduction Based on The (JAXB) Configurable Resource Manager for PTP by Albert L. Rossi http://wiki.eclipse.org/images/2/28/jaxbdemo.pdf

More information

Logi Ad Hoc Reporting Management Console Usage Guide

Logi Ad Hoc Reporting Management Console Usage Guide Logi Ad Hoc Reporting Management Console Usage Guide Version 12.1 July 2016 Page 2 Contents Introduction... 5 Target Audience... 5 System Requirements... 6 Components... 6 Supported Reporting Databases...

More information

Oracle Fusion Middleware

Oracle Fusion Middleware Oracle Fusion Middleware Creating Domains Using the Configuration Wizard 11g Release 1 (10.3.4) E14140-04 January 2011 This document describes how to use the Configuration Wizard to create, update, and

More information

EDIT 2014 Users Manual

EDIT 2014 Users Manual EUROPEAN COMMISSION EUROSTAT Directorate B: Methodology; corporate statistical and IT Services Unit B-3: IT for statistical production EDIT 2014 Users Manual Date: 03.12.2014 Version: 01.20 Commission

More information

Tasktop Sync - Quick Start Guide. Tasktop Sync - Quick Start Guide

Tasktop Sync - Quick Start Guide. Tasktop Sync - Quick Start Guide Tasktop Sync - Quick Start Guide 1 Contents Tasktop Sync Server... 4 Minimum Requirements... 4 Sync installer and License... 5 Pre-Sync Installation Requirements... 5 Tasktop Sync Installation on Windows...

More information

Overview Provides an overview of the software application functions.

Overview Provides an overview of the software application functions. Optical Disc Archive File Manager Help Overview Provides an overview of the software application functions. Operation Screens Describes the name and function of the software application screens. Operating

More information

Application Notes for Installing and Configuring Avaya Control Manager Enterprise Edition in a High Availability mode.

Application Notes for Installing and Configuring Avaya Control Manager Enterprise Edition in a High Availability mode. Application Notes for Installing and Configuring Avaya Control Manager Enterprise Edition in a High Availability mode. Abstract This Application Note describes the steps required for installing and configuring

More information

EMC Documentum Composer

EMC Documentum Composer EMC Documentum Composer Version 6.5 SP2 User Guide P/N 300-009-462 A01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Copyright 2008 2009 EMC Corporation. All

More information

Managing Service Requests

Managing Service Requests CHAPTER 8 This chapter describes how to manage Prime Fulfillment service requests through the Service Request Manager window. It contains the following sections: Accessing the Service Request Manager Window,

More information

ControlPoint. for Office 365. User Guide VERSION 7.6. August 06,

ControlPoint. for Office 365. User Guide VERSION 7.6. August 06, ControlPoint for Office 365 User Guide VERSION 7.6 August 06, 2018 www.metalogix.com info@metalogix.com 202.609.9100 Copyright International GmbH., 2008-2018 All rights reserved. No part or section of

More information

Contents Using the Primavera Cloud Service Administrator's Guide... 9 Web Browser Setup Tasks... 10

Contents Using the Primavera Cloud Service Administrator's Guide... 9 Web Browser Setup Tasks... 10 Cloud Service Administrator's Guide 15 R2 March 2016 Contents Using the Primavera Cloud Service Administrator's Guide... 9 Web Browser Setup Tasks... 10 Configuring Settings for Microsoft Internet Explorer...

More information

GUARD1 PLUS Documentation. Version TimeKeeping Systems, Inc. GUARD1 PLUS and THE PIPE are registered trademarks

GUARD1 PLUS Documentation. Version TimeKeeping Systems, Inc. GUARD1 PLUS and THE PIPE are registered trademarks GUARD1 PLUS Documentation Version 3.02 2000-2005 TimeKeeping Systems, Inc. GUARD1 PLUS and THE PIPE are registered trademarks i of TimeKeeping Systems, Inc. Table of Contents Welcome to Guard1 Plus...

More information

A Examcollection.Premium.Exam.47q

A Examcollection.Premium.Exam.47q A2090-303.Examcollection.Premium.Exam.47q Number: A2090-303 Passing Score: 800 Time Limit: 120 min File Version: 32.7 http://www.gratisexam.com/ Exam Code: A2090-303 Exam Name: Assessment: IBM InfoSphere

More information

Version 2 Release 2. IBM i2 Enterprise Insight Analysis Upgrade Guide IBM SC

Version 2 Release 2. IBM i2 Enterprise Insight Analysis Upgrade Guide IBM SC Version 2 Release 2 IBM i2 Enterprise Insight Analysis Upgrade Guide IBM SC27-5091-00 Note Before using this information and the product it supports, read the information in Notices on page 35. This edition

More information

Function names can be specified with winidea syntax for qualified names, if multiple download files and file static functions are tested.

Function names can be specified with winidea syntax for qualified names, if multiple download files and file static functions are tested. _ RELEASE NOTES testidea 9.12.x 9.12.14 (28.3.2012) Qualified function names Function names can be specified with winidea syntax for qualified names, if multiple download files and file static functions

More information

This course provides students with the knowledge and skills to administer Windows Server 2012.

This course provides students with the knowledge and skills to administer Windows Server 2012. MOC 20411C: Administering Windows Server 2012 Course Overview This course provides students with the knowledge and skills to administer Windows Server 2012. Course Introduction Course Introduction 6m Module

More information

SystemDesk - EB tresos Studio - TargetLink Workflow Descriptions

SystemDesk - EB tresos Studio - TargetLink Workflow Descriptions SystemDesk - EB tresos Studio - TargetLink Workflow Descriptions Usable with Versions: dspace SystemDesk 4.1 EB tresos Studio 13 or 14 TargetLink 3.4 or TargetLink 3.5 (with patches) February, 2014 1 /

More information

Switch 2018 u3. What s New In Switch 2018u3.

Switch 2018 u3. What s New In Switch 2018u3. Enfocus BVBA Kortrijksesteenweg 1095 9051 Gent Belgium +32 (0)9 216 98 01 info@enfocus.com Switch 2018 u3 What s New In Switch 2018u3. SWITCH INSIGHT DASHBOARD... 2 SWITCH DESIGNER... 3 IMPROVED TOOLTIP

More information

DB2 for z/os Stored Procedure support in Data Server Manager

DB2 for z/os Stored Procedure support in Data Server Manager DB2 for z/os Stored Procedure support in Data Server Manager This short tutorial walks you step-by-step, through a scenario where a DB2 for z/os application developer creates a query, explains and tunes

More information

SAS Viya 3.4 Administration: Logging

SAS Viya 3.4 Administration: Logging SAS Viya 3.4 Administration: Logging Logging: Overview............................................................................. 1 Logging: How To...............................................................................

More information

IBM. Bulk Load Utilities Guide. IBM Emptoris Contract Management SaaS

IBM. Bulk Load Utilities Guide. IBM Emptoris Contract Management SaaS IBM Emptoris Contract Management IBM Bulk Load Utilities Guide 10.1.2 SaaS IBM Emptoris Contract Management IBM Bulk Load Utilities Guide 10.1.2 SaaS ii IBM Emptoris Contract Management: Bulk Load Utilities

More information

EMC Documentum Composer

EMC Documentum Composer EMC Documentum Composer Version 6.0 SP1.5 User Guide P/N 300 005 253 A02 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748 9103 1 508 435 1000 www.emc.com Copyright 2008 EMC Corporation. All

More information

Using the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1

Using the VMware vcenter Orchestrator Client. vrealize Orchestrator 5.5.1 Using the VMware vcenter Orchestrator Client vrealize Orchestrator 5.5.1 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have comments

More information

Lab 9: Global Configurations

Lab 9: Global Configurations Lab 9: Global Configurations Objectives After completing this lab, you will be able to: Connect to an AM Stream that is under Global Configuration Management (the UK variant) Create a new US variant based

More information

Online Demo Guide. Barracuda PST Enterprise. Introduction (Start of Demo) Logging into the PST Enterprise

Online Demo Guide. Barracuda PST Enterprise. Introduction (Start of Demo) Logging into the PST Enterprise Online Demo Guide Barracuda PST Enterprise This script provides an overview of the main features of PST Enterprise, covering: 1. Logging in to PST Enterprise 2. Client Configuration 3. Global Configuration

More information

News in RSA-RTE 10.1 updated for sprint Mattias Mohlin, November 2017

News in RSA-RTE 10.1 updated for sprint Mattias Mohlin, November 2017 News in RSA-RTE 10.1 updated for sprint 2017.46 Mattias Mohlin, November 2017 Overview Now based on Eclipse Neon.3 (4.6.3) Many general improvements since Eclipse Mars Contains everything from RSARTE 10

More information

Elixir Schedule Designer User Manual

Elixir Schedule Designer User Manual Elixir Schedule Designer User Manual Release 8.4.1 Elixir Technology Pte Ltd Elixir Schedule Designer User Manual: Release 8.4.1 Elixir Technology Pte Ltd Published 2012 Copyright 2012 Elixir Technology

More information

Test/Debug Guide. Reference Pages. Test/Debug Guide. Site Map Index

Test/Debug Guide. Reference Pages. Test/Debug Guide. Site Map Index Site Map Index HomeInstallationStartAuthoringStreamSQLTest/DebugAPI GuideAdminAdaptersSamplesStudio GuideReferences Current Location: Home > Test/Debug Guide Test/Debug Guide The following topics explain

More information

Automation Engine. Getting Started

Automation Engine. Getting Started Getting Started 05-2017 Contents 1. Installing Server and Clients... 4 2. Changing the Language used in the Pilot... 5 3. Starting or Updating the Pilot... 6 4. The Pilot's Main Window... 7 5. Concepts

More information

for Q-CHECKER Text version 15-Feb-16 4:49 PM

for Q-CHECKER Text version 15-Feb-16 4:49 PM Q-MONITOR 5.4.X FOR V5 for Q-CHECKER USERS GUIDE Text version 15-Feb-16 4:49 PM Orientation Symbols used in the manual For better orientation in the manual the following symbols are used: Warning symbol

More information

Pearson System of Courses

Pearson System of Courses Pearson System of Courses Deploy with Windows Imaging Last updated: June 2018 Copyright 2018 Pearson, Inc. or its affiliates. All rights reserved. Table of Contents Deployment Process Overview 3 Prerequisites

More information

WPS Workbench. user guide. "To help guide you through using the WPS user interface (Workbench) to create, edit and run programs"

WPS Workbench. user guide. To help guide you through using the WPS user interface (Workbench) to create, edit and run programs WPS Workbench user guide "To help guide you through using the WPS user interface (Workbench) to create, edit and run programs" Version: 3.1.7 Copyright 2002-2018 World Programming Limited www.worldprogramming.com

More information

Workbench User's Guide

Workbench User's Guide IBM Initiate Workbench User's Guide Version9Release7 SC19-3167-06 IBM Initiate Workbench User's Guide Version9Release7 SC19-3167-06 Note Before using this information and the product that it supports,

More information

Administrator s Guide

Administrator s Guide Administrator s Guide 1995 2011 Open Systems Holdings Corp. All rights reserved. No part of this manual may be reproduced by any means without the written permission of Open Systems, Inc. OPEN SYSTEMS

More information

News in RSA-RTE 10.1 updated for sprint Mattias Mohlin, January 2018

News in RSA-RTE 10.1 updated for sprint Mattias Mohlin, January 2018 News in RSA-RTE 10.1 updated for sprint 2018.03 Mattias Mohlin, January 2018 Overview Now based on Eclipse Neon.3 (4.6.3) Many general improvements since Eclipse Mars Contains everything from RSARTE 10

More information

Using the VMware vrealize Orchestrator Client

Using the VMware vrealize Orchestrator Client Using the VMware vrealize Orchestrator Client vrealize Orchestrator 7.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by

More information

Paragon Exact Image. User Manual CONTENTS. Introduction. Key Features. Installation. Package Contents. Minimum System Requirements.

Paragon Exact Image. User Manual CONTENTS. Introduction. Key Features. Installation. Package Contents. Minimum System Requirements. Paragon Exact Image User Manual CONTENTS Introduction Key Features Installation Package Contents Minimum System Requirements Basic Concepts Backup Operations Scheduling Interface Overview General Layout

More information

file://c:\documents and Settings\degrysep\Local Settings\Temp\~hh607E.htm

file://c:\documents and Settings\degrysep\Local Settings\Temp\~hh607E.htm Page 1 of 18 Trace Tutorial Overview The objective of this tutorial is to acquaint you with the basic use of the Trace System software. The Trace System software includes the following: The Trace Control

More information

OpenText RightFax 10.6

OpenText RightFax 10.6 OpenText RightFax 10.6 Connector for IBM Filenet Administrator Guide Edition OpenText RightFax 10.6 Connector for IBM Filenet. This document was last updated January 22, 2014. Trademarks OpenText is a

More information

Introduction to Cognos

Introduction to Cognos Introduction to Cognos User Handbook 7800 E Orchard Road, Suite 280 Greenwood Village, CO 80111 Table of Contents... 3 Logging In To the Portal... 3 Understanding IBM Cognos Connection... 4 The IBM Cognos

More information

Veritas NetBackup for Lotus Notes Administrator's Guide

Veritas NetBackup for Lotus Notes Administrator's Guide Veritas NetBackup for Lotus Notes Administrator's Guide for UNIX, Windows, and Linux Release 8.0 Veritas NetBackup for Lotus Notes Administrator's Guide Document version: 8.0 Legal Notice Copyright 2016

More information

WebsitePanel User Guide

WebsitePanel User Guide WebsitePanel User Guide User role in WebsitePanel is the last security level in roles hierarchy. Users are created by reseller and they are consumers of hosting services. Users are able to create and manage

More information

ithenticate User Guide Getting Started Folders Managing your Documents The Similarity Report Settings Account Information

ithenticate User Guide Getting Started Folders Managing your Documents The Similarity Report Settings Account Information ithenticate User Guide Getting Started Folders Managing your Documents The Similarity Report Settings Account Information 1 Getting Started Whether you are a new user or a returning one, to access ithenticate

More information

Community Development System Administrator s Guide new world ERP Community Development

Community Development System Administrator s Guide new world ERP Community Development Community Development 2017.1 System Administrator s Guide new world ERP Community Development 2017 Tyler Technologies, Inc. Data used to illustrate the reports and screens may include names of individuals,

More information

SIGNATUS USER MANUAL VERSION 2.3

SIGNATUS USER MANUAL VERSION 2.3 SIGNATUS USER MANUAL VERSION 2.3 CONTENTS 1 INTRODUCTION... 3 1.1 Launching Signatus... 4 1.1.1 Launching Signatus for the 1 st time... 4 1.2 Main Menu... 6 2 SETTINGS... 7 3 OPEN DOCUMENT... 8 3.1 Form

More information

Case Management Implementation Guide

Case Management Implementation Guide Case Management Implementation Guide Salesforce, Winter 18 @salesforcedocs Last updated: November 30, 2017 Copyright 2000 2017 salesforce.com, inc. All rights reserved. Salesforce is a registered trademark

More information

WebStudio User Guide. OpenL Tablets BRMS Release 5.18

WebStudio User Guide. OpenL Tablets BRMS Release 5.18 WebStudio User Guide OpenL Tablets BRMS Release 5.18 Document number: TP_OpenL_WS_UG_3.2_LSh Revised: 07-12-2017 OpenL Tablets Documentation is licensed under a Creative Commons Attribution 3.0 United

More information

OnCommand Insight 7.2

OnCommand Insight 7.2 OnCommand Insight 7.2 Planning Guide for the Java UI March 2016 215-10395_A0 doccomments@netapp.com Table of Contents 3 Contents OnCommand Insight Plan features... 5 OnCommand Insight product portfolio...

More information

VMware Mirage Web Manager Guide

VMware Mirage Web Manager Guide Mirage 5.3 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document,

More information

IMPLEMENTING DATA.COM CLEAN FOR ACCOUNTS, CONTACTS, AND LEADS

IMPLEMENTING DATA.COM CLEAN FOR ACCOUNTS, CONTACTS, AND LEADS IMPLEMENTING DATA.COM CLEAN FOR ACCOUNTS, CONTACTS, AND LEADS Data.com Clean Overview In addition to finding and adding new accounts, contacts, and leads, Data.com cleans your existing Salesforce data

More information

The Connector. Version 1.2 Microsoft Project to Atlassian JIRA Connectivity. User Manual

The Connector.  Version 1.2 Microsoft Project to Atlassian JIRA Connectivity. User Manual The Connector Version 1.2 Microsoft Project to Atlassian JIRA Connectivity User Manual Ecliptic Technologies, Inc. Copyright 2008 www.the-connector.com Page 1 of 86 Copyright and Disclaimer All rights

More information

VMware vcenter Log Insight Administration Guide

VMware vcenter Log Insight Administration Guide VMware vcenter Log Insight Administration Guide vcenter Log Insight 2.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Contents. Common Site Operations. Home actions. Using SharePoint

Contents. Common Site Operations. Home actions. Using SharePoint This is a companion document to About Share-Point. That document describes the features of a SharePoint website in as much detail as possible with an emphasis on the relationships between features. This

More information

Installation Guide - Mac

Installation Guide - Mac Kony Visualizer Enterprise Installation Guide - Mac Release V8 SP3 Document Relevance and Accuracy This document is considered relevant to the Release stated on this title page and the document version

More information

Dynamics 365 for BPO Dynamics 365 for BPO

Dynamics 365 for BPO Dynamics 365 for BPO Dynamics 365 for BPO The Solution is designed to address most of the day to day process functionalities in case management of D365 MICROSOFT LABS 1 Table of Contents 1. Overview... 4 2. How to Verify the

More information

Apptix Online Backup by Mozy User Guide

Apptix Online Backup by Mozy User Guide Apptix Online Backup by Mozy User Guide 1.10.1.2 Contents Chapter 1: Overview...5 Chapter 2: Installing Apptix Online Backup by Mozy...7 Downloading the Apptix Online Backup by Mozy Client...7 Installing

More information

Text version 15-Aug-12. for Q-CHECKER V4, V5 and V6

Text version 15-Aug-12. for Q-CHECKER V4, V5 and V6 Text version 15-Aug-12 Q-MONITOR V4 for Q-CHECKER V4, V5 and V6 USERS GUIDE Orientation Symbols used in the manual For better orientation in the manual the following symbols are used: Warning symbol Tip

More information

This document is intended for users of UniBasic. Copyright 1998 Dynamic Concepts, Inc. (DCI). All rights reserved.

This document is intended for users of UniBasic. Copyright 1998 Dynamic Concepts, Inc. (DCI). All rights reserved. Dynamic Concepts Incorporated (DCI) has prepared this document for use by DCI personnel, licensees, and authorized representatives. The material contained herein shall not be reproduced in whole or in

More information

From: Sudarshan N Raghavan (770)

From: Sudarshan N Raghavan (770) Spectrum Software, Inc. 11445 Johns Creek Pkwy. Suite 300 Duluth, GA 30097 www.spectrumscm.com Subject: SpectrumSCM Plugin for the Eclipse Platform Original Issue Date: February 2 nd, 2005 Latest Update

More information

EDAConnect-Dashboard User s Guide Version 3.4.0

EDAConnect-Dashboard User s Guide Version 3.4.0 EDAConnect-Dashboard User s Guide Version 3.4.0 Oracle Part Number: E61758-02 Perception Software Company Confidential Copyright 2015 Perception Software All Rights Reserved This document contains information

More information

SIGNATUS USER MANUAL VERSION 3.7

SIGNATUS USER MANUAL VERSION 3.7 SIGNATUS USER MANUAL VERSION 3.7 CONTENTS 1 INTRODUCTION... 3 1.1 Launching SIGNATUS... 4 1.1.1 Update your SIGNATUS License... 4 1.2 Main Menu... 6 2 SETTINGS OVERVIEW... 7 3 OPEN DOCUMENT... 8 3.1 Form

More information

Crystal Reports (Custom Reports)

Crystal Reports (Custom Reports) Crystal Reports (Custom Reports) Getting Started The Crystal Reports Module is Option #3 in the Reports Menu. Since not everyone needs the reporting capabilities of this new module, it does not come pre-installed

More information

IBM Notes Client V9.0.1 Reference Guide

IBM Notes Client V9.0.1 Reference Guide IBM Notes Client V9.0.1 Reference Guide Revised 05/20/2016 1 Accessing the IBM Notes Client IBM Notes Client V9.0.1 Reference Guide From your desktop, double-click the IBM Notes icon. Logging in to the

More information

Munis. Using Workflow Version For more information, visit

Munis. Using Workflow Version For more information, visit Munis Using Workflow Version 10.5 For more information, visit www.tylertech.com. TABLE OF CONTENTS Using Workflow... 3 Workflow User Attributes... 6 Workflow Settings... 8 Approval Aging Tab... 8 Workflow

More information

CLIQ Web Manager. User Manual. The global leader in door opening solutions V 6.1

CLIQ Web Manager. User Manual. The global leader in door opening solutions V 6.1 CLIQ Web Manager User Manual V 6.1 The global leader in door opening solutions Program version: 6.1 Document number: ST-003478 Date published: 2016-03-31 Language: en-gb Table of contents 1 Overview...9

More information

Server Edition USER MANUAL. For Mac OS X

Server Edition USER MANUAL. For Mac OS X Server Edition USER MANUAL For Mac OS X Copyright Notice & Proprietary Information Redstor Limited, 2016. All rights reserved. Trademarks - Mac, Leopard, Snow Leopard, Lion and Mountain Lion are registered

More information

HP ALM Overview. Exercise Outline. Administration and Customization Lab Guide

HP ALM Overview. Exercise Outline. Administration and Customization Lab Guide HP ALM 11.00 Administration and Customization Lab Guide Overview This Lab Guide contains the exercises for Administration and Customization of HP ALM 11 Essentials training. The labs are designed to enhance

More information

Modern Requirements4TFS 2018 Release Notes

Modern Requirements4TFS 2018 Release Notes Modern Requirements4TFS 2018 Release Notes Modern Requirements 3/7/2018 Table of Contents 1. INTRODUCTION... 3 2. SYSTEM REQUIREMENTS... 3 3. APPLICATION SETUP... 3 GENERAL... 4 1. FEATURES... 4 2. ENHANCEMENT...

More information

WhatsConfigured for WhatsUp Gold 2016 User Guide

WhatsConfigured for WhatsUp Gold 2016 User Guide WhatsConfigured for WhatsUp Gold 2016 User Guide Contents Welcome to WhatsConfigured 1 What is WhatsConfigured? 1 Finding more information and updates 1 Sending feedback 2 Deploying WhatsConfigured 3 STEP

More information

Installation Guide - Windows

Installation Guide - Windows Kony Visualizer Enterprise Installation Guide - Windows Release V8 SP3 Document Relevance and Accuracy This document is considered relevant to the Release stated on this title page and the document version

More information

News in RSA-RTE 10.1 updated for sprint Mattias Mohlin, July 2017

News in RSA-RTE 10.1 updated for sprint Mattias Mohlin, July 2017 News in RSA-RTE 10.1 updated for sprint 2017.28 Mattias Mohlin, July 2017 Overview Now based on Eclipse Neon.3 (4.6.3) Many general improvements since Eclipse Mars Contains everything from RSARTE 10 and

More information

PTC Integrity Integration With Microsoft Visual Studio (SDK)

PTC Integrity Integration With Microsoft Visual Studio (SDK) PTC Integrity Integration With Microsoft Visual Studio (SDK) PTC provides a number of integrations for Integrated Development Environments (IDEs). IDE integrations allow you to access the workflow and

More information

EMC Disk Library Automated Tape Caching Feature

EMC Disk Library Automated Tape Caching Feature EMC Disk Library Automated Tape Caching Feature A Detailed Review Abstract This white paper details the EMC Disk Library configuration and best practices when using the EMC Disk Library Automated Tape

More information

EMC Documentum Composer

EMC Documentum Composer EMC Documentum Composer Version 6 SP1 User Guide P/N 300 005 253 A01 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748 9103 1 508 435 1000 www.emc.com Copyright 2008 EMC Corporation. All rights

More information

IBM Content Manager for iseries. Messages and Codes. Version 5.1 SC

IBM Content Manager for iseries. Messages and Codes. Version 5.1 SC IBM Content Manager for iseries Messages and Codes Version 5.1 SC27-1137-00 IBM Content Manager for iseries Messages and Codes Version 5.1 SC27-1137-00 Note Before using this information and the product

More information

Service Manager. Ops Console On-Premise User Guide

Service Manager. Ops Console On-Premise User Guide Service Manager powered by HEAT Ops Console On-Premise User Guide 2017.2.1 Copyright Notice This document contains the confidential information and/or proprietary property of Ivanti, Inc. and its affiliates

More information

Administrator Supplement: SofTrack for Bentley Control Guide

Administrator Supplement: SofTrack for Bentley Control Guide Administrator Supplement: Page 1 of 26 Technical Support: 512 372 8991 Ext. 2 SOFTRACK FOR BENTLEY CONTROL GUIDE... 1 ABOUT THIS GUIDE... 3 CONTACTING TECHNICAL SUPPORT... 3 SETUP RECOMMENDATIONS... 4

More information