Information Management


Information Management ebook
See inside for 40% off offer
Compliments of IBM Press

Special offer: IBM Press is excited to have important new DB2 and Information Management books now available! For a limited time, we are offering a selection of our new and classic Information Management titles, including all of the books excerpted in this ebook, at a 40% discount. To order books at the 40% discount, you can follow this link to a page that lists all of the books for sale, as well as some related podcasts and articles from our esteemed authors. Also, if you click through any of the links embedded in this ebook, the discount will automatically appear when you place a book in your shopping cart.

Information Management ebook

Understanding DB2, Second Edition
Raul F. Chong, Xiaomei Wang, Michael Dang, Dwaine R. Snow
Chapter 2: DB2 at a Glance: The Big Picture

DB2 9 for Linux, UNIX, and Windows, Sixth Edition
George Baklarz, Paul Zikopoulos
Chapter 8: pureXML Storage Engine

Understanding DB2 9 Security
Rebecca Bond, Kevin Yeung-Kuen See, Carmen Ka Man Wong, Yuk-Kuen Henry Chan
Chapter 11: Database Security: Keeping It Current

DB2 SQL PL
Zamil Janmohamed, Clara Liu, Drew Bradstock, Raul Chong, Michael Gao, Fraser McArthur, Paul Yip
Chapter 3: Overview of SQL PL Language Elements

Mining the Talk
Scott Spangler, Jeffrey Kreulen
Chapter 5: Mining to Improve Innovation

An Introduction to IMS
Dean Meltz, Rick Long, Mark Harrington, Robert Hain, Geoff Nicholls
Chapter 18: Application Programming in Java

More Titles of Interest Offered at 40% off
Safari Books Online
More Information

Copyright 2008 by Pearson Education, Inc.
Published by:
Pearson Education
800 East 96th Street
Indianapolis, IN USA

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and IBM Press was aware of a trademark claim, the designations have been printed with initial capital letters or all capitals.

NOT FOR RESALE


Understanding DB2: Learning Visually with Examples, Second Edition
Raul F. Chong, Xiaomei Wang, Michael Dang, Dwaine R. Snow

The Easy, Visual Way to Master IBM DB2 for Linux, UNIX, and Windows. Fully Updated for Version 9.5.

IBM DB2 9 and DB2 9.5 provide breakthrough capabilities for providing Information on Demand, implementing Web services and Service Oriented Architecture, and streamlining information management. Understanding DB2: Learning Visually with Examples, Second Edition, is the easiest way to master the latest versions of DB2 and apply their full power to your business challenges.

Written by four IBM DB2 experts, this book introduces key concepts with dozens of examples drawn from the authors' experience working with DB2 in enterprise environments. Thoroughly updated for DB2 9.5, it covers new innovations ranging from manageability to performance and XML support to API integration. Each concept is presented with easy-to-understand screenshots, diagrams, charts, and tables. This book is for everyone who works with DB2: database administrators, system administrators, developers, and consultants. With hundreds of well-designed review questions and answers, it will also help professionals prepare for the IBM DB2 Certification Exams 730, 731, or 736.

Coverage includes:
- Choosing the right version of DB2 for your needs
- Installing and configuring DB2
- Understanding the DB2 environment, instances, and databases
- Establishing client and server connectivity
- Working with database objects
- Utilizing breakthrough pureXML technology, which provides for native XML support
- Mastering administration, maintenance, performance optimization, troubleshooting, and recovery
- Understanding improvements in the DB2 process, memory, and storage models
- Implementing effective database security
- Leveraging the power of SQL and XQuery

Hardcover, 1056 pages

Table of Contents
Preface
Acknowledgments
About the Authors
1. Introduction to DB2
2. DB2 at a Glance: The Big Picture
3. Installing DB2
4. Using the DB2 Tools
5. Understanding the DB2 Environment, DB2 Instances, and Databases
6. Configuring Client and Server Connectivity
7. Working with Database Objects
8. The DB2 Storage Model
9. Leveraging the Power of SQL
10. Mastering the DB2 pureXML Support
11. Implementing Security
12. Understanding Concurrency and Locking
13. Maintaining Data
14. Developing Database Backup and Recovery Solutions
15. The DB2 Process Model
16. The DB2 Memory Model
17. Database Performance Considerations
18. Diagnosing Problems
A. Solutions to the Review Questions
B. Use of Uppercase versus Lowercase in DB2
C. IBM Servers
D. Using the DB2 System Catalog Tables
Resources
Index

CHAPTER 2
DB2 at a Glance: The Big Picture

This chapter is like a book within a book: It covers a vast range of topics that will provide you with not only a good introduction to DB2 core concepts and components, but also an understanding of how these components work together and where they fit in the DB2 scheme of things. After reading this chapter you should have a general knowledge of the DB2 architecture that will help you better understand the topics discussed in the next chapters. Subsequent chapters will revisit and expand on what has been discussed here.

In this chapter you will learn about the following:
- SQL statements, XQuery statements, and DB2 commands
- DB2 tools
- The DB2 environment
- Federation
- The Database Partitioning Feature (DPF)

You interact with DB2 by issuing SQL statements, XQuery statements, and DB2 commands. You can issue these statements and commands from an application, or you can use the DB2 tools. The DB2 tools interact with the DB2 environment by passing these statements and commands to the DB2 server for processing. This is shown in Figure 2.1.

2.1 SQL STATEMENTS, XQUERY STATEMENTS, AND DB2 COMMANDS

SQL is the standard language used for retrieving and modifying data in a relational database. An SQL council formed by several industry-leading companies determines the standard for these SQL statements, and the different relational database management systems (RDBMSs) follow these standards to make it easier for customers to use their databases. Recent additions to the standard include XML extension functions, also referred to as SQL/XML extension functions. This section introduces the different categories of SQL statements and presents some examples.
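As a quick taste of the three kinds of interaction, here is a minimal sketch of one statement or command of each kind, entered from an operating system prompt (the table T1 and its XML column INFO are hypothetical names used only for illustration):

    db2 "SELECT * FROM t1"                        (an SQL statement)
    db2 "xquery db2-fn:xmlcolumn('T1.INFO')"      (an XQuery statement)
    db2 get dbm cfg                               (a DB2 command)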

Figure 2.1 Overview of DB2. (The figure shows the DB2 tools passing SQL/XML and XQuery statements, DB2 system commands such as db2set, db2start, db2stop, db2ilist, db2icrt, and db2idrop, and DB2 CLP commands such as db2 update dbm cfg, catalog db, list node directory, create database, list applications, and list tablespaces to the DB2 environment: instances with their profile registries, configuration files, and directories, and the databases they manage, each with its configuration file, logs, buffer pools, table spaces, tables, and indexes.)

The XML Query Language (XQuery) specification is a language used for querying XML documents. XQuery includes the XML Path Language (XPath), which is also used to query XML documents. The XQuery specification is managed by the W3C, a consortium formed by several industry-leading companies and academia. The specification has not been fully completed; however, several of the working drafts of the specification have now reached W3C Candidate Recommendation status, bringing the specification as a whole much closer to Final Recommendation. In this section we will provide a few simple XQuery and XPath examples.

DB2 commands are directives specific to DB2 that allow you to perform tasks against a DB2 server. There are two types of DB2 commands:
- System commands
- Command Line Processor (CLP) commands

NOTE: SQL statements and DB2 commands can be specified in uppercase or lowercase. However, in Linux or UNIX some of the commands are case-sensitive. XQuery statements are case-sensitive. See Appendix B for a detailed explanation of the use of uppercase versus lowercase in DB2.
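To make the case-sensitivity note concrete, here is a minimal sketch (table T1 is hypothetical). In SQL, keywords and unquoted identifiers are folded to uppercase, so the first two commands are equivalent; XPath and XQuery expressions, in contrast, must match the case of the XML document exactly:

    db2 "SELECT * FROM t1"
    db2 "select * from T1"     (equivalent to the previous command)
    /medinfo/patient           (matches elements named exactly medinfo and patient)
    /MEDINFO/PATIENT           (does not match; XQuery is case-sensitive)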

2.1.1 SQL Statements

SQL statements allow you to work with the relational and XML data stored in your database. The statements are applied against the database you are connected to, not against the entire DB2 environment. There are three different classes of SQL statements.

Data Definition Language (DDL) statements create, modify, or drop database objects. For example

CREATE INDEX ix1 ON t1 (salary)
ALTER TABLE t1 ADD hiredate DATE
DROP VIEW view1

Data Manipulation Language (DML) statements insert, update, delete, or retrieve data from the database objects. For example

INSERT INTO t1 VALUES (10, 'Johnson', 'Peter')
UPDATE t1 SET lastname = 'Smith' WHERE firstname = 'Peter'
DELETE FROM t1
SELECT * FROM t1 WHERE salary > 45000
SELECT lastname FROM patients WHERE xmlexists('$p/address[zipcode="95141"]' passing PATIENTS.INFO as "p") AND salary > 45000

Data Control Language (DCL) statements grant or revoke privileges or authorities to perform database operations on the objects in your database. For example

GRANT select ON employee TO peter
REVOKE update ON employee FROM paul

NOTE: SQL statements are commonly referred to simply as statements in most RDBMS books. For detailed syntax of SQL statements, see the DB2 SQL Reference manual.

2.1.2 XQuery Statements

XQuery statements allow you to work with relational and XML data. The statements are applied against the database you are connected to, not against the entire DB2 environment. There are two main ways to work with XML documents in DB2: XPath and XQuery FLWOR expressions.

XPath is part of XQuery. Working with XPath is like working with the change directory (cd) command in Linux or MS-DOS. Using the cd operating system command, one can go from one subdirectory to another subdirectory in the directory tree. Similarly, by using the slash (/) in XPath, you can navigate a tree, which represents the

XML document. For example, Figure 2.2 shows an XML document in both serialized format and parsed hierarchical format. In serialized format, the document looks like this:

<medinfo>
 <patient type="c1">
  <bloodtype>B</bloodtype>
  <DNA>
   <geneX>555555</geneX>
   <geneY>...</geneY>
   <geneZ>...</geneZ>
  </DNA>
 </patient>
 <patient type="p2">
  <bloodtype>O</bloodtype>
  <DNA>
   <geneX>111111</geneX>
   <geneY>...</geneY>
   <geneZ>...</geneZ>
  </DNA>
 </patient>
</medinfo>

Figure 2.2 An XML document in serialized and parsed hierarchical format. (In the parsed hierarchical format, the same document is drawn as a tree: a medinfo root with two patient nodes, each carrying its type attribute and its bloodtype and DNA children, the DNA node in turn holding geneX, geneY, and geneZ.)

Table 2.1 shows some XPath expressions and the corresponding values obtained using the XML document in Figure 2.2 as input.

Table 2.1 Sample XPath Expressions

XPath expression              Value to be returned
/medinfo/patient/@type        c1
                              p2
/medinfo/patient/bloodtype    <bloodtype>B</bloodtype>
                              <bloodtype>O</bloodtype>
/medinfo/patient/DNA/geneX    <geneX>555555</geneX>
                              <geneX>111111</geneX>

XQuery FLWOR expressions. FLWOR stands for:
- FOR: iterates through an XML sequence, binds variables to items
- LET: assigns an XML sequence to a variable
- WHERE: eliminates items of the iteration; used for filtering
- ORDER: reorders items of the iteration
- RETURN: constructs query results; can return another XML document or even HTML

For example, using the XML document shown in Figure 2.2, an XQuery expression could be

xquery
for $g in db2-fn:xmlcolumn('GENOME.INFO')/medinfo
let $h := $g/patient[@type="c1"]/DNA/geneX/text()
return <geneXList> {$h} </geneXList>

which would return:

<geneXList>555555</geneXList>

XPath and XQuery will be discussed in more detail in Chapter 10, Mastering the DB2 pureXML Support.

2.1.3 DB2 System Commands

You use DB2 system commands for many purposes, including starting services or processes, invoking utilities, and configuring parameters. Most DB2 system commands do not require the instance (the DB2 server engine process) to be started; instances are discussed later in this chapter. DB2 system command names have the following format:

db2x

where x represents one or more characters. For example

db2start
db2set
db2icrt

NOTE: Many DB2 system commands provide a quick way to obtain syntax and help information about the command by using the -h option. For example, typing db2set -h displays the syntax of the db2set command, with an explanation of its optional parameters.

2.1.4 DB2 Command Line Processor (CLP) Commands

DB2 Command Line Processor (CLP) commands are processed by the CLP tool (introduced in the next section). These commands typically require the instance to be started, and they can be used for database and instance monitoring and for parameter configuration. For example

list applications
create database
catalog tcpip node

You invoke the Command Line Processor by entering db2 at an operating system prompt. If you enter db2 and press the Enter key, you will be working with the CLP in interactive mode, and

you can enter the CLP commands as shown above. On the other hand, if you don't want to work with the CLP in interactive mode, precede each CLP command with db2. For example

db2 list applications
db2 create database
db2 catalog tcpip node

Many books, including this one, display CLP commands as db2 CLP_command for this reason. Chapter 4, Using the DB2 Tools, explains the CLP in greater detail.

NOTE: On the Windows platform, db2 must be entered in a DB2 Command Window, not at the operating system prompt. The DB2 Command Window and the DB2 CLP are discussed in detail in Chapter 4, Using the DB2 Tools.

NOTE: A quick way to obtain syntax and help information about a CLP command is to use the question mark (?) character followed by the command. For example

db2 ? catalog tcpip node
or just
db2 ? catalog

For detailed syntax of a command, see the DB2 Command Reference manual.

2.2 DB2 TOOLS OVERVIEW

Figure 2.3 shows all the tools available from the IBM DB2 menu. The IBM DB2 menu on a Windows system can typically be displayed by choosing Start > Programs > IBM DB2. On a Linux or UNIX system, the operating system's graphical support needs to be installed. DB2's graphical interface looks the same on all platforms.

This section briefly introduces the tools presented in the IBM DB2 menu. Chapter 4, Using the DB2 Tools, covers these tools in more detail, but for now simply familiarize yourself with them.

2.2.1 Command-Line Tools

Command-line tools, as the name implies, allow you to issue DB2 commands, SQL statements, and XQuery statements from a command-line interface. The two text-based interfaces are the Command Line Processor (CLP) and the Command Window. The Command Window is available only on Windows, while the CLP is available on all platforms. The Command Editor is a graphical interface tool that provides the same functionality as the text-based tools, and more: it can also invoke the Visual Explain tool, which shows the access plan for a query.
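As a minimal sketch of the two modes of invoking the CLP (assuming the instance is already started), the same command can be run from the interactive prompt or prefixed with db2:

    db2                        (enters interactive mode; the prompt changes to db2 =>)
    db2 => list applications
    db2 => quit                (exits interactive mode)
    db2 list applications      (the same command in non-interactive mode)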

Figure 2.3 The IBM DB2 menus

2.2.2 General Administration Tools

The general administration tools allow database administrators (DBAs) to manage their database servers and databases from a central location.

- The Control Center is the most important of these tools. Not only does it support the administration of DB2 database servers on the Linux, UNIX, and Windows platforms, but also on the OS/390 and z/OS platforms. From the Control Center, database objects can be created, altered, and dropped. The tool also comes with several advisors to help you configure your system more quickly.
- The Journal tool can help investigate problems in the system. It tracks error messages and scheduled tasks that have been executed.
- The License Center summarizes the licenses installed in your DB2 system and allows you to manage them.
- The Replication Center lets you set up and manage your replication environment. Use DB2 replication when you want to propagate data from one location to another.
- The Task Center allows you to schedule tasks to be performed automatically. For example, you can arrange for a backup task to run at a time when there is minimal database activity.

2.2.3 Information Tools

The Information menu provides easy access to the DB2 documentation. The Information Center provides a fast and easy method to search the DB2 manuals. You can install the Information Center locally on your computer or intranet server, or access it via the Internet. Use the Check

for DB2 Updates menu option to obtain the most up-to-date information about updates to the DB2 product.

2.2.4 Monitoring Tools

To maintain your database system, DB2 provides several tools that can help pinpoint the cause of a problem or even detect problems proactively before they cause performance deterioration.

- The Activity Monitor allows you to monitor application performance and concurrency, resource consumption, and SQL statement execution for a database. You can more easily diagnose problems with the reports this tool generates.
- The Event Analyzer processes the information collected by an event monitor based on the occurrence of an event. For example, when two applications cannot continue their processing because each is holding resources the other needs, a deadlock event occurs. This event can be captured by an event monitor, and you can use the Event Analyzer to examine the captured data related to the deadlock and help resolve the contention. Some other events that can be captured are connections to the database, buffer pool activity, table space activity, table activity, SQL statements, and transactions.
- The Health Center detects problems before they happen by setting up thresholds which, when exceeded, cause alert notifications to be sent. The DBA can then choose to execute a recommended action to relieve the situation.
- The Indoubt Transaction Manager can help resolve issues with transactions that have been prepared but have not been committed or rolled back. This is only applicable to two-phase commit transactions.
- The Memory Visualizer tool lets you track the memory used by DB2. It plots a graph so you can easily monitor memory consumption.

2.2.5 Setup Tools

The Setup tools help you configure your system to connect to remote servers, provide tutorials, and install add-ins for development tools.

- The Configuration Assistant allows you to easily configure your system to connect to remote databases and to test the connection.
- Configure DB2 .NET Data Provider allows you to easily configure this provider for .NET applications.
- First Steps is a good starting point for new DB2 users who wish to become familiar with the product. This tool allows you to create a sample database and provides tutorials that help you familiarize yourself with DB2.
- The Register Visual Studio Add-Ins menu item lets you add a plug-in into Microsoft Visual Studio so that DB2 tools can be invoked from Visual Basic, Visual C++, and the .NET development environment. In each of these Microsoft development tools, the add-in inserts the DB2 menu entries into the tool's View, Tools, and Help menus. These

add-ins provide Microsoft Visual Studio programmers with a rich set of application development tools to create stored procedures and user-defined functions designed for DB2.
- The Default DB2 and Database Client Interface Selection Wizard (Windows only) allows you to choose the DB2 installation copy to use as the default. Starting with DB2 9, multiple DB2 installations are possible on the same machine, but one of them should be chosen as the default copy. Multiple DB2 copy installation is described in Chapter 3, Installing DB2. This tool will also help you choose the default IBM database client interface (ODBC/CLI driver and .NET provider) copy.

2.2.6 Other Tools

The following are other DB2 tools that are not invoked directly from the DB2 menus.

- Visual Explain describes the access plan chosen by the DB2 optimizer, the brain of DB2, to access and retrieve data from tables.
- SQL Assist aids new users who are not familiar with the SQL language in writing SQL queries.
- The Satellite Administration Center helps you set up and administer both satellites and the central satellite control server.
- IBM Data Studio provides an integrated data management environment that helps you develop and manage database applications throughout the data management lifecycle. Data Studio can be used to design and prototype projects and queries using data modeling diagrams. It comes with an integrated query editor that helps you build SQL and XQuery statements more easily. You can also use it to develop and debug stored procedures, user-defined functions, and Web services. IBM Data Studio supports most IBM data servers, including Informix. It is a separate, downloadable image based on the Eclipse Integrated Development Environment (IDE). It replaces the Development Center used in previous DB2 versions.
- The Data Server Administration Console (DSAC) is a Web-based tool that allows you to perform about 80 percent of your administrative tasks. It can be used to manage hundreds of IBM data servers on different platforms, such as DB2 for Linux, UNIX, and Windows, DB2 for z/OS, and IDS. It is fast, simple but powerful, and task-based. Web-based tools such as the DSAC represent the new generation of tooling that will replace existing tools in the near future.

2.3 THE DB2 ENVIRONMENT

Several items control the behavior of your database system. We first describe the DB2 environment on a single-partition database; in Section 2.6, Database Partitioning Feature (DPF), we expand the material to include concepts relevant to a multipartition database system (we don't want to overload you with information at this stage in the chapter).

Figure 2.4 provides an overview of the DB2 environment. Consider the following when you review this figure:

- The figure may look complex, but don't be overwhelmed by first impressions! Each item in the figure will be discussed in detail in the following sections.
- Since we reference Figure 2.4 throughout this chapter, we strongly recommend that you bookmark page 43. This figure is also available for free, in color, as a GIF file (Figure_2_4.gif) on the book's Web site. Consider printing it.
- The commands shown in the figure can be issued from the Command Window on Windows or the operating system prompt on Linux and UNIX. Chapter 4, Using the DB2 Tools, describes equivalent methods to perform these commands from the DB2 graphical tools.
- Each arrow points to a set of three commands. The first command in each set (in blue if you printed the figure using a color printer) inquires about the contents of a configuration file, the second command (in black) indicates the syntax to modify these contents, and the third command (in purple) illustrates how to use the command.
- The numbers in parentheses in Figure 2.4 match the superscripts in the headings in the following subsections.

2.3.1 An Instance (1)

In DB2, an instance provides an independent environment where databases can be created and applications can be run against them. Because of these independent environments, databases in separate instances can have the same name. For example, in Figure 2.4 the database called MYDB2 is associated with instance DB2, and another database called MYDB2 is associated with instance myinst. Instances allow users to have separate, independent environments for production, test, and development purposes.

When DB2 is installed on the Windows platform, an instance named DB2 is created by default. In the Linux and UNIX environments, if you choose to create the default instance, it is called db2inst1.

To create an instance explicitly, use
db2icrt instance_name

To drop an instance, use
db2idrop instance_name

To start the current instance, use
db2start

To stop the current instance, use
db2stop

When an instance is created on Linux and UNIX, logical links to the DB2 executable code are generated. For example, if the server in Figure 2.4 were a Linux or UNIX server and the instances DB2 and myinst were created, both of them would be linked to the same DB2 code.

Figure 2.4 The DB2 environment. (The figure shows a Windows machine, 'Win1', running DB2 Enterprise 9.5 in a single-partition environment, with two instances, DB2 and myinst. Each instance has its own instance-level profile registry, Database Manager Configuration file (dbm cfg), system db directory, node directory, DCS directory, port, and databases; each database, such as MYDB1 and MYDB2, has its own Database Configuration file (db cfg), logs, buffer pools, and table spaces (SYSCATSPACE, TEMPSPACE1, USERSPACE1, and user-created table spaces holding tables and indexes), with a local db directory shared by the databases on the same drive. Environment variables and the global-level profile registry sit outside both instances. Arrows annotate each item with the commands to inspect, modify, and exercise it, for example: db2set -all and db2set <parameter>=<value> -i <instance> or -g for the registries; db2 get dbm cfg and db2 update dbm cfg using <parameter> <value>; db2 list db directory and db2 catalog db <db_name> as <alias> at node <nodename>; db2 list node directory and db2 catalog tcpip node <nodename> remote <hostname/ipaddress> server <port>; db2 list dcs directory and db2 catalog dcs db as <location_name>; db2 list db directory on <path>; db2 get db cfg for <dbname> and db2 update db cfg for <dbname> using <parameter> <value>; and set DB2INSTANCE for the environment variables.)

A logical link works as an alias or pointer to another program. For non-root Linux and UNIX installations, as we will see in Chapter 3, Installing DB2, each instance has its own local copy of the DB2 code under the sqllib directory. On Windows, there is a shared install path, and all instances access the same libraries and executables.

2.3.2 The Database Administration Server

The Database Administration Server (DAS) is a daemon or process running on the database server that allows for remote graphical administration from remote clients using the Control Center. If you don't need to administer your DB2 server using a graphical interface from a remote client, you don't need to start the DAS. There can only be one DAS per server regardless of the number of instances or DB2 install copies on the server. Note that the DAS needs to be running on the data server you are planning to administer remotely, not on the DB2 client. If you choose to administer your data servers using the DSAC, the DAS is not required.

To start the DAS, use the following command:
db2admin start

To stop the DAS, use the following command:
db2admin stop

2.3.3 Configuration Files and the DB2 Profile Registries (2)

Like many other RDBMSs, DB2 uses different mechanisms to influence the behavior of the database management system. These include
- Environment variables
- DB2 profile registry variables
- Configuration parameters

Environment Variables

Environment variables are defined at the operating system level. On Windows you can create a new entry for a variable, or edit the value of an existing one, by choosing Control Panel > System > Advanced tab > Environment Variables. On Linux and UNIX you can normally add a line to the instance owner's .login or .profile initialization file that executes the script db2profile (Bourne or Korn shell) or db2cshrc (C shell), both provided after DB2 installation.

The DB2INSTANCE environment variable allows you to specify the current active instance to which all commands apply. If DB2INSTANCE is set to myinst, then issuing the command CREATE DATABASE mydb will create a database associated with instance myinst. If you wanted to create this database in instance DB2, you would first change the value of the DB2INSTANCE variable to DB2.

Using the Control Panel (Windows) or the user profile (Linux or UNIX) to set the value of an environment variable guarantees that value will be used the next time you open a window or session. If you only want to change this value temporarily while in a given window or session, you

can use the operating system set command on Windows, or export on Linux or UNIX. The command

set DB2INSTANCE=DB2       (on Windows)
or
export DB2INSTANCE=DB2    (on Linux and UNIX)

sets the value of the DB2INSTANCE environment variable to DB2. A common mistake when using these commands is to leave spaces before and/or after the equal sign (=); no spaces should be entered.

To check the current setting of this variable, you can use any of these three commands:
echo %DB2INSTANCE%   (Windows only)
set DB2INSTANCE
db2 get instance

For a list of all available instances in your system, issue the following command:
db2ilist

The DB2 Profile Registry

The word registry always causes confusion when working with DB2 on Windows. The DB2 profile registry variables, or simply the DB2 registry variables, have no relation whatsoever to the Windows Registry variables. The DB2 registry variables provide a centralized location where some key variables influencing DB2's behavior reside.

NOTE: Some of the DB2 registry variables are platform-specific.

The DB2 Profile Registry is divided into four categories:
- The DB2 instance-level profile registry
- The DB2 global-level profile registry
- The DB2 instance node-level profile registry
- The DB2 instance profile registry

The first two are the most common ones. The main difference between the global-level and the instance-level profile registries, as you can tell from their names, is the scope to which the variables apply. Global-level profile registry variables apply to all instances on the server. As you can see from Figure 2.4, this registry has been drawn outside of the two instance boxes. Instance-level profile registry variables apply to a specific instance. You can see separate instance-level profile registry boxes inside each of the two instances in the figure.

To view the current DB2 registry variables, issue the following command from the CLP:
db2set -all

You may get output like this:
[i] DB2INSTPROF=C:\PROGRAM FILES\SQLLIB
[g] DB2SYSTEM=PRODSYS

As you may have already guessed, [i] indicates the variable has been defined at the instance level, while [g] indicates that it has been defined at the global level. The following are a few other commands related to the DB2 registry variables.

To view all the registry variables that can be defined in DB2, use this command:
db2set -lr

To set the value of a specific variable (in this example, DB2INSTPROF) at the global level, use
db2set DB2INSTPROF="C:\PROGRAM FILES\SQLLIB" -g

To set a variable at the instance level for instance myinst, use
db2set DB2INSTPROF="C:\MY FILES\SQLLIB" -i myinst

Note that in the above commands the same variable has been set at both levels: the global level and the instance level. When a registry variable is defined at different levels, DB2 will always choose the value at the lowest level, in this case the instance level. For the db2set command, as with the set command discussed earlier, there are no spaces before or after the equal sign.

Some registry variables require you to stop and start the instance (db2stop/db2start) for the change to take effect. Other registry variables do not have this requirement. Refer to the DB2 Administration Guide: Performance for a list of variables that have this requirement.

Configuration Parameters

Configuration parameters are defined at two different levels: the instance level and the database level. The parameters at each level are different (unlike DB2 registry variables, where the same variables can be defined at different levels).

At the instance level, parameters are stored in the Database Manager Configuration file (dbm cfg). Changes to these parameters affect all databases associated with that instance, which is why Figure 2.4 shows a Database Manager Configuration file box defined per instance and outside the databases.

To view the contents of the Database Manager Configuration file, issue the command:
db2 get dbm cfg

To update the value of a specific parameter, use
db2 update dbm cfg using parameter value

For example
db2 update dbm cfg using INTRA_PARALLEL YES
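One way to verify a dbm cfg change (a minimal sketch, assuming a local instance named myinst that is already started) is to attach to the instance and display each parameter's current and delayed values:

    db2 attach to myinst
    db2 get dbm cfg show detail
    db2 detach

For a parameter that is not configurable online, the new value appears only in the delayed column until the instance is stopped and started.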

Many of the Database Manager Configuration parameters are now configurable online, meaning the changes are dynamic and you don't need to stop and start the instance. From the Control Center, described in Chapter 4, Using the DB2 Tools, you can access the Database Manager (DBM) Configuration panel (right-click on the desired instance and choose Configure Parameters), which has a Dynamic column that indicates whether each parameter is configurable online or not. In addition, many of these parameters are also Automatic, meaning that DB2 automatically calculates the best value depending on your system.

At the database level, parameter values are stored in the Database Configuration file (db cfg). Changes to these parameters only affect the specific database the Database Configuration file applies to. In Figure 2.4 you can see there is a Database Configuration file box inside each of the databases defined.

To view the contents of the Database Configuration file, issue the command:
db2 get db cfg for dbname

For example
db2 get db cfg for mydb2

To update the value of a specific parameter, use
db2 update db cfg for dbname using parameter value

For example
db2 update db cfg for mydb2 using MINCOMMIT 3

Many of these parameters are configurable online, meaning that the change is dynamic and you no longer need to disconnect all connections to the database for the change to take effect. From the Control Center you can access the Database Configuration panel (right-click on the desired database and choose Configure Parameters), which has a Dynamic column indicating whether each parameter is configurable online or not. Many of these parameters are also Automatic, where DB2 configures the parameter for you.

2.3.4 Connectivity and DB2 Directories (3)

In DB2, directories are used to store connectivity information about databases and the servers on which they reside. There are four main directories, which are described in the following subsections. The corresponding commands to set up database and server connectivity are also included; however, many users find the Configuration Assistant graphical tool very convenient for setting up database and server connectivity. Chapter 6, Configuring Client and Server Connectivity, discusses all the commands and concepts described in this section in detail, including the Configuration Assistant.

System Database Directory

The system database directory (or system db directory) is like the main table of contents that contains information about all the databases to which you can connect from your DB2 server. As you can see from Figure 2.4, the system db directory is stored at the instance level.

To list the contents of the system db directory, use the command:
db2 list db directory

Any entry in the output of this command containing the word Indirect indicates that the entry is for a local database, that is, a database that resides on the data server on which you are working. The entry also points to the local database directory indicated by the Database drive item (Windows) or Local database directory item (Linux or UNIX). Any entry containing the word Remote indicates that the entry is for a remote database, a database residing on a server other than the one on which you are currently working. The entry also points to the node directory entry indicated by the Node name item.

To enter information into the system database directory, use the catalog command:
db2 catalog db dbname as alias at node nodename

For example
db2 catalog db mydb as yourdb at node mynode

The catalog commands are normally used only when adding information for remote databases. For local databases, a catalog entry is automatically created when the database is created with the CREATE DATABASE command.

Local Database Directory

The local database directory contains information about databases residing on the server where you are currently working. Figure 2.4 shows the local database directory overlapping the database box. This means that there will be one local database directory associated with all of the databases residing in the same location (the drive on Windows or the path on Linux or UNIX). The local database directory does not reside inside the database itself, but it does not reside at the instance level either; it is in a layer between these two. (After you read Section 2.3.10, The Internal Implementation of the DB2 Environment, it will be easier to understand this concept.)

Note also from Figure 2.4 that there is no specific command used to enter information into this directory, only one to retrieve it. When you create a database with the CREATE DATABASE command, an entry is added to this directory.

To list the contents of the local database directory, issue the command:
db2 list db directory on drive / path

where drive can be obtained from the item Database drive (Windows) or path from the item Local database directory (Linux or UNIX) in the corresponding entry of the system db directory.

Node Directory

The node directory stores all connectivity information for remote database servers. For example, if you use the TCP/IP protocol, this directory shows entries such as the host name or IP address of the server where the database to which you want to connect resides, and the port number of the associated DB2 instance.

To list the contents of the node directory, issue the command:
db2 list node directory

To enter information into the node directory, use
db2 catalog tcpip node node_name remote hostname_or_IP_address server service_name_or_port_number

For example
db2 catalog tcpip node mynode remote <hostname> server <port_number>

You can obtain the port number of the remote instance to which you want to connect by looking at the SVCENAME parameter in the Database Manager Configuration file of that instance. If this parameter contains a string value rather than the port number, you need to look for the corresponding entry in the TCP/IP services file that maps this string to the port number.

Database Connection Services Directory

The Database Connection Services (DCS) directory contains connectivity information for host databases residing on a System z (z/OS or OS/390) or System i (OS/400) server. You need to have the DB2 Connect software installed.

To list the contents of the DCS directory, issue the following command:
db2 list dcs directory

To enter information into the DCS directory, use
db2 catalog dcs db dbname as location_name

For example
db2 catalog dcs db mydb as db1g

2.3.5 Databases (4)

A database is a collection of information organized into interrelated objects like table spaces, tables, and indexes. Databases are closed and independent units associated with an instance. Because of this independence, objects in two or more databases can have the same name. For example, Figure 2.4 shows a table space called MyTablespace1 inside the database MYDB1 associated with instance DB2. Another table space with the name MyTablespace1 is also used inside the database MYDB2, which is also associated with instance DB2.

Since databases are closed units, you cannot perform queries involving tables of two different databases in a direct way. For example, a query involving Table1 in database MYDB1 and TableZ in database MYDB2 is not readily allowed. For an SQL statement to work against tables of different databases, you need to use federation (see Section 2.4, Federation).
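As a sketch of this independence (a minimal Windows example, assuming the instances and databases from Figure 2.4 exist and both instances are started), the same database name can be used in two instances, and which MYDB2 you reach depends only on the current instance:

    set DB2INSTANCE=DB2
    db2 connect to mydb2      (the MYDB2 associated with instance DB2)
    db2 terminate
    set DB2INSTANCE=myinst
    db2 connect to mydb2      (a different database: the MYDB2 in instance myinst)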

You create a database with the command CREATE DATABASE. This command automatically creates three table spaces, a buffer pool, and several configuration files, which is why this command can take a few seconds to complete.

NOTE: While CREATE DATABASE looks like an SQL statement, it is considered a DB2 CLP command.

2.3.6 Table Spaces (5)

Table spaces are logical objects used as a layer between logical tables and physical containers. Containers are where the data is physically stored, in files, directories, or raw devices. When you create a table space, you can associate it with a specific buffer pool (database cache) and with specific containers.

Three table spaces, SYSCATSPACE (holding the Catalog tables), TEMPSPACE1 (system temporary space), and USERSPACE1 (the default user table space), are automatically created when you create a database. SYSCATSPACE and TEMPSPACE1 can be considered system structures, as they are needed for the normal operation of your database. SYSCATSPACE contains the catalog tables containing metadata (data about your database objects) and must exist at all times. Some other RDBMSs call this structure a data dictionary.

NOTE: Do not confuse the term catalog in this section with the catalog command mentioned earlier; they have no relationship at all.

A system temporary table space is the work area for the database manager to perform operations, such as joins and overflowed sorts. There must be at least one system temporary table space in each database.

The USERSPACE1 table space is created by default, but you can delete it. To create a table in a given table space, use the CREATE TABLE statement with the IN table_space_name clause. If a table space is not specified in this statement, the table will be created in the first user-created table space. If you have not yet created a table space, the table will be created in the USERSPACE1 table space.

Figure 2.4 shows other table spaces that were explicitly created with the CREATE TABLESPACE statement (in brown in the figure if you printed the softcopy version). Chapter 8, The DB2 Storage Model, discusses table spaces in more detail.

2.3.7 Tables, Indexes, and Large Objects (6)

A table is an unordered set of data records consisting of columns and rows. An index is an ordered set of pointers associated with a table, and is used for performance purposes and to ensure uniqueness. Non-traditional relational data, such as video, audio, and scanned

documents, are stored in tables as large objects (LOBs). Tables and indexes reside in table spaces. Chapter 8 describes these in more detail.

2.3.8 Logs (7)

Logs are used by DB2 to record every operation against a database. In case of a failure, logs are crucial to recover the database to a consistent point. See Chapter 14, Developing Database Backup and Recovery Solutions, for more information about logs.

2.3.9 Buffer Pools (8)

A buffer pool is an area in memory where all index and data pages other than LOBs are processed. DB2 retrieves LOBs directly from disk. Buffer pools are one of the most important objects to tune for database performance. Chapter 8, The DB2 Storage Model, discusses buffer pools in more detail.

2.3.10 The Internal Implementation of the DB2 Environment

We have already discussed DB2 registry variables, configuration files, and instances. In this section we illustrate how some of these concepts physically map to directories and files in the Windows environment. The structure is a bit different in Linux and UNIX environments, but the main ideas are the same. Figures 2.5, 2.6, and 2.7 illustrate the DB2 environment internal implementation that corresponds to Figure 2.4.

Figure 2.5 shows the directory where DB2 was installed: H:\Program Files\IBM\SQLLIB. The SQLLIB directory contains several subdirectories and files that belong to DB2, including the binary code that makes DB2 work, and a subdirectory that is created for each instance on the server. For example, in Figure 2.5 the subdirectories DB2 and MYINST correspond to the instances DB2 and myinst, respectively. The DB2DAS00 subdirectory corresponds to the DAS.

At the top of the figure there is a directory H:\MYINST. This directory contains all the databases created under the H: drive for instance myinst. Similarly, the H:\DB2 directory contains all the databases created under the H: drive for instance DB2.

Figure 2.6 shows an expanded view of the H:\Program Files\IBM\SQLLIB\DB2 directory. This directory contains information about the instance DB2. The db2systm binary file contains the database manager configuration (dbm cfg). The other two files highlighted in the figure (db2nodes.cfg and db2diag.log) are discussed later in this book. For now, the description of these files in the figure is sufficient. The figure also points out the directories where the system database, Node, and DCS directories reside. Note that the Node and DCS directories don't exist if they don't have any entries.

In Figure 2.7, the H:\DB2 and H:\MYINST directories have been expanded. The subdirectories SQL00001 and SQL00002 under H:\DB2\NODE0000 correspond to the two databases created

under instance DB2.

Figure 2.5 The internal implementation environment for DB2 for Windows. (The figure's callouts identify H:\DB2 and H:\MYINST as the directories holding the databases created under the H: drive for instances DB2 and myinst, and, under SQLLIB, the subdirectories holding the instance information for DB2 and myinst and the information for the DAS.)

To map these directory names to the actual database names, you can review the contents of the local database directory with this command:
list db directory on h:

Figure 2.6 Expanding the DB2 instance directory. (The figure's callouts identify db2nodes.cfg, a file used in a multipartition environment to describe the partitions and servers; db2diag.log, a file that logs DB2 error messages and is used by DB2 Technical Support and experienced DBAs; the db2systm file holding the Database Manager Configuration; and the locations of the system database, Node, and DCS directories.)

Chapter 6, Configuring Client and Server Connectivity, shows sample output of this command. Note that the local database directory is stored in the subdirectory SQLDBDIR. This subdirectory is at the same level as each of the database subdirectories; therefore, when a database is dropped, this subdirectory is not dropped. Figure 2.7 shows two SQLDBDIR subdirectories, one under H:\DB2\NODE0000 and another one under H:\MYINST\NODE0000.

Knowing how the DB2 environment is internally implemented can help you understand DB2 concepts better. For example, looking back at Figure 2.4 (the one you should have printed!), what would happen if you dropped the instance DB2? Would this mean that databases MYDB1 and MYDB2 are also dropped? The answer is no. Figure 2.5 clearly shows that the directory where the instance information resides (H:\Program Files\IBM\SQLLIB\DB2) and the directory where the data resides (H:\DB2) are totally different. When an instance is dropped, only the subdirectory created for that instance is dropped.

Similarly, let's say you uninstall DB2 at a given time, and later you reinstall it on the same drive. After reinstallation, can you access the old databases created before you uninstalled DB2 the first time? The answer is yes. When you uninstalled DB2, you removed the SQLLIB directory; therefore the DB2 binary code, as well as the instance subdirectories, were removed, but the databases were left untouched. When you reinstall DB2, a new SQLLIB directory is created with a new default DB2 instance; no other instance is created. The new DB2 instance will have a new, empty system database directory (db2systm). So even though the directories containing the database data were left intact, you need to explicitly put the information in the DB2 system

database directory for DB2 to recognize the existence of these databases.

Figure 2.7 Expanding the directories containing the database data. (The figure's callouts identify the Database Configuration file (db cfg) for database SQL00001 under instance DB2, and the local database directories for the databases under instances DB2 and MYINST.)

For example, if you would like to access the MYDB1 database of the DB2 instance, you need to issue this command to add an entry to the system database directory:
catalog db mydb1 on h:

If the database you want to access is MYDB2, which was in the myinst instance, you would first need to create this instance, switch to the instance, and then issue the catalog command as shown below.
db2icrt myinst
set DB2INSTANCE=myinst
catalog db mydb2 on h:

It is a good practice to back up the contents of all your configuration files as shown below.
db2 get dbm cfg > dbmcfg.bk
db2 get db cfg for database_name > dbcfg.bk
db2set -all > db2set.bk
db2 list db directory > systemdbdir.bk
db2 list node directory > nodedir.bk
db2 list dcs directory > dcsdir.bk

Notice that all of these commands redirect the output to a text file with a .bk extension.

CAUTION: The purpose of this section is to help you understand the DB2 environment by describing its internal implementation. We strongly suggest that you do not tamper with the files and directories discussed in this section. You should only modify the files using the commands described in earlier sections.

2.4 FEDERATION

Database federated support in DB2 allows tables from multiple databases to be presented as local tables to a DB2 server. The databases may be local or remote; they can also belong to different RDBMSs. While Chapter 1, Introduction to DB2, briefly introduced federated support, this section provides an overview of how federation is implemented.

First of all, make sure that your server allows federated support: the database manager parameter FEDERATED must be set to YES.

DB2 uses NICKNAME, SERVER, WRAPPER, and USER MAPPING objects to implement federation. Let's consider the example illustrated in Figure 2.8. The DB2 user db2user connects to the database db2db and then issues the following statement:

SELECT * FROM remote_sales

The table remote_sales, however, is not a local table but a nickname, which is a pointer to a table in another database, possibly on another server and from a different RDBMS. A nickname is created with the CREATE NICKNAME statement, and requires a SERVER object (aries in the example) and the schema and table name to be accessed at this server (csmmgr.sales).

A SERVER object is associated with a WRAPPER. A wrapper is associated with a library that contains all the code required to connect to a given RDBMS. For IBM databases like Informix, these wrappers or libraries are provided with DB2. For other RDBMSs, you need to obtain the IBM WebSphere Federation Server software. In Figure 2.8, the wrapper called informix was created, and it is associated with the library db2informix.dll.

To access the Informix table csmmgr.sales, however, you cannot use the DB2 user ID and password directly. You need to establish a mapping between the DB2 user ID and an Informix user ID that has the authority to access the desired table. This is achieved with the CREATE USER MAPPING statement. Figure 2.8 shows how the DB2 user db2user and the Informix user informixuser are associated with this statement.

Figure 2.8 An overview of a federation environment. (On the DB2 side, the figure shows this sequence of statements:

CONNECT TO db2db USER db2user USING db2psw
SELECT * FROM remote_sales
CREATE NICKNAME remote_sales FOR "aries"."csmmgr"."sales"
CREATE USER MAPPING FOR "db2user" SERVER "aries" OPTIONS (REMOTE_AUTHID 'informixuser', REMOTE_PASSWORD 'informixpsw')
CREATE SERVER "aries" WRAPPER "informix" ...
CREATE WRAPPER "informix" LIBRARY 'db2informix.dll'

A wrapper includes the client code of the other RDBMS. On the Informix side, this maps to a local connection from the Informix client to the Informix database informixdb, with user ID informixuser and password informixpsw, and the query SELECT * FROM csmmgr.sales against the table csmmgr.sales.)

2.5 CASE STUDY: THE DB2 ENVIRONMENT

NOTE: Several assumptions have been made in this case study and the rest of the case studies in this book, so if you try to follow them, some steps may not work for you. If you do follow some or all of the steps in the case studies, we recommend you use a test computer system.

You recently attended a DB2 training class and would like to try things out on your own laptop at the office. Your laptop is running Windows Vista and DB2 Enterprise has been installed. You open the Command Window and take the following steps.

1. First, you want to know how many instances you have on your computer, so you enter
db2ilist

2. Then, to find out which of these instances is the current active one, you enter
db2 get instance
With the db2ilist command, you found out there were two instances defined on this computer, DB2 and myinst. With the db2 get instance command, you learned that the DB2 instance is the current active instance.

3. You would now like to list the databases in the myinst instance. Since this one is not the current active instance, you first switch to this instance temporarily in the current Command Window:
set DB2INSTANCE=myinst

4. You again issue db2 get instance to check that myinst is now the current instance, and you start it using the db2start command.

5. To list the databases defined on this instance you issue
db2 list db directory
This command shows that you only have one database (MYDB2) in this instance.

6. You want to try creating a new database called TEMPORAL, so you execute
db2 create database temporal
The creation of the database takes some time because several objects are created by default inside the database. Issuing another list db directory command now shows two databases: MYDB2 and TEMPORAL.

7. You connect to the MYDB2 database (db2 connect to mydb2) and check which tables you have in this database (db2 list tables for all). You also check how many table spaces are defined (db2 list tablespaces).

8. Next, you want to review the contents of the database configuration file (db cfg) for the MYDB2 database:
db2 get db cfg for mydb2

9. To review the contents of the Database Manager Configuration file (dbm cfg) you issue
db2 get dbm cfg

10. At this point, you want to practice changing the value of a dbm cfg parameter, so you pick the INTRA_PARALLEL parameter, which has a value set to NO. You change its value to YES as follows:
db2 update dbm cfg using INTRA_PARALLEL YES

11. You learned in the class that this parameter is not configurable online, so you know you have to stop and start the instance. Since there is a connection to a database in the current instance (remember you connected to the MYDB2 database earlier from your current Command Window), DB2 will not allow you to stop the instance. You enter the following sequence of commands:
db2 terminate   (terminates the connection)
db2stop
db2start

And that's it! In this case study you have reviewed some basic instance commands like db2ilist and get instance. You have also reviewed how to switch to another instance, create and connect to a database, list the databases in the instance, review the contents of the database configuration file and the database manager configuration file, update a database manager configuration file parameter, and stop and start an instance.

2.6 DATABASE PARTITIONING FEATURE

In this section we introduce you to the Database Partitioning Feature (DPF) available on DB2 Enterprise. DPF lets you partition your database across multiple servers or within a large SMP server. This allows for scalability, since you can add new servers and spread your database across them. That means more CPUs, more memory, and more disks from each of the additional servers for your database!

DB2 Enterprise with DPF is ideal for managing large databases, whether you are doing data warehousing, data mining, online analytical processing (OLAP), also known as Decision Support Systems (DSS), or working with online transaction processing (OLTP) workloads. You do not have to install any new code to use DPF, but you must purchase a license for this feature before enabling it. Users connect to the database and issue queries as usual, without needing to know that the database is spread among several partitions.

Up to this point, we have been discussing a single-partition environment and its concepts, all of which apply to a multipartition environment as well. We will now point out some implementation differences and introduce a few new concepts, including database partitions, partition groups, and the coordinator partition, that are relevant only to a multipartition environment.

Database Partitions

A database partition is an independent part of a partitioned database with its own data, configuration files, indexes, and transaction logs. You can assign multiple partitions across several physical servers or to a single physical server. In the latter case, the partitions are called logical partitions, and they can share the server's resources.

A single-partition database is a database with only one partition. We described the DB2 environment for this type of database in Section 2.3, The DB2 Environment. A multipartition database (also referred to as a partitioned database) is a database with two or more database partitions. Depending on your hardware environment, there are several topologies for database partitioning. Figure 2.9 shows single-partition and multipartition configurations with one partition per server. The illustration at the top of the figure shows an SMP server with one partition (a single-partition environment), meaning the entire database resides on that one server.

Figure 2.9 (diagram) contrasts two configurations: a single-partition configuration, with one partition on a Symmetric Multiprocessor (SMP) machine (CPUs, memory, database partition, and disk), and a multipartition configuration with one partition per machine, in which multiple partitions reside on a cluster of SMP machines interconnected by a communication facility (also known as a Massively Parallel Processing (MPP) environment).

Figure 2.9 Database partition configurations with one partition per server

The illustration at the bottom of the figure shows two SMP servers, one partition per server (a multipartition environment). This means the database is split between the two partitions.

NOTE: In Figure 2.9, the symmetric multiprocessor (SMP) systems could be replaced by uniprocessor systems.

Figure 2.10 shows multipartition configurations with multiple partitions per server. Unlike Figure 2.9, where there was only one partition per server, this figure illustrates two (or more) partitions per server.

To visualize how a DB2 environment is split in a DPF system, Figure 2.11 illustrates a partial reproduction of Figure 2.4, and shows it split into three physical partitions, one partition per server. (We have changed the server in the original Figure 2.4 to use the Linux operating system instead of the Windows operating system.)

Figure 2.10 (diagram) illustrates multipartition configurations with several partitions per machine: multiple partitions on one big SMP machine, and multiple partitions on a cluster of big SMP machines (MPP), with each machine hosting two database partitions that communicate through a communication facility.

Figure 2.10 Database partition configurations with multiple partitions per server

NOTE: Because we reference Figure 2.11 throughout this section, we recommend that you bookmark page 61. Alternatively, since this figure is available in color on the book's Web site (Figure_2_11.gif), consider printing it.

Figure 2.11 (diagram) shows the DB2 environment split across three Linux servers (Linux1, Linux2, and Linux3, the NFS source server), each running DB2 Enterprise 9.5. Each server holds environment variables and the global-level profile registry; the instance myInst with its instance-level profile registry, Database Manager Configuration file (dbm cfg), system db directory, node directory, and DCS directory; and the database MYDB2 with its Database Configuration file (db cfg), local db directory, buffer pool(s), logs, and the table spaces SYSCATSPACE, TEMPSPACE1, USERSPACE1, MyTablespace1 (Table1, Table2, Index1, Index2), MyTablespace2 (Table3), and MyTablespace3 (Index3).

Figure 2.11 The DB2 environment on DB2 Enterprise with DPF

In Figure 2.11, the DB2 environment is split so that it now resides on three servers running the same operating system (Linux, in this example). The partitions are also running the same DB2 version, but it is important to note that different Fix Pack levels are allowed. This figure shows where files and objects would be located on a new installation of a multipartition system.
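Since different Fix Pack levels are allowed across partitions, it can be worth verifying what each server actually runs. A minimal sketch, assuming the three server names from Figure 2.11 and that the instance owner can reach each server over ssh:

   # run db2level on every participating server to see the version and Fix Pack
   for host in Linux1 Linux2 Linux3; do
       ssh $host db2level
   done

The db2level command reports the version and Fix Pack level of the installed DB2 code.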

It is also important to note that all of the servers participating in a DPF environment have to be interconnected by a high-speed communication facility that supports the TCP/IP protocol. TCP/IP ports are reserved on each server for this interpartition communication. For example, by default after installation, the services file on Linux (/etc/services) is updated as follows (assuming you chose to create the db2inst1 instance):

   DB2_db2inst1        60000/tcp
   DB2_db2inst1_1      60001/tcp
   DB2_db2inst1_2      60002/tcp
   DB2_db2inst1_END    60003/tcp
   db2c_db2inst1       50000/tcp

This will vary depending on the number of partitions on the server. By default, ports 60000 through 60003 are reserved for interpartition communication. You can update the services file with the correct number of entries to support the number of partitions you are configuring. Even when the partitions reside on the same server, communication between the partitions still requires this setup.

For a DB2 client to connect to a DPF system, you issue catalog commands at the client to populate the system and node directories. In the example, the port number to use in these commands is 50000 (the db2c_db2inst1 entry) to connect to the db2inst1 instance, and the host name can be any of the servers participating in the DPF environment. The server used in the catalog command becomes the coordinator, unless the DBPARTITIONNUM option of the connect statement is used. The concept of the coordinator is described later in this section. Chapter 6, Configuring Client and Server Connectivity, discusses the catalog command in detail.

NOTE: Each of the servers participating in the DPF environment has its own separate services file, but the entries in those files that are applicable to DB2 interpartition communication must be the same.

The Node Configuration File

The node configuration file (db2nodes.cfg) contains information about the database partitions and the servers on which they reside that belong to an instance. Figure 2.12 shows an example of the db2nodes.cfg file for a cluster of four UNIX servers with two partitions on each server.

In Figure 2.12, the partition number, the first column in the db2nodes.cfg file, indicates the number that identifies the database partition within DB2. You can see that there are eight partitions in total. The numbering of the partitions must be in ascending order, can start from any number, and gaps between the numbers are allowed. The numbering used is important, as it will be taken into consideration in commands or SQL statements.

The db2nodes.cfg file resides on the DB2 instance-owning machine under ...sqllib/db2nodes.cfg. For the cluster in Figure 2.12 it contains:

   0 myservera 0
   1 myservera 1
   2 myserverb 0
   3 myserverb 1
   4 myserverc 0
   5 myserverc 1
   6 myserverd 0
   7 myserverd 1

The columns are the partition number, the hostname or IP address, and the logical port. The figure also shows how the entries for partitions 2 and 3 map to the physical machine myserverb (CPUs, memory, communication facility, and disk), where the two partitions listen on logical ports 0 and 1.

Figure 2.12 An example of the db2nodes.cfg file

The second column is the hostname or TCP/IP address of the server where the partition is created.

The third column, the logical port, is required when you create more than one partition on the same server. This column specifies the logical port for the partition within the server and must be unique within a server. In Figure 2.12, you can see the mapping between the db2nodes.cfg entries for partitions 2 and 3 for server myserverb and the physical server implementation. The logical ports must also be in the same order as in the db2nodes.cfg file.

The fourth column in the db2nodes.cfg file, the netname, is required if you are using a high-speed interconnect for interpartition communication or if the resourcesetname column is used.

The fifth column in the db2nodes.cfg file, the resourcesetname, is optional. It specifies the operating system resource that the partition should be started in.

On Windows, the db2nodes.cfg file uses the computer name column instead of the resourcesetname column. The computer name column stores the computer name for the server on which a partition resides. Also, the order of the columns is slightly different: partition number, hostname, computer name, logical port, netname, and resourcesetname.

The db2nodes.cfg file must be located
   under the SQLLIB directory for the instance owner on Linux and UNIX
   under the SQLLIB\instance_name directory on Windows

In Figure 2.11 this file would be on the Linux3 server, as this server is the Network File System (NFS) source server, the server whose disk(s) can be shared.
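To illustrate the Windows column order just described, a db2nodes.cfg for one Windows server with two logical partitions might look like the following sketch (the names winsrva and WINSRVA are hypothetical):

   0 winsrva WINSRVA 0
   1 winsrva WINSRVA 1

Here the columns shown are partition number, hostname, computer name, and logical port; the optional netname and resourcesetname columns are omitted.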

On Linux and UNIX you can edit the db2nodes.cfg file with any ASCII editor or use DB2 commands to update the file. On Windows, you can only use the db2ncrt and db2ndrop commands to create and drop database partitions; the db2nodes.cfg file should not be edited directly. For any platform, you can also use the db2start command to add or remove a database partition from the DB2 instance and update the db2nodes.cfg file, using the add dbpartitionnum and drop dbpartitionnum clauses respectively.

An Instance in the DPF Environment

Partitioning is a concept that applies to the database, not the instance; you partition a database, not an instance. In a DPF environment an instance is created once on an NFS source server. The instance owner's home directory is then exported to all servers where DB2 is to be run. Each partition in the database has the same characteristics: the same instance owner, password, and shared instance home directory.

On Linux and UNIX, an instance maps to an operating system user; therefore, when an instance is created, it will have its own home directory. In most installations /home/user_name is the home directory. All instances created on each of the participating servers in a DPF environment must use the same name and password. In addition, you must specify the home directory of the corresponding operating system user to be the same directory for all instances, and it must be created on a shared file system. Figure 2.13 illustrates an example of this.

In Figure 2.13, the instance myinst has been created on the shared file system, and myinst maps to an operating system user of the same name, which in the figure has a home directory of /home/myinst. This user must be created separately on each of the participating servers, but the servers must share the instance home directory. As shown in Figure 2.13, all three Linux servers share the directory /home/myinst, which resides on a shared file system local to Linux3. Since the instance owner directory is stored locally on the Linux3 server, this server is considered to be the DB2 instance-owning server.

Figure 2.13 (diagram) shows servers Linux1, Linux2, and Linux3, each running instance myinst with home directory /home/myinst. Each server keeps /home/db2as on its own local disk, while /home/myinst (containing /sqllib and so on) resides on the shared file system local to Linux3.

Figure 2.13 An instance in a partitioned environment
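Before moving on, here is a minimal sketch of how such an instance might be created. The user, group, and installation path names are hypothetical, and the NFS export of the home directory is assumed to already be in place:

   # on every participating server: same user, same shared home directory
   groupadd db2iadm
   useradd -g db2iadm -d /home/myinst myinst

   # on the NFS source server only: create the instance once, as root
   /opt/ibm/db2/V9.5/instance/db2icrt -u myinst myinst

After this, the instance directory /home/myinst/sqllib is visible from all servers through the shared file system.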

Figure 2.13 also shows that the Database Administration Server (DAS) user db2as is created locally on each participating server in a DPF environment. There can be only one DAS per physical server regardless of the number of partitions that server contains. The DAS user's home directory cannot be mounted on a shared file system, and different userids and passwords can be used to create the DAS on different servers.

NOTE: Make sure the passwords for the instances are the same on each of the participating servers in a DPF environment; otherwise the partitioned system will look like it is hanging because the partitions are not able to communicate.

Partitioning a Database

When you want to partition a database in a DPF environment, simply issue the CREATE DATABASE command as usual. For example, if the instance owner home directory is /home/myinst, when you execute this command:

   CREATE DATABASE mydb2

the structure created is as shown in Figure 2.14:

   /home
     /myinst
       /NODE0000
         /SQL00001
       /NODE0001
         /SQL00001
       /NODE0002
         /SQL00001

Figure 2.14 A partitioned database in a single file system

If you don't specify a path in your CREATE DATABASE command, by default the database is created in the directory specified by the database manager configuration parameter DFTDBPATH, which defaults to the instance owner's home directory. This partitioning is not optimal because all of the database data would reside in one file system that is shared by the other servers across a network.

We recommend that you create a directory with the same name locally on each of the participating servers. For the environment in Figure 2.13, let's assume the directory /data has been created locally on each server. When you execute the command

   CREATE DATABASE mydb2 ON /data

the following directory structure is automatically built for you:

   /data/instance_name/NODExxxx/SQLyyyyy

The /data directory is specified in the CREATE DATABASE command, but the directory must exist before the command is executed.

instance_name is the name of the instance, for example, myinst.

NODExxxx distinguishes which partition you are working with, where xxxx represents the number of the partition specified in the db2nodes.cfg file.

SQLyyyyy identifies the database, where yyyyy represents a number. If you have only one database on your system, then yyyyy is equal to 00001; if you have three databases on your system, you will have three different directories: SQL00001, SQL00002, and SQL00003. To map the database names to these directories, you can review the local database directory using the following command:

   list db directory on /data

Inside the SQLyyyyy directories are subdirectories for table spaces, and within them, files containing database data, assuming all table spaces are defined as system-managed space (SMS).

Figure 2.15 (diagram) shows the partitioned database created in the local /data directory of each server: Linux1 holds /data/myinst/NODE0000/SQL00001, Linux2 holds /data/myinst/NODE0001/SQL00001, and Linux3 holds /data/myinst/NODE0002/SQL00001, while all three servers continue to share the instance home directory /home/myinst (/sqllib, and so on).

Figure 2.15 A partitioned database across several file systems

NOTE: Before creating a database, be sure to change the value of the dbm cfg parameter DFTDBPATH to an existing path created locally with the same name on each of the participating servers of your DPF system. Alternatively, make sure to include this path in your CREATE DATABASE command. Similarly, to create the SAMPLE database, specify this path in the command db2sampl path, where path is that local directory.

Partitioning a database is described in more detail in Chapter 8, The DB2 Storage Model.
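For example, the note above could translate into the following commands, a sketch that assumes /data is the locally created path used throughout this section:

   db2 update dbm cfg using DFTDBPATH /data
   db2 create database mydb2

With DFTDBPATH set this way, the CREATE DATABASE command no longer needs an explicit path to keep the data off the shared file system.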

Configuration Files in a DPF Environment

As shown in Figure 2.11, the Database Manager Configuration file (dbm cfg), system database directory, node directory, and DCS directory are all part of the instance-owning server and are not partitioned. What about the other configuration files?

Environment variables: Each participating server in a partitioned environment can have different environment variables.

Global-level profile registry variable: This is stored in a file called default.env that is located in a subdirectory under the /var directory. There is a local copy of this file on each server.

Database configuration file: This is stored in the file SQLDBCON, located in the SQLyyyyy directory for the database. In a partitioned database environment, a separate SQLDBCON file is created for each partition in every database.

The local database directory: This is stored in the file SQLDBDIR, in the corresponding directory for the database. It has the same name as the system database directory, which is located under the instance directory. A separate SQLDBDIR file exists for each partition in each database.

CAUTION: We strongly suggest you do not manually edit any of the DB2 configuration files. You should modify the files using the commands described in earlier sections.

NOTE: The values of the global-level profile registry variables, database configuration file parameters, and local database directory entries should be the same for each database partition.

Logs in a DPF Environment

The logs on each database partition should be kept in a separate place. The database configuration parameter Path to log files (LOGPATH) on each partition should point to a local file system, not a shared file system. The default log path in each partition includes a NODE000x subdirectory. For example, the values of this parameter in the DPF system shown in Figure 2.11 could be:

   For Partition 0: /datalogs/db2inst1/NODE0000/SQL00001/SQLOGDIR/
   For Partition 1: /datalogs/db2inst1/NODE0001/SQL00001/SQLOGDIR/
   For Partition 2: /datalogs/db2inst1/NODE0002/SQL00001/SQLOGDIR/

To change the path for the logs, update the database configuration parameter NEWLOGPATH.
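For example, to move the logs of the MYDB2 database onto a local file system on every partition, a sketch like the following could be used (the /datalogs path is an assumption consistent with the values above; db2_all, which runs a command on all partitions, is introduced later in this section):

   db2_all ";db2 update db cfg for mydb2 using NEWLOGPATH /datalogs/db2inst1"

DB2 appends the NODE000x subdirectory for each partition automatically, so every partition ends up logging under its own local directory.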

The Catalog Partition

As stated previously, when you create a database, several table spaces are created by default. One of them, the catalog table space SYSCATSPACE, contains the DB2 system catalogs. In a partitioned environment, SYSCATSPACE is not partitioned but resides on one partition known as the catalog partition. The partition from which the CREATE DATABASE command is issued becomes the catalog partition for the new database. All access to system tables must go through this catalog partition. Figure 2.11 shows SYSCATSPACE residing on server Linux1, so the CREATE DATABASE command was issued from this server.

For an existing database, you can determine which partition is the catalog partition by issuing the command list db directory. The output of this command has the field Catalog database partition number for each of the entries, which indicates the catalog partition number for that database.

Partition Groups

A partition group is a logical layer that allows the grouping of one or more database partitions. A database partition can belong to more than one partition group. When a database is created, DB2 creates three default partition groups, and these partition groups cannot be dropped.

IBMDEFAULTGROUP: This is the default partition group for any table you create. It contains all database partitions defined in the db2nodes.cfg file. This partition group cannot be modified. Table space USERSPACE1 is created in this partition group.

IBMTEMPGROUP: This partition group is used by all system temporary tables. It contains all database partitions defined in the db2nodes.cfg file. Table space TEMPSPACE1 is created in this partition group.

IBMCATGROUP: This partition group contains the catalog tables (table space SYSCATSPACE). It includes only the database's catalog partition. This partition group cannot be modified.

To create new database partition groups, use the CREATE DATABASE PARTITION GROUP statement. This statement creates the database partition group within the database, assigns the database partitions that you specified to the partition group, and records the partition group definition in the database system catalog tables.

The following statement creates partition group pgrpall on all partitions specified in the db2nodes.cfg file:

   CREATE DATABASE PARTITION GROUP pgrpall ON ALL DBPARTITIONNUMS

To create a database partition group pg23 consisting of partitions 2 and 3, issue this statement:

   CREATE DATABASE PARTITION GROUP pg23 ON DBPARTITIONNUMS (2,3)

Other relevant partition group statements/commands are:

   ALTER DATABASE PARTITION GROUP (statement to add or drop a partition in the group)
   DROP DATABASE PARTITION GROUP (statement to drop a partition group)
   LIST DATABASE PARTITION GROUPS (command to list all your partition groups; note that IBMTEMPGROUP is never listed)

Buffer Pools in a DPF Environment

Figure 2.11 shows buffer pools defined across all of the database partitions. Interpreting this figure for buffer pools is different than for the other objects, because the data cached in the buffer pools is not partitioned as the figure implies. Each buffer pool in a DPF environment holds data only from the database partition where the buffer pool is located.

You can create a buffer pool in a partition group using the CREATE BUFFERPOOL statement with the DATABASE PARTITION GROUP clause. This means that you have the flexibility to define the buffer pool on the specific partitions defined in the partition group. In addition, the size of the buffer pool on each partition in the partition group can be different. The following statement creates buffer pool bpool_1 in partition group pg234, which consists of partitions 2, 3, and 4:

   CREATE BUFFERPOOL bpool_1 DATABASE PARTITION GROUP pg234
      SIZE 10000 EXCEPT ON DBPARTITIONNUM (3 TO 4) SIZE 5000

Partition 2 in partition group pg234 will have buffer pool bpool_1 defined with a size of 10,000 pages, and partitions 3 and 4 will have a buffer pool of 5,000 pages. As an analogy, think of it as if you were issuing the CREATE BUFFERPOOL statement on each partition separately, with the same buffer pool name for each partition but with different sizes. That is:

   On partition 2: CREATE BUFFERPOOL bpool_1 SIZE 10000
   On partition 3: CREATE BUFFERPOOL bpool_1 SIZE 5000
   On partition 4: CREATE BUFFERPOOL bpool_1 SIZE 5000

Note that we use these statements only to clarify the analogy; they will not work as written. Executing each of these statements as shown would attempt to create the same buffer pool on all partitions. It is not equivalent to using the DATABASE PARTITION GROUP clause of the CREATE BUFFERPOOL statement.

Buffer pools can also be associated with several partition groups. This means that the buffer pool definition will be applied to the partitions in those partition groups.
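As a sketch of that last point (the partition group names pg01 and pg23 are hypothetical and assumed to exist already), a single CREATE BUFFERPOOL statement can name several partition groups:

   CREATE BUFFERPOOL bpool_2 DATABASE PARTITION GROUP pg01, pg23 SIZE 8000

The definition of bpool_2 is then applied to every partition belonging to either pg01 or pg23.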

Table Spaces in a Partitioned Database Environment

You can create a table space in specific partitions, associating it with a partition group, by using the CREATE TABLESPACE statement with the IN DATABASE PARTITION GROUP clause. This gives users flexibility as to which partitions will actually store their tables. In a partitioned database environment with three servers, one partition per server, the statement

   CREATE TABLESPACE mytbls IN DATABASE PARTITION GROUP pg234
      MANAGED BY SYSTEM USING ('/data') BUFFERPOOL bpool_1

creates the table space mytbls, which spans partitions 2, 3, and 4 (assuming pg234 is a partition group consisting of these partitions). In addition, the table space is associated with buffer pool bpool_1, defined earlier. Note that creating a table space will fail if you provide conflicting partition information for the table space and the associated buffer pool. For example, if bpool_1 was created for partitions 5 and 6, and table space mytbls was created for partitions 2, 3, and 4, you would get an error message when trying to create this table space.

The Coordinator Partition

In simple terms, the coordinator partition is the partition to which the application connects. In general, each database connection has a corresponding DB2 agent handling the application connection. An agent can be thought of as a process (Linux or UNIX) or thread (Windows) that performs DB2 work on behalf of the application. There are different types of agents. One of them, the coordinator agent, communicates with the application, receiving requests and sending replies. It can either satisfy the request itself or delegate the work to multiple subagents. The coordinator partition of a given application is the partition where the coordinator agent exists. You use the SET CLIENT CONNECT_NODE command to set the partition that is to be the coordinator partition. Any partition can potentially be a coordinator, so in Figure 2.11 we do not label any particular partition as the coordinator node. If you would like to know more about DB2 agents and the DB2 process model, refer to Chapter 15, The DB2 Process Model.

Issuing Commands and SQL Statements in a DPF Environment

Imagine that you have twenty physical servers with two database partitions on each. Issuing individual commands to each physical server or partition would be quite a task. Fortunately, DB2 provides a command that executes on all database partitions.

The db2_all Command

Use the db2_all command when you want to execute a command or SQL statement against all database partitions. For example, to change the db cfg parameter LOGFILSIZ for the database sample in all partitions, you would use

   db2_all ";db2 UPDATE DB CFG FOR sample USING LOGFILSIZ 500"

When the semicolon (;) character is placed before the command or statement, the request runs in parallel on all partitions.

NOTE: In partitioned environments, the operating system command rah performs commands on all servers simultaneously. The rah command works per server, while the db2_all command works per database partition. The rah and db2_all commands use the same characters. For more information about the rah command, refer to your operating system manuals.

In partitioned environments, pureXML support is not yet available; therefore, this section does not mention XQuery statements.

Using Database Partition Expressions

In a partitioned database, database partition expressions can be used to generate values based on the partition number found in the db2nodes.cfg file. This is particularly useful when you have a large number of database partitions and when more than one database partition resides on the same physical server, because the same device or path cannot be specified for all partitions. You can manually specify a unique container for each database partition or use database partition expressions. The following example illustrates the use of database partition expressions.

These are sample contents of a db2nodes.cfg file (in a Linux or UNIX environment):

   0 myservera 0
   1 myservera 1
   2 myserverb 0
   3 myserverb 1

This shows two servers with two database partitions each. The command

   CREATE TABLESPACE ts2 MANAGED BY DATABASE
      USING (FILE '/data/ts2/container $N+100')

creates the following containers:

   /data/ts2/container100 on database partition 0
   /data/ts2/container101 on database partition 1
   /data/ts2/container102 on database partition 2
   /data/ts2/container103 on database partition 3

You specify a database partition expression with the argument $N (note that there must be a space before $N in the command). Table 2.2 shows other arguments for creating containers. Operators are evaluated from left to right, and % represents the modulus (the remainder of a division). Assuming the partition number to be evaluated is 3, the value column in Table 2.2 shows the result of resolving each database partition expression.

Table 2.2 Database Partition Expressions

   Database Partition Expression      Example      Value
   [blank]$N                          $N           3
   [blank]$N+[number]                 $N+100       103
   [blank]$N%[number]                 $N%2         1
   [blank]$N+[number]%[number]        $N+15%13     5
   [blank]$N%[number]+[number]        $N%2+20      21

The DB2NODE Environment Variable

In Section 2.3, The DB2 Environment, we talked about the DB2INSTANCE environment variable used to switch between instances in your database system. The DB2NODE environment variable is used in a similar way, but to switch between partitions on your DPF system. By default, the active partition is the one defined with the logical port number of zero (0) in the db2nodes.cfg file for a server. To switch the active partition, change the value of the DB2NODE variable using the SET command on Windows and the export command on Linux or UNIX. Be sure to issue a terminate command for all connections from any partition to your database after changing this variable, or the change will not take effect.

Using the settings for the db2nodes.cfg file shown in Table 2.3, you have four servers, each with two logical partitions. If you log on to server myserverb, any commands you execute will affect partition 2, which is the one with logical port zero on that server and the default coordinator partition for that server.

Table 2.3 Sample Partition Information

   Partition   Server Name   Logical Port
   0           myservera     0
   1           myservera     1
   2           myserverb     0
   3           myserverb     1
   4           myserverc     0
   5           myserverc     1
   6           myserverd     0
   7           myserverd     1

If you would like to make partition 0 the active partition, make this change on a Linux or UNIX system:

   DB2NODE=0
   export DB2NODE
   db2 terminate

NOTE: You must issue the terminate command, even if there aren't any connections to any partitions.

Note that partition 0 is on server myservera. Even if you are connected to myserverb, you can make a partition on myservera the active one. To determine which partition is the active one, you can issue this statement after connecting to a database:

   db2 "values (current dbpartitionnum)"

Distribution Maps and Distribution Keys

By now you should have a good grasp of how to set up a DPF environment. It is now time to understand how DB2 distributes data across the partitions. Figure 2.16 shows an example of this distribution.

Figure 2.16 (diagram) shows the distribution key of table mytable (columns col1, col2, and col3) being passed through the hashing algorithm, which indexes into the distribution map of partition group pg0123, an array whose entries cycle through p0, p1, p2, p3, p0, p1, p2, p3, and so on.

Figure 2.16 Distributing data rows in a DPF environment

A distribution map is an internally generated array containing 4,096 entries for multipartition database partition groups, or a single entry for single-partition database partition groups. The partition numbers of the database partition group are assigned to the array entries in a round-robin fashion.

A distribution key is a column (or group of columns) that determines the partition on which a particular row of data is physically stored. You define a distribution key explicitly using the CREATE TABLE statement with the DISTRIBUTE BY clause.
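As a minimal sketch (the table, column, and table space names below are hypothetical), a distribution key is declared at table creation time, and the built-in DBPARTITIONNUM scalar function can then be used to see where rows actually landed:

   CREATE TABLE sales (store_id INT, sale_date DATE, amount DECIMAL(9,2))
      IN mytbls DISTRIBUTE BY (store_id)

   -- count the rows stored on each partition
   SELECT DBPARTITIONNUM(store_id) AS part_num, COUNT(*) AS row_count
      FROM sales
      GROUP BY DBPARTITIONNUM(store_id)

Rows with the same store_id value hash to the same distribution map entry and therefore always land on the same partition.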

When you create or modify a database partition group, a distribution map is associated with it. A distribution map, in conjunction with a distribution key and a hashing algorithm, determines which database partition will store a given row of data.

For the example in Figure 2.16, let's assume partition group pg0123 has been defined on partitions 0, 1, 2, and 3. An associated distribution map is automatically created. This map is an array with 4,096 entries containing the values 0, 1, 2, 3, 0, 1, 2, 3, and so on (shown in Figure 2.16 as p0, p1, p2, p3, p0, p1, p2, p3... to distinguish them from the array entry numbers). Let's also assume table mytable has been created with a distribution key consisting of columns col1, col2, and col3. For each row, the distribution key column values are passed to the hashing algorithm, which returns an output number from 0 to 4,095. This number corresponds to one of the entries in the array, which contains the value of the partition number where the row is to be stored. In Figure 2.16, if the hashing algorithm had returned an output value of 7, the row would have been stored in partition p3.

NOTE (V9): Prior to DB2 9, the DISTRIBUTION MAP and DISTRIBUTION KEY terms were known as PARTITIONING MAP and PARTITIONING KEY respectively.

2.7 CASE STUDY: DB2 WITH DPF ENVIRONMENT

Now that you are familiar with DPF, let's review some of the concepts discussed using a simple case study.

Your company is expanding, and it recently acquired two other firms. Since the amount of data will increase approximately threefold, you are wondering if your current single-partition DB2 database server will be able to handle the load, or if DB2 with DPF will be required. You are not too familiar with DB2 with DPF, so you decide to play around with it using your test servers: two SMP servers running Linux with four processors each. The previous DBA, who has left the company, had installed DB2 Enterprise with DPF on these servers. Fortunately, he left a diagram with his design, shown in Figure 2.17.

Figure 2.17 is a combined physical and logical design. When you validate the correctness of the diagram against your system, you note that database mydb1 has been dropped, so you decide to rebuild this database as practice. The instance db2inst1 is still there, as are other databases. These are the steps you follow.

1. Open two telnet sessions, one for each server. From one of the sessions you issue the commands db2stop followed by db2start, as shown in Figure 2.18. The first thing you note is that there is no need to issue these two commands on each partition; issuing them once on any partition will affect all partitions. You can also tell that there are four partitions, since you received a message from each of them.

Figure 2.17 (diagram) shows the previous DBA's design: two Linux SMP machines (CPUs, memory) joined by a communication facility. The first machine hosts database partitions 0 and 1 and the second hosts partitions 2 and 3, each partition listening on logical ports 0 and 1 of its machine. The instance db2inst1, with its environment variables, global- and instance-level profile registries, DBM CFG, system db directory, node directory, and DCS directory, contains database mydb1 with its DB CFG, local db directory, logs, and buffer pool IBMDEFAULTBP. Partition group IBMCATGROUP holds SYSCATSPACE, IBMTEMPGROUP holds TEMPSPACE1, and IBMDEFAULTGROUP holds USERSPACE1; partition group pg23 holds table space MyTbls1 (with Table1 and Index1) and buffer pool BP23 on partitions 2 and 3. Each machine stores the database locally under /db2database/db2inst1/NODE000x/SQL0001 and keeps a local /home/dasusr1 directory, while /home/db2inst1 is shared.

Figure 2.17 DB2 Enterprise with DPF design

   [db2inst1@aries db2inst1]$ db2stop
   (a timestamped SQL1064N message is returned by each of the four partitions)
   SQL1064N  DB2STOP processing was successful.
   [db2inst1@aries db2inst1]$ db2start
   (a timestamped SQL1063N message is returned by each of the four partitions)
   SQL1063N  DB2START processing was successful.
   [db2inst1@aries db2inst1]$

   Figure 2.18 Running the db2stop and db2start commands
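Incidentally, db2stop and db2start can also act on one partition at a time when that is needed; a sketch (partition 3 picked arbitrarily):

   db2stop dbpartitionnum 3
   db2start dbpartitionnum 3

Without the dbpartitionnum option, as in Figure 2.18, the commands act on all partitions at once.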

2. Review the db2nodes.cfg file to understand the configuration of your partitions (see Figure 2.19).

   [db2inst1@aries sqllib]$ pwd
   /home/db2inst1/sqllib
   [db2inst1@aries sqllib]$ more db2nodes.cfg
   0 aries.myacme.com 0
   1 aries.myacme.com 1
   2 saturn.myacme.com 0
   3 saturn.myacme.com 1
   [db2inst1@aries sqllib]$

   Figure 2.19 A sample db2nodes.cfg file

   Using operating system commands, you determine that the home directory for instance db2inst1 is /home/db2inst1, so the db2nodes.cfg file is stored in the directory /home/db2inst1/sqllib. Figure 2.19 shows there are four partitions, two per server. The server host names are aries and saturn.

3. Create the database mydb1. Since you want partition 0 to be your catalog partition, you must issue the CREATE DATABASE command from partition 0. You issue the statement db2 "values (current dbpartitionnum)" to determine which partition is currently active, and find out that partition 3 is the active one (see Figure 2.20).

   [db2inst1@saturn db2inst1]$ db2 "values (current dbpartitionnum)"

   1
   -----------
             3

     1 record(s) selected.

   Figure 2.20 Determining the active partition

4. Next, you change the DB2NODE environment variable to zero (0), as follows (see Figure 2.21):

   DB2NODE=0
   export DB2NODE
   db2 terminate

   In the CREATE DATABASE command you specify the path /db2database, an existing path that has been created locally on all servers so that the data is spread across them.

   [db2inst1@saturn db2inst1]$ DB2NODE=0
   [db2inst1@saturn db2inst1]$ export DB2NODE
   [db2inst1@saturn db2inst1]$ db2 terminate
   DB20000I  The TERMINATE command completed successfully.
   [db2inst1@saturn db2inst1]$ db2 list applications
   SQL1611W  No data was returned by Database System Monitor. SQLSTATE=00000
   [db2inst1@saturn db2inst1]$ db2 create db mydb1 on /db2database
   DB20000I  The CREATE DATABASE command completed successfully.
   [db2inst1@saturn db2inst1]$ db2 connect to mydb1

      Database Connection Information

    Database server        = DB2/LINUX 9.1
    SQL authorization ID   = DB2INST1
    Local database alias   = MYDB1

   [db2inst1@saturn db2inst1]$ db2 "values (current dbpartitionnum)"

   1
   -----------
             0

     1 record(s) selected.

   [db2inst1@saturn db2inst1]$

   Figure 2.21 Switching the active partition, and then creating a database

5. To confirm that partition 0 is indeed the catalog partition, simply issue a list db directory command and look for the Catalog database partition number field under the entry for the mydb1 database. Alternatively, issue a list tablespaces command from each partition; the SYSCATSPACE table space will be listed only on the catalog partition.

6. Create partition group pg23 on partitions 2 and 3. Figure 2.22 shows how to accomplish this and how to list your partition groups. Remember that this listing does not include IBMTEMPGROUP.

7. Create and manage your buffer pools. Issue this statement to create buffer pool BP23 on partition group pg23:

   db2 "create bufferpool BP23 database partition group pg23 size 500"

   Figure 2.23 shows this statement. It also shows you how to associate this buffer pool with another partition group using the ALTER BUFFERPOOL statement. To list your buffer pools and their associated partition groups, you can query the SYSCAT.BUFFERPOOLS catalog view, also shown in Figure 2.23. Note that a buffer pool can be associated with any partition group. Its definition will be applied to all the partitions in the partition group, and you can specify different sizes on the partitions if required.

   [db2inst1@saturn db2inst1]$ db2 "create database partition group pg23 on dbpartitionnum (2 to 3)"
   DB20000I  The SQL command completed successfully.
   [db2inst1@saturn db2inst1]$ db2 "list database partition groups"

   DATABASE PARTITION GROUP
   ---------------------------
   IBMCATGROUP
   IBMDEFAULTGROUP
   PG23

     3 record(s) selected.

   [db2inst1@saturn db2inst1]$

   Figure 2.22 Creating partition group pg23

   [db2inst1@saturn db2inst1]$ db2 "create bufferpool BP23 database partition group pg23 size 500"
   DB20000I  The SQL command completed successfully.
   [db2inst1@saturn db2inst1]$ db2 "alter bufferpool BP23 add database partition group IBMCATGROUP"
   DB20000I  The SQL command completed successfully.
   [db2inst1@saturn db2inst1]$ db2 "select bpname, ngname from syscat.bufferpools"

   BPNAME               NGNAME
   -------------------  -------------------
   IBMDEFAULTBP         -
   BP23                 PG23
   BP23                 IBMCATGROUP

     3 record(s) selected.

   [db2inst1@saturn db2inst1]$

   Figure 2.23 Managing buffer pools

8. Create the table space mytbls1:

   db2 "create tablespace mytbls1 in database partition group pg23 managed by system using ('/data') bufferpool bp23"

9. Create table table1 in table space mytbls1 with a distribution key of col1 and col2:

   db2 "create table table1 (col1 int, col2 int, col3 char(10)) in mytbls1 distribute by (col1, col2)"

10. Create the index index1. Note that this doesn't have any syntax specific to a DPF environment:

   db2 "create index index1 on table1 (col1, col2)"

   The index will be constructed on each partition for its subset of rows.

11. Test the db2_all command to update the database configuration file for all partitions with one command. Figure 2.24 shows an example of this.

And that's it! In this case study you have reviewed some basic statements and commands applicable to the DPF environment. You reviewed the db2stop and db2start commands, determined and switched the active partition, and created a database, a partition group, a buffer pool, a table space, a table with a distribution key, and an index. You also used the db2_all command to update a database configuration file parameter.

   [db2inst1@aries sqllib]$ db2 get db cfg for mydb1 | grep LOGFILSIZ
   Log file size (4KB)                        (LOGFILSIZ) = 1000
   [db2inst1@aries sqllib]$ db2_all "db2 update db cfg for mydb1 using LOGFILSIZ 500"
   DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.
   aries.myacme.com: db2 update db cfg for mydb1 using LOGFILSIZ 500 completed ok
   DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.
   aries.myacme.com: db2 update db cfg for mydb1 using LOGFILSIZ 500 completed ok
   DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.
   saturn.myacme.com: db2 update db cfg for mydb1 using LOGFILSIZ 500 completed ok
   DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.
   saturn.myacme.com: db2 update db cfg for mydb1 using LOGFILSIZ 500 completed ok
   [db2inst1@aries sqllib]$ db2 get db cfg for mydb1 | grep LOGFILSIZ
   Log file size (4KB)                        (LOGFILSIZ) = 500
   [db2inst1@aries sqllib]$

   Figure 2.24 Using db2_all to update the db cfg file

2.8 IBM BALANCED WAREHOUSE

A data warehouse system is composed of CPUs (or processors), memory, and disk. A balanced mix is the key to optimal performance. Unbalanced, ill-planned configurations will waste a lot of money if the system has

   Too many or too few servers
   Too much or too little storage
   Too much or too little I/O capacity and bandwidth
   Too much or too little memory

To help alleviate these issues, and to ensure that DB2 customers build high-performing data warehouses, IBM has focused on a prescriptive, quality-driven approach through the use of a proven, balanced methodology for data warehousing. This is called the IBM Balanced Warehouse (BW), formerly known as the Balanced Configuration Unit (BCU), and is depicted in Figure 2.25.

Figure 2.25 IBM Balanced Warehouse

The BW applies specifically to DB2 data warehouses. It is composed of a methodology for deployment of the database software, in addition to the IBM hardware and other software components that are required to build a data warehouse. These components are integrated and tested as a preconfigured building block for data warehousing systems.

The prescriptive approach used by the BW eliminates the complexity of database warehouse design and implementation by making use of standardized, end-to-end, stack-tested designs and best practices that increase the overall quality and manageability of the data warehouse. This modular approach ensures that there is a balanced amount of disk, I/O bandwidth, processing power, and memory to optimize the cost-effectiveness and throughput of the database.

While you might start off with one BW, if you have a large database you might also start off with several of these BW building blocks in a single system image. As the database grows, you can simply add additional building blocks (BWs) to handle the increased data or users, as shown in Figure 2.26.

Figure 2.26 (diagram) shows a row of building blocks labeled BW1, BW2, through BWn.

Figure 2.26 BW as building blocks

The main advantages of the BW are

   best-of-breed IBM components selected for optimum price/performance
   repeatable, scalable, consistent performance that can grow as business needs grow
   a prescriptive, fully validated and tested best-practices design that reduces the time and removes the risk of building a business intelligence solution

2.9 SUMMARY

This chapter provided an overview of the DB2 core concepts using a big-picture approach. It introduced SQL statements and their classification as Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL) statements. XQuery, with XPath and the FLWOR expression, was also introduced.

DB2 commands were classified into two groups, system commands and CLP commands, and several examples were provided, such as the command to start an instance, db2start.

An interface is needed to issue SQL statements, XQuery statements, and commands to the DB2 engine. This interface is provided by the DB2 tools available with the product. Two text-based interfaces were mentioned, the Command Line Processor (CLP) and the Command Window. The Control Center was noted as being one of the most important graphical administration tools, while the Data Server Administration Control was presented as the next generation of graphical administration tooling.

This chapter also introduced the concepts of instances, databases, table spaces, buffer pools, logs, tables, indexes, and other database objects in a single-partition system. The different levels of configuration for the DB2 environment were presented, including environment variables, DB2 registry variables, and configuration parameters at the instance (dbm cfg) and database (db cfg) levels. DB2 has federation support for queries referencing tables residing in other databases in the DB2 family.

The chapter also covered the Database Partitioning Feature (DPF) and the concepts of database partition, catalog partition, coordinator node, and distribution map on a multipartition system. The IBM Balanced Warehouse (BW) was also introduced.

Two case studies reviewed the single-partition and multipartition environments respectively, which should help you understand the topics discussed in the chapter.

2.10 REVIEW QUESTIONS

1. How are DB2 commands classified?
2. What is a quick way to obtain help information for a command?
3. What is the difference between using the Information Center tool and simply reviewing the DB2 manuals?
4. What command is used to create a DB2 instance?

5. How many table spaces are automatically created by the CREATE DATABASE command?
6. What command can be used to get a list of all instances on your server?
7. What is the default instance that is created on Windows?
8. Is the DAS required to be running to set up a remote connection between a DB2 client and a DB2 server?
9. How can the DB2 environment be configured?
10. How is the local database directory populated?
11. Which of the following commands will start your DB2 instance?
   A. startdb
   B. db2 start
   C. db2start
   D. start db2
12. Which of the following commands will list all of the registry variables that are set on your server?
   A. db2set -a
   B. db2set -all
   C. db2set -lr
   D. db2set -ltr
13. Say you are running DB2 on a Windows server with only one hard drive (C:). If the DB2 instance is dropped using the db2idrop command, after re-creating the DB2 instance, which of the following commands will list the databases you had prior to dropping the instance?
   A. list databases
   B. list db directory
   C. list db directory all
   D. list db directory on C:
14. If the list db directory on C: command returns the following:

   Database alias                      = SAMPLE
   Database name                       = SAMPLE
   Database directory                  = SQL00001
   Database release level              = a.00
   Comment                             =
   Directory entry type                = Home
   Catalog database partition number   = 0
   Database partition number           = 0

which of the following commands must be run before you can access tables in the database?
   A. catalog db sample
   B. catalog db sample on local
   C. catalog db sample on SQL00001
   D. catalog db sample on C:

15. If there are two DB2 instances on your Linux server, inst1 and inst2, and your default DB2 instance is inst1, which of the following commands allows you to connect to databases in the inst2 instance?
   A. export inst2
   B. export instance=inst2
   C. export db2instance=inst2
   D. connect to inst2
16. Which of the following DB2 registry variables optimizes interpartition communication if you have multiple partitions on a single server?
   A. DB2_OPTIMIZE_COMM
   B. DB2_FORCE_COMM
   C. DB2_USE_FCM_BP
   D. DB2_FORCE_FCM_BP
17. Which of the following tools is used to run commands on all partitions in a multipartition DB2 database?
   A. db2_part
   B. db2_all
   C. db2_allpart
   D. db2
18. Which of the following enables federated support in your server?
   A. db2 update db cfg for federation using FEDERATED ON
   B. db2 update dbm cfg using FEDERATED YES
   C. db2 update dbm cfg using NICKNAME YES
   D. db2 update dbm cfg using NICKNAME, WRAPPER, SERVER, USER MAPPING YES
19. Which environment variable needs to be updated to change the active logical database partition?
   A. DB2INSTANCE
   B. DB2PARTITION
   C. DB2NODE
   D. DB2PARTITIONNUMBER
20. Which of the following statements can be used to determine the value of the current active database partition?
   A. values (current dbpartitionnum)
   B. values (current db2node)
   C. values (current db2partition)
   D. values (current partitionnum)



George Baklarz
Paul C. Zikopoulos

DB2 9 for Linux, UNIX, and Windows
DBA Guide, Reference, and Exam Prep, Sixth Edition

DB2 9 builds on the world's number one enterprise database to simplify the delivery of information as a service, accelerate development, and dramatically improve operational efficiency, security, and resiliency. Now, this new edition offers complete, start-to-finish coverage of DB2 9 administration and development for Linux, UNIX, and Windows platforms, as well as authoritative preparation for the latest IBM DB2 certification exam. Written for both DBAs and developers, this definitive reference and self-study guide covers all aspects of deploying and managing DB2 9, including DB2 database design and development; day-to-day administration and backup; deployment of networked, Internet-centered, and SOA-based applications; migration; and much more. You'll also find an unparalleled collection of expert tips for optimizing performance, availability, and value.

COVERAGE INCLUDES
Important security and resiliency enhancements, including advanced access control; fine-grained, label-based security; and the new security administrator role
Breakthrough pureXML features that make it easier to succeed with service-oriented architecture
Operational improvements that enhance DBA efficiency, including self-tuning memory allocation, automated storage management, and storage optimization
Table-partitioning features that improve scalability and manageability
Powerful improvements for more agile and rapid development, including the new Eclipse-based Developer Workbench and simple SQL or XQuery access to all data

Whatever your role in working with DB2 or preparing for certification, DB2 9 for Linux, UNIX, and Windows, Sixth Edition is the one book you can't afford to be without.

hardcover, 1136 pages

Table of Contents
Foreword
Preface
PART ONE: Introduction to DB2
Chapter 1: Product Overview
Chapter 2: Getting Started
Chapter 3: Getting Connected
Chapter 4: Controlling Data Access
PART TWO: Using SQL
Chapter 5: Database Objects
Chapter 6: Manipulating Database Objects
Chapter 7: Advanced SQL
Chapter 8: pureXML Storage Engine
Chapter 9: Development SQL
Chapter 10: Concurrency
PART THREE: DB2 Administration
Chapter 11: Data Storage Management
Chapter 12: Maintaining Data
Chapter 13: Database Recovery
Chapter 14: Monitoring and Tuning
PART FOUR: Developing Applications
Chapter 15: Application Development Overview
Chapter 16: Development Considerations
PART FIVE: Appendices
Appendix A: DB2 9 Certification Test Objectives
Appendix B: DB2DEMO Installation
INDEX

CHAPTER 8

pureXML Storage Engine

XML OVERVIEW
XML DATA TYPE AND STORAGE
XPATH AND XQUERY
SQL/XML FUNCTIONS
ADDITIONAL CONSIDERATIONS

The DB2 9 pureXML technology unlocks the latent potential of XML data by providing simple, efficient access to it with the same levels of security, integrity, and resiliency taken for granted with relational data, thereby allowing you to seamlessly integrate XML and relational data. DB2 9 stores XML data in a hierarchical structure that naturally reflects the structure of XML. This structure, along with innovative indexing techniques, allows DB2 to efficiently manage this data and eliminate the complex and time-consuming parsing typically required for XML.

This chapter details the pureXML feature available in DB2 9, as well as the advantages it has over alternative approaches to storing XML in a data server environment.

pureXML Feature Pack

DB2 9 includes an optional add-on feature pack called pureXML. For all editions of DB2 9 except DB2 9 Express-C, you have to purchase the pureXML feature in addition to a base data server license. You can learn all about the DB2 9 editions, their corresponding feature packs, and guidance on which edition to choose for your business on the IBM DB2 9 Web site. For this reason, and more, it's important to understand not only the technology that pureXML provides, but also the alternative solutions available, in order to justify its use.

Before pureXML: How XML Is Traditionally Stored

pureXML technology is truly unique in the marketplace. To understand just how unique it is, it helps to review the traditional options for storing XML. Before pureXML, XML could be stored in an XML-only database, shredded into a relational format, stored intact in a relational database within a large object (LOB), or stored on a file system (unfortunately the most common approach). These options, and their respective trade-offs, are shown in Figure 8-1.

Figure 8-1 XML storage options

The XML-only Database

For the longest time there have been a number of niche corporations that offer XML-only databases. While XML databases store XML as XML, their weakness is their only strength: they can do only that, store XML. In other words, they create a silo of information that ultimately requires more work and engineering to integrate with your relational systems. As more and more businesses move toward XML, challenges arise around integration with the relational data (which isn't going away; rather, it will be complemented by XML). An XML-only database ultimately adds unforeseen costs to an environment because different programming models, administrative skill sets, servers, maintenance plans, and more are implicitly tagged to such a data storage solution. What's more, end users will experience

different service levels, since there is no capability to unify a service level across separate servers. Database administrators (DBAs) live and breathe by their service-level agreements (SLAs), and the separation of data servers based on the semantic structure of the data is an obstacle to sustainable SLA adherence, among other efficiencies.

Storing XML in a File System

Perhaps the most common way to store XML is on a file system, which is a risky practice considering today's regulatory compliance requirements. The biggest problem associated with storing critical XML data on a file system is that file systems can't offer the Atomic, Consistent, Isolated, and Durable (ACID) properties of a database. Consider the following questions:

   How can file systems handle multiple changes that impact each other?
   What is the richness of the built-in recovery mechanism in case of a failure? (What about point-in-time recovery and recovery point objectives?)
   How do you handle complex relationships between your data artifacts in a heterogeneous environment?
   What type of facilities exist for access parallelism?
   Is there an abstracted data access language like XQuery or SQL?
   How is access to the same data handled?

Furthermore, storing your XML on a file system and then interacting with it from your application requires a lot of hand coding and is prone to errors. Quite simply, the decision to store critical XML data on a file system forces you to accept the same risks and issues associated with storing critical data in Excel spreadsheets: a serious issue with data integrity and quality.

How did all that XML get stored on a file system in the first place, you may ask? Simple: application developers. If you're a DBA and you don't sense the need for XML, you need to talk to the application development team. You're bound to find many reasons why you need to take an active role in the decisions that affect the storage location of XML data in your enterprise.

XML Within a LOB in a Relational Database

One option provided by most of today's relational database vendors is to take the XML data, leave it as XML, and place it in a LOB column within a relational table. This solves a number of issues, yet it presents new ones too. When the XML is placed in a LOB, you do indeed unify XML and the relational data. And in most cases, you can take advantage of what XML was designed for, flexibility, because you are not tightly bound to a relational schema within the LOB

container. The problem with storing XML data in a LOB is that you pay a massive performance price for this flexibility, because you have to parse the XML when it's required for query. It should be noted that this storage method may work well for large documents that are intended to be read as a whole (for example, a book chapter). However, by and large, XML is moving to facilitate transactional processing, as evidenced by the growing number of standards that have emerged to support it. For example, FIXML is a standard XML-based markup language for financial transactions. Other vertically aligned XML-based markup languages include:

- Justice XML Data Dictionary (JXDD) and Justice XML Registry and Repository (JXRR), and their corresponding RapSheet.xsd, DriverHistory.xsd, and ArrestWarrent.xsd XML Schema Definition (XSD) documents
- Air Quality System Schema (Asbestos Demolition & Removal Schema)
- Health Level 7 (HL7) for patient management, diagnosis, treatments, prescriptions, and so on
- Interactive Financial Exchange (IFX) standard for trades, banking, consumer transactions, and so on
- ACORD standard for policy management such as underwriting, indemnity, claims, and so on
- IXRetail standard for inventory, customer transaction, and employee management

and many, many more.

As companies store massive orders in XML and want to retrieve specific order details, they'll be forced to parse all of the orders with this storage mechanism. For example, assume a financial trading house is sent a block of stock trades on the hour for real-time input into a fraud detection system. To maintain the XML, the data is placed in a LOB. If a specific request came in to take a closer look at trades by Mr. X, the database manager would have to parse all the trades in the block to find Mr. X's trades. Imagine this overhead if Mr. X made a single trade in the first hour and the block contained 100,000 trades. That's a lot of overhead to locate a single transaction.

It quickly becomes obvious that while LOB storage of XML unifies the data, it presents a new issue, namely performance. There are others, depending on a vendor's implementation of XML storage, but this is the main issue you should be aware of.

XML Shredded to a Table in a Relational Database

Another option provided by most of today's relational database vendors is to take XML data and shred it into a relational table. In this context, shred means that there

is some mapping that takes the data from the XML format and places it into a relational table. Consider the following example (Figure 8-2):

Figure 8-2 XML shredding

In the previous example you can see that a single XML document has been shredded into two different relational tables: DEPARTMENT and EMPLOYEE. This example illustrates 3rd Normal Form (no repeating elements). In this storage mechanism, XML data is unified with its relational counterpart, since it's actually stored with the relational data. Additionally, there is no parsing overhead, since data is retrieved as regular relational data via SQL requests.

From a performance perspective, shredding XML to relational data seems to solve the aforementioned issues; however, there are a number of disadvantages with this approach as well. XML is all about flexibility; it's the key benefit of this technology. Application developers (big proponents of XML) like the flexibility to alter a schema without the long change process that's typically well understood by DBAs. Consider this example: If a new business rule established that the Human Resources department now required employees' cellular phone numbers for business emergencies, what ill effects would such a change present to XML data whose storage mechanism was to shred it to a set of relational tables?

First, the mapping (not shown in Figure 8-2) between the XML data and the relational tables would have to be updated; this could cause application changes, testing requirements, and more. Of course, to maintain 3rd Normal Form you may

actually be forced to create a new table for phone numbers and create referential integrity rules to link them together, as shown below (Figure 8-3):

Figure 8-3 XML schema changes

You can see in this example that the flexibility promise associated with XML is quickly eroded when shredding is used for storage, because a change to the XML Schema requires a change to the underlying relational schema. In addition, a change to the XML Schema requires a change to the mapping tier between the XML document and the relational database. Also consider that when shredded, access to the XML data is via SQL, meaning that the newly engineered language specific to querying XML data (XQuery) must be translated into SQL in this model. This can cause inefficiencies, loss of function, and more. Essentially, the shredding model solves the performance issue (although there is a performance penalty to rebuild the shredded data as XML), but it presents major flexibility issues that negate the most significant benefit of XML.
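To make the schema-evolution cost concrete, here is a minimal SQL sketch of the kind of shredded schema Figures 8-2 and 8-3 imply; the table and column names are hypothetical, not taken from the figures:

-- Hypothetical shredded schema (names are illustrative only)
CREATE TABLE DEPARTMENT (
  DEPTID   INTEGER NOT NULL PRIMARY KEY,
  DEPTNAME VARCHAR(30)
);

CREATE TABLE EMPLOYEE (
  EMPID  INTEGER NOT NULL PRIMARY KEY,
  DEPTID INTEGER REFERENCES DEPARTMENT (DEPTID),
  NAME   VARCHAR(30),
  OFFICE INTEGER
);

-- The new business rule (multiple phone numbers per employee) forces a
-- third table plus referential integrity, mapping updates, and retesting:
CREATE TABLE PHONE (
  EMPID       INTEGER NOT NULL REFERENCES EMPLOYEE (EMPID),
  PHONETYPE   VARCHAR(10),
  PHONENUMBER VARCHAR(20)
);

Every time the XML Schema gains an element, DDL like the PHONE table above, along with mapping-layer changes and application retests, has to follow.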

The Difference: pureXML

The LOB and shredded storage formats are used by every relational database vendor today, including DB2's XML Extender (which is available for free in DB2 9). The underlying architecture (the storage of the XML) is often overlooked because an XML data type is often presented at the top of the architecture, hence the term native XML. As a DBA, you should be sure to look beneath the veneer of a data type for true efficiencies when storing your XML data. This section introduces you to the pureXML feature and how to use it in DB2 9.

How DB2 Stores XML in a pureXML Column

pureXML introduces a new storage service that's completely transparent to applications and DBAs, yet it provides XML storage services that don't face any of the trade-offs associated with the XML storage methods discussed in the previous section.

Note
You can quickly get access to sample XML tables and data by creating the DB2 SAMPLE database with the XML extensions. When using UNIX-based platforms, at the operating system command prompt, issue sqllib/bin/db2sampl -dbpath <path> from the home directory of the database manager instance owner, where path is an optional parameter specifying the path where the SAMPLE database is to be created. If the path parameter is not specified, the sample database is created in the default path specified by the DFTDBPATH parameter in the database manager configuration file. The schema for DB2SAMPL is the value of the CURRENT SCHEMA special register.

When using a Windows platform, at the operating system command prompt, issue db2sampl -dbpath e, where e is an optional parameter specifying the drive where the database is to be created. If the drive parameter is not specified, the sample database is created on the same drive as DB2.

The pureXML feature in DB2 stores XML data on disk as XML in a query-rich, DOM-like parsed tree format. This structure is similar to that of the XML-only databases and provides massive performance and flexibility benefits over the other approaches, and because it's within DB2, it's integrated with the relational data. At

the same time, relational data is stored in regular rows and columns. In other words, DB2 has two storage services: one for XML and one for relational data. Both of these storage mechanisms are transparent to DBAs. An example is shown below (Figure 8-4):

Figure 8-4 Transparent access to DB2's relational and XML storage services

In the previous figure you can see just how transparent these storage services are. If you are querying the database with SQL, you can access both XML and relational data (shown as rdb in the figure), or mix SQL and XQuery together. Likewise, you can use XQuery to get to both XML and relational data, or mix them together as well. DB2 is able to support this language flexibility because both access APIs are compiled down to an intermediate, language-neutral representation called the Query Graph Model (QGM). The DB2 optimizer then takes the QGM and generates an optimized access plan for data access.

The on-disk parsed representation of pureXML columns is produced by the Xerces open source parser and is stored in the UTF-8 format regardless of the document encoding, the locale, or the codepage of the database. The on-disk XML is often referred to as a tree. Branches of this tree represent elements, attributes, or text entities that contain names, namespaces, namespace prefixes (stringIDs), type annotations (pathIDs), pointers to parent elements, pointers to child elements, and possibly in-lined values.

The on-disk storage unit for DB2 is still a data page for XML, just as with relational data. This means that if an XML document or fragment can't fit on a single page (the maximum size is 32 KB), it has to be chained to other pages to create the document.

XML documents that don't fit on a single page must be split into multiple pages (referred to as regions in the XML tree). So if your table space was using a 4 KB page size and your XML document was 12 KB in size after parsing, it would require three separate data pages to store it. For example, consider the previously mentioned scenario with a single XML document of 12 KB. In this case, the XML document's on-disk representation may look like Figure 8-5:

Figure 8-5 XML spread across multiple pages

If DB2 needs to find XML data that's stored on the same data page (for example, looking for data in the darker nodes), it can do so at pointer-traversal speeds. However, what happens if data is requested from the XML document that resides on a different data page (region)? When the on-disk representation is created for your XML document, DB2 automatically creates (and maintains) a regions index such that portions of the same XML document that reside on different pages can be quickly located and returned to the database engine's query processing, as shown in Figure 8-6.

Figure 8-6 XML regions

Again, this index is not DBA-defined; it's a default component of the pureXML storage services. From the previous figure you can see that nodes are physically connected on the same data page, and regions (different pages) are logically connected and appropriately indexed.

Because an on-disk parsed representation of the XML data typically takes up more space than if the XML document were stored in an unparsed format (plan for on-disk XML documents in pureXML to be at least twice as big as their serialized counterparts), pureXML columns automatically compress their data. (This compression shouldn't be confused with the Storage Optimization feature's Deep Compression capabilities; rather, it's an internal automatic compression algorithm.) The reason that a pureXML column takes up extra space is that it sits on the disk in a parsed format such that query response times are incredibly fast, even for sub-document retrieval. In other words, it doesn't have to incur the parsing overhead associated with the LOB storage method, and it's never shredded (so all its flexibility is maintained), which was the issue with the shredding method.

With a pureXML column you can see that there's a trade-off as well; after all, nothing in the world is free. pureXML columns offer you better performance, function, and flexibility than any other storage mechanism, but at the cost of more disk space. Considering that the per-MB cost of disk space has been decreasing at an incredible rate, this seems like a good trade-off to make.

To see how DB2 handles XML data on its way to the on-disk format, start with an XML document like the one below (Figure 8-7):

<dept>
  <employee id="901">
    <name>John Doe</name>
    <phone> </phone>
    <office>344</office>
  </employee>
  <employee id="902">
    <name>Peter Pan</name>
    <phone> </phone>
    <office>216</office>
  </employee>
</dept>

Figure 8-7 XML fragment

The XML document in the previous form is considered serialized. When it's parsed, its representation turns into something that an application like DB2 can use. There are two mainstream types of parsers in today's market: DOM and SAX. A DOM-parsed XML document takes up more disk space and is slower to get into its parsed format when compared to XML data that is SAX parsed. However, the DOM format is faster for queries because the on-disk format is optimized for query retrieval. A SAX parser is an event parser; it creates a stream of events that correlates to the XML document. A SAX-parsed XML document isn't as efficient for queries, in that it typically takes some extra work to get it to perform; however, it parses faster than its DOM counterpart and takes up less space.

DB2 9 attempts to combine the best of both worlds by SAX parsing the XML document and storing it in a DOM-like structure on disk. For example, the previous XML document after being parsed by DB2, but before DB2 compressed the data, would look like this (Figure 8-8):

Figure 8-8 Pre-compressed XML structure

After DB2 9 compresses the XML document (its on-disk format), it would look like this (Figure 8-9):

Figure 8-9 Compressed XML structure

You can see in the previous figure that a table is created to store the string values that are compressed. Essentially the compression algorithm converts tag names into integers, and the translation of this representation is held in an internal table called SYSIBM.SYSXMLSTRINGS (although you don't need to interact with this table). DB2's XML compression is efficient because not only does it reduce storage requirements (which should be obvious from the previous figure), it also creates an environment for faster string comparisons and navigation of the hierarchy, because it can do so via integer math, which is generally faster than string comparisons.

Creating an XML-enabled Database

Before you can work with pureXML columns, you need to create a database that's capable of storing them. At the time DB2 9 became generally available, pureXML columns could only be stored in databases that were created to support the UTF-8 Unicode format. You can create a UTF-8 Unicode-enabled database in DB2 9 by specifying the USING CODESET UTF-8 option from the DB2 CLP (Example 8-1):

CREATE DATABASE MYXML AUTOMATIC STORAGE YES
  ON 'D:\' DBPATH ON 'D:\'
  USING CODESET UTF-8 TERRITORY US
  COLLATE USING SYSTEM
  PAGESIZE 4096
  WITH 'This is my XML Database'

Example 8-1 Creating an XML-enabled database
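If you need to confirm that an existing database can accept pureXML columns, one quick check (a sketch; the exact output layout varies by platform and fix pack) is to look for the UTF-8 code set in the database configuration; the database name here assumes Example 8-1:

db2 connect to myxml
db2 get db cfg for myxml

In the configuration output, look for a line similar to:

Database code set = UTF-8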

There are some features in DB2 that do not support pureXML. For example, the Database Partitioning Feature (DPF) and the Storage Optimization Feature (for deep row compression) currently do not support pureXML columns.

The other option is to use the Create Database Wizard, as shown below (Figure 8-10):

Figure 8-10 Using the Create Database Wizard to create an XML-enabled database

Note
Future versions of DB2 may alleviate the restriction of only supporting XML columns in Unicode (UTF-8) databases. This information is current as of DB2 9 and Fix Pack 2.

The following database will be used in the examples throughout this chapter (Example 8-2):

CREATE DATABASE MYXML AUTOMATIC STORAGE YES
  USING CODESET UTF-8 TERRITORY US

Example 8-2 Sample XML database

Creating Tables with pureXML

Once a database has been enabled for pureXML, you can create tables that store XML data in the same manner that you create tables with only relational data. One of the great flexibility benefits of pureXML is that it doesn't require supporting relational columns to store XML. You can choose to create a table that has a single XML column or multiple XML columns, a table that has a relational column with one or more XML columns, and so on. For example, any of the following CREATE TABLE statements are valid (Example 8-3):

CREATE TABLE XML1 (
  ID INTEGER NOT NULL,
  CUSTOMER XML,
  CONSTRAINT RESTRICTID PRIMARY KEY(ID)
);

CREATE TABLE XML2 (
  CUSTOMER XML,
  "ORDER" XML
);

Example 8-3 Tables with XML columns

You can create tables with pureXML columns using either the DB2 CLP or the Control Center, as shown in Figure 8-11. The XML data type integration extends beyond the DB2 tools and into popular integrated development environments (IDEs) like IBM Rational Application Developer, Microsoft Visual Studio 2005, Quest TOAD, and more. This means that you can work with pureXML columns from whatever tool or facility you feel most comfortable with.

Figure 8-11 Using the Create Table Wizard to create an XML column

The examples in this chapter assume that the following table has been created in an XML-enabled database (Example 8-4):

CREATE TABLE COMMUTERS (
  ID INTEGER NOT NULL,
  CUSTOMER XML,
  CONSTRAINT RESTRICTID PRIMARY KEY(ID)
);

Example 8-4 Table used to demonstrate XML capabilities

This table will be used to store commuter information (based in XML) alongside an ID that represents the Web site's customer who is building the commuter community (the names of the commuters are purposely not entered into the database). Commuter information is retrieved from various customers who want to carpool to work, and their information is received via a Web form and passed to DB2 via XML: a typical Web application.
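As a quick sanity check, you can confirm from the CLP that the CUSTOMER column really is of type XML; this is a sketch, with the DESCRIBE output trimmed and approximated:

DESCRIBE TABLE COMMUTERS

Column name   Type schema   Type name   Length   Scale   Nulls
ID            SYSIBM        INTEGER     4        0       No
CUSTOMER      SYSIBM        XML         0        0       Yes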

Inserting Data into pureXML Columns

Once you have a pureXML column defined in a table, you can work with it, for the most part, as with any other table. This means that you may want to start by populating the pureXML column with an XML document or fragment. Consider the following XML document (Example 8-5):

<?xml version="1.0" encoding="UTF-8"?>
<customerinfo xmlns="..."
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="... C:\Temp\Example\Customer.xsd">
  <CanadianAddress>
    <Address>434 Rory Road</Address>
    <City>Toronto</City>
    <Province>Ontario</Province>
    <PostalCode>M5L1C7</PostalCode>
    <Country>Canada</Country>
  </CanadianAddress>
  <CanadianAddress>
    <Address>124 Seaboard Gate</Address>
    <City>Whitby</City>
    <Province>Ontario</Province>
    <PostalCode>L1N9C3</PostalCode>
    <Country>Canada</Country>
  </CanadianAddress>
</customerinfo>

Example 8-5 Sample XML document

Because this is a well-formed XML document, you can view it in an XML-enabled browser, such as Internet Explorer (Figure 8-12).

Figure 8-12 XML document displayed in a browser

You can use the SQL INSERT statement to add this data to the COMMUTERS table in the same manner that you would add it to a table with only relational data. The data inserted into the pureXML column must be a well-formed XML document (or fragment), as defined by the XML 1.0 specification. For example, to insert the working example XML document using the CLP, issue the following command (Example 8-6):

INSERT INTO COMMUTERS VALUES (1,
'<?xml version="1.0" encoding="UTF-8"?>
<customerinfo xmlns="..." ...>
  ...
</customerinfo>')

Example 8-6 Inserting an XML document

The full XML document is placed between the two single quotes. You can see that this data is now in the pureXML column (Figure 8-13):

Figure 8-13 Viewing XML data in the Command Editor
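If you prefer the CLP to the graphical tools, one way to view what was just stored (a sketch, assuming the COMMUTERS table from Example 8-4) is to serialize the XML column back into character form:

SELECT ID,
       XMLSERIALIZE(CUSTOMER AS CLOB(1M)) AS CUSTOMER_DOC
  FROM COMMUTERS

XMLSERIALIZE converts the internal parsed representation back into a serialized string so that any client can display it.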

Since pureXML provides you with the utmost flexibility for storing XML, you don't have to store just full XML documents as in Example 8-6; you can store a fragment of that document in the same table. For example, to add a fragment of the working example document (one full <CanadianAddress> block), you could enter the following command (Example 8-7):

INSERT INTO COMMUTERS VALUES (2,
'<CanadianAddress>
  <Address>434 Rory Road</Address>
  <City>Toronto</City>
  <Province>Ontario</Province>
  <PostalCode>M5L1C7</PostalCode>
  <Country>Canada</Country>
</CanadianAddress>')

Example 8-7 Inserting an XML fragment

Now the COMMUTERS table has two rows with XML data: one with a full XML document and the other with an XML fragment (Figure 8-14):

Figure 8-14 Displaying the XML fragment

The previous figure visualizes the XML data in the COMMUTERS table using the DB2 9 Command Editor, a tool that is included as part of the DB2 Control Center. If you recall, DB2 must store well-formed XML documents and fragments, although you have the option of validating them against XML Schema Definition (XSD) documents.

If you tried to insert a non-well-formed XML document or fragment into a pureXML column, you would receive an error, as shown in Figure 8-15.

Figure 8-15 Invalid XML fragment

In the previous example you can see that the <Address> element isn't closed properly: it uses <Address> to close it instead of the proper </Address> tag. Because this element isn't closed properly, as per the standard, this fragment isn't considered to be well-formed, and therefore the insert operation fails.

Finally, you should be aware that there are multiple options available when using the INSERT command to insert XML data into a table.

INSERT INTO COMMUTERS VALUES
  (3, XMLPARSE(DOCUMENT
  '<?xml version="1.0" encoding="UTF-8"?>
  <customerinfo xmlns="..."
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="... C:\Temp\Example\Customer.xsd">
    <CanadianAddress>
      <Address>434 Rory Road</Address>
      <City>Toronto</City>
      <Province>Ontario</Province>
      <PostalCode>M5L1C7</PostalCode>
      <Country>Canada</Country>
    </CanadianAddress>
  </customerinfo>' PRESERVE WHITESPACE)),
  (4, XMLPARSE(DOCUMENT
  '<?xml version="1.0" encoding="UTF-8"?>
  <customerinfo xmlns="..."
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="... C:\Temp\Example\Customer.xsd">
    <CanadianAddress>
      <Address>124 Seaboard Gate</Address>
      <City>Whitby</City>
      <Province>Ontario</Province>
      <PostalCode>L1N9C3</PostalCode>
      <Country>Canada</Country>
    </CanadianAddress>
  </customerinfo>'));

Example 8-8 Inserting multiple XML documents with XMLPARSE options

Looking at Example 8-8, the INSERT statement illustrates the flexibility you have when inserting XML data into pureXML columns. First, note that two separate XML documents were inserted in a single INSERT statement. Also note the use of various options that affect XML insert activity into pureXML columns, including DOCUMENT, PRESERVE WHITESPACE, and so on. Others include the ability to validate the XML data against a registered XML Schema Definition document (the XMLVALIDATE option). The first row in the INSERT statement (ID=3) specifically requests that whitespace be preserved in the on-disk format via the PRESERVE WHITESPACE option. In contrast, the second XML document doesn't make such a request, and therefore boundary whitespace will be stripped (the default) from this document.

White Space Preservation

The PRESERVE WHITESPACE option was specified in Example 8-8. This option, as its name indicates, instructs DB2 to preserve whitespace in the XML document when it parses out the on-disk format. This option specifies that all whitespace is to be preserved, even when the nearest containing element has the attribute xml:space='default'. Generally, this option is used when the highest level of fidelity is required for the XML document when it's stored in a pureXML column. For example, a typical trading order may be regulated such that it must be returned to an application in the manner in which it was inserted into the data server.

In contrast, the STRIP WHITESPACE option specifies that any text nodes containing only whitespace characters, up to 1,000 bytes in length, be stripped, unless the nearest containing element has the attribute xml:space='preserve'. If any text node begins with more than 1,000 bytes of whitespace, an SQLSTATE error is returned. Whitespace characters in a CDATA section are also affected by this option.

When stripping whitespace you should be aware that, according to the XML standard, whitespace also includes space characters (U+0020), carriage returns (U+000D), line feeds (U+000A), and tabs (U+0009) that are in the document to improve readability. When any of these characters appear as part of a text string, they are not considered to be whitespace. When you use the STRIP WHITESPACE option, the whitespace characters that appear between elements are stripped. For example, in the following XML fragment, the spaces between <a> and <b> and between </b> and </a> are considered to be boundary whitespace (Example 8-9).

<a> <b><c>This and That</c></b> </a>

Example 8-9 Whitespace in XML

If you want DB2 to preserve whitespace by default, you need to change the value of the CURRENT IMPLICIT XMLPARSE OPTION special register from STRIP WHITESPACE to PRESERVE WHITESPACE using the following command (Example 8-10):

SET CURRENT IMPLICIT XMLPARSE OPTION = 'PRESERVE WHITESPACE'

Example 8-10 Preserving whitespace implicitly for all XML documents
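As a concrete, hedged illustration of the two options, the following statements parse the fragment from Example 8-9 and immediately serialize it back, so you can see the effect from the CLP; the serialized output is shown as comments:

VALUES XMLSERIALIZE(
  XMLPARSE(DOCUMENT '<a> <b><c>This and That</c></b> </a>' STRIP WHITESPACE)
  AS VARCHAR(100))
-- <a><b><c>This and That</c></b></a>

VALUES XMLSERIALIZE(
  XMLPARSE(DOCUMENT '<a> <b><c>This and That</c></b> </a>' PRESERVE WHITESPACE)
  AS VARCHAR(100))
-- <a> <b><c>This and That</c></b> </a>

Only the boundary whitespace between elements is stripped; the text node This and That is untouched in both cases.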

Depending on the business problem you're trying to solve, or the regulatory compliance issues that may govern your day-to-day operations, you need to carefully consider and interpret the term fidelity and how it applies to your business. For example, some financial compliance regulations state that a transaction must look exactly as it did going into the database. The XML InfoSet standard actually breaks this rule, so depending on your application, the PRESERVE WHITESPACE option still won't guarantee the truest form of fidelity, since DB2 follows the XML standard. In these circumstances, DBAs often store an XML transaction in a LOB (for audit purposes) and in a pureXML column (for speedy search, retrieval, and transaction processing).

Adding Data Through an Application

The pureXML support in DB2 9 has interfaces to support the most popular programming methodologies (CLI/ODBC, Java, ADO.NET, Ruby on Rails, PHP, and so on). Typically you're not going to manually insert XML data into your tables. If you insert XML data programmatically, it's recommended that the application insert the data from host variables rather than literals. This way, the DB2 data server can use the host variable data type to determine some of the encoding information. For example, the code for a JDBC application that wants to read XML data in a binary format and insert it into a pureXML column may look like this:

PreparedStatement insertStmt = null;
String sqls = null;
int cid = 1015;
sqls = "INSERT INTO MyCustomer (Cid, Info) VALUES (?, ?)";
insertStmt = conn.prepareStatement(sqls);
insertStmt.setInt(1, cid);
File file = new File("commuterdata.xml");
insertStmt.setBinaryStream(2, new FileInputStream(file), (int)file.length());
insertStmt.executeUpdate();

Example 8-11 Java application inserting XML into DB2

Inserting Large Amounts of XML Data

If you're a DBA responsible for large data movement operations, you're already quite familiar with the IMPORT and LOAD utilities available in DB2. As of the time DB2 9 became generally available, large data population operations involving pureXML columns must be handled by the IMPORT utility.

Note
The population of large amounts of XML into pureXML columns is evolving throughout the DB2 9 release; you should monitor future DB2 9 releases for possible added support for pureXML columns in the LOAD utility.

If XML information is passed to you in an ASCII format (or you have the ability to easily put your XML documents into such a format), you can use the IMPORT utility to populate your pureXML columns in a more efficient manner than just running individual INSERT operations. What's nice about the IMPORT utility's support for pureXML columns is that you don't have to handle this process programmatically in an application; you simply need to have a delimited (DEL) ASCII file with the XML data. The XML data is placed into files that are separate from the delimited data.

IMPORT handles XML data bound for pureXML columns in a similar fashion to LOB data for LOB columns: a source file contains the actual relational data, and pointers direct the IMPORT utility to the location where the LOB data exists. In the case of pureXML columns, the pointer is called an XML Data Specifier (XDS), and it points to a file that contains the XML data. For example, consider the IMPORT file for the COMMUTERS table in Figure 8-16. This file points to three separate XML fragments (not whole XML documents) that contain individual commuter information.

Note
If you have some rows that don't have XML data associated with them, you can simply omit the "<XDS FIL=' '/>" pointer for those rows.
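In textual form, the DEL file that Figure 8-16 depicts might look like the following sketch; the file names are hypothetical, but the XDS tag is the format the IMPORT utility expects:

1,"<XDS FIL='commuter1.xml'/>"
2,"<XDS FIL='commuter2.xml'/>"
3,"<XDS FIL='commuter3.xml'/>"

Each row supplies the relational ID value plus an XDS pointer naming the file (in the directory given by the XML FROM clause) that holds that row's XML.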

Figure 8-16 IMPORT file specification of XML columns

Once you've defined the import source file, simply call it from the IMPORT command, as shown in Example 8-12.

IMPORT FROM 'C:\TEMP\XMLIMPORTFILE.TXT' OF DEL
  XML FROM 'C:\TEMP\'
  MODIFIED BY XMLCHAR
  MESSAGES 'C:\TEMP\IMPORTMESSAGES.TXT'
  INSERT INTO COMMUTERS

Example 8-12 IMPORT command
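If you would rather drive the same operation from an application or SQL tool instead of the CLP, DB2 9 can also run IMPORT through the SYSPROC.ADMIN_CMD procedure. A hedged sketch follows; note that, when run this way, the messages file must be kept on the server via MESSAGES ON SERVER rather than a client-side path:

CALL SYSPROC.ADMIN_CMD(
  'IMPORT FROM C:\TEMP\XMLIMPORTFILE.TXT OF DEL
   XML FROM C:\TEMP\
   MODIFIED BY XMLCHAR
   MESSAGES ON SERVER
   INSERT INTO COMMUTERS')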

Selecting Data from pureXML Columns

You can use XQuery, SQL, SQL/XML, or any combination thereof to retrieve data stored in pureXML columns, as illustrated below (Figure 8-17):

Figure 8-17 XQuery and SQL relationship to DB2 (clients issue SQL/XML or XQuery; the DB2 server routes both to its relational and XML storage services)

In practice, depending on the data you want to retrieve, you'll use different interfaces to solve different problems; ultimately, however, this depends on your skill set. If your enterprise is well versed in all the XML data access methods, you'll be able to use varying approaches to data retrieval that will yield benefits in development time, performance, efficiency, and more.

The easiest (yet perhaps most inefficient) method to retrieve data from a pureXML column is to use only SQL, as shown in some of the previous examples. However, generally more value can be delivered to an application that uses SELECT, SQL/XML, XQuery, or some combination thereof. The remainder of this section briefly discusses both SQL/XML and the XQuery approach to XML data retrieval.

An Introduction to SQL/XML

SQL/XML provides support for XML through the SQL syntax, an ANSI and ISO standard. It's somewhat popular because of the rich talent pool of SQL developers, and this standard syntax allows them to use their skill set to work with XML data in relational databases. DB2 supports SQL/XML. There are at least a dozen SQL/XML functions in DB2 9, and the list keeps growing, adding more and more capability to the XML-ability of SQL.

Some of the SQL/XML functions include:

XMLAGG
An aggregate function that returns an XML sequence containing an item for each non-null value in a set of XML values.

XMLATTRIBUTES
A scalar function that constructs XML attributes from passed arguments. This function can only be used as an argument of the XMLELEMENT function.

XMLCOMMENT
A scalar function that returns an XML value with a single XQuery comment node using the input argument as the content.

XMLCONCAT
A scalar function that returns a sequence containing the concatenation of a variable number of XML input arguments.

XMLDOCUMENT
A scalar function that returns an XML value with a single XQuery document node with zero or more child nodes. This function creates a document node, which, by definition, every XML document must have. A document node is not visible in the serialized representation of the XML; however, every document to be stored in a DB2 table must contain a document node. Note that the XMLELEMENT function does not create a document node, only an element node. When constructing XML documents that are to be inserted, it's not sufficient to create only an element node; the document must contain a document node.

XMLELEMENT
A scalar function that returns an XML value that is an XML element node.

XMLFOREST
A scalar function that returns an XML value that is a sequence of XML element nodes.

XMLNAMESPACES
Used to construct namespace declarations from passed arguments. This declaration can only be used as an argument of the XMLELEMENT, XMLFOREST, and XMLTABLE functions.

XMLPI
A scalar function that returns an XML value with a single XQuery processing instruction node.

XMLTEXT
A scalar function that returns an XML value with a single XQuery text node where the input argument is the actual content.

Example 8-13 creates a relational table and then populates it with INSERT statements.

CREATE TABLE CARPOOL (
  POOLER_ID INT NOT NULL,
  POOLER_FIRSTNAME VARCHAR(20) NOT NULL,
  POOLER_LASTNAME VARCHAR(20) NOT NULL,
  POOLER_EXTENSION INT NOT NULL
);

INSERT INTO CARPOOL VALUES
  (1,'George','Martin',1234),
  (2,'Fred','Flintstone',2345),
  (3,'Barney','Rubble',3456),
  (4,'Wilma','Rockafeller',9876);

Example 8-13 Sample tables for XML function examples

An SQL/XML statement that returns relational data within XML tags is shown in Example 8-14:

SELECT XML2CLOB(
  XMLELEMENT(NAME "CarPoolers",
    XMLELEMENT(NAME "ID", C.POOLER_ID),
    XMLELEMENT(NAME "Firstname", C.POOLER_FIRSTNAME),
    XMLELEMENT(NAME "Lastname", C.POOLER_LASTNAME),
    XMLELEMENT(NAME "Extension", C.POOLER_EXTENSION))
) AS "Result"
FROM CARPOOL C;

Example 8-14 SQL/XML publishing example

The results are returned as a character string to the application. You can see the use of SQL and SQL/XML on a relational table (note that this is not data stored in a pureXML column) in Figure 8-18.

Figure 8-18 SQL/XML results
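Building on Example 8-13, the following sketch exercises a few more of the functions listed above (XMLAGG, XMLATTRIBUTES, and XMLFOREST) to aggregate every carpooler into a single document; the element and attribute names are illustrative only:

SELECT XMLSERIALIZE(
         XMLELEMENT(NAME "CarPoolers",
           XMLAGG(
             XMLELEMENT(NAME "Pooler",
               XMLATTRIBUTES(C.POOLER_ID AS "id"),
               XMLFOREST(C.POOLER_FIRSTNAME AS "Firstname",
                         C.POOLER_LASTNAME  AS "Lastname"))))
         AS CLOB(1M)) AS "Result"
  FROM CARPOOL C;

Here XMLFOREST builds one element per argument, XMLATTRIBUTES attaches the id attribute, and XMLAGG rolls all four <Pooler> elements up under a single <CarPoolers> root.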

An Introduction to XQuery

XQuery is a programming language that was built from the ground up to query XML data. It's an open standard with the full backing of the World Wide Web Consortium (W3C). This language arose from the inefficiencies associated with using SQL to query XML hierarchies.

When you think about it, relational tables are sets of rows and columns. Operations on these relations are done with mathematics via abstracted algebraic expressions, namely SQL. SQL contains functions to work on relations, using a more natural language, with support for projections, restrictions, permutations, and more. However, XML data is hierarchical in nature, not flat like its relational counterpart. Therefore, a new language was built specifically to navigate hierarchical structures with the full flexibility of XML. The end result was the birth of XQuery.

For example, you might need to create XML queries that perform the following operations:

- Search XML data for objects that are at unknown levels of the hierarchy.
- Perform structural transformations on the data (for example, you might want to invert a hierarchy).

- Return results that have mixed types.

In XQuery, the FOR-LET-WHERE-ORDER BY-RETURN statement (commonly known as a FLWOR expression) is used to perform such operations. For example, the following XQuery statement retrieves the first <PostalCode> element in the working example document (Example 8-15):

XQUERY
FOR $y IN db2-fn:xmlcolumn('COMMUTERS.CUSTOMER')/CanadianAddress/PostalCode
RETURN $y

Example 8-15 Retrieving the PostalCode from an XML document

There are many operations and options available with XQuery. If you wanted to restrict the result set, you could enter something similar to the following (Example 8-16):

XQUERY
FOR $y IN db2-fn:xmlcolumn('COMMUTERS.CUSTOMER')/CanadianAddress
WHERE $y/PostalCode = "M5L1C7"
RETURN $y

Example 8-16 XQuery WHERE clause

DB2 also has a rich XQuery builder available in the Developer Workbench that can make building these expressions easier (Figure 8-19).

Figure 8-19 DB2 Developer Workbench

Note
The purpose of this section wasn't to teach you XQuery, but to introduce you to its existence. The DB2 Information Center and IBM Developers Domain Web sites have many tutorials and reference materials on this subject.
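As a bridge between the two languages, the SQL/XML scalar function XMLQUERY lets you embed an XQuery expression inside an ordinary SQL statement. A hedged sketch, assuming the COMMUTERS table from Example 8-4:

SELECT ID,
       XMLQUERY('$c//CanadianAddress/City/text()'
                PASSING CUSTOMER AS "c") AS CITIES
  FROM COMMUTERS

The PASSING clause binds the CUSTOMER column to the XQuery variable $c, so each row's document is navigated individually.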

SQL/XML, SQL, or XQuery: Best Practices

The following table summarizes which APIs are best suited to performing each kind of operation on XML data (Table 8-1):

Table 8-1 XML API Summary

Feature, Function, Benefit    SQL    SQL/XML   XQuery   XQuery & SQL/XML
XML Predicates                N/A    Best      Best     Best
Relational Predicates         Best   Best      N/A      Good
XML and Relational            N/A    Best      N/A      Best
Joining XML and Relational    N/A    Best      N/A      Best
Joining XML and XML           N/A    Good      Best     Best
Transforming XML Data         N/A    Awkward   Best     Best

INSERT, UPDATE, DELETE        Best   Best      N/A      N/A
Parameter Markers             Good   Best      N/A      N/A
Full Text Search              Good   Best      N/A      Best
XML Aggregation               N/A    Best      Awkward  Awkward
Function Calls                Best   Best      N/A      Best

Generally, the use of just SQL is only useful when retrieving a full document or fragment from a pureXML column (in which case you may want to have this information reside in a LOB, depending on the data within it and your application's access patterns). The reason that solely using SQL is so limiting is that you can't select data based on the contents of the XML document or fragment, just on the relational columns (if they exist) that surround it.

When SQL/XML is used as the top query language (and optionally mixed with XQuery), you get the richest functionality when it comes to querying the data, with the fewest restrictions. For example, with this method you can extract XML fragments from a pureXML column, use full-text searching, aggregate or group the data, and more. For the most part, this is the generally recommended method for pureXML data retrieval. Even if you don't need all of the capabilities this approach provides, it leaves the door open for future application flexibility.

As previously mentioned, XQuery is a language developed specifically for querying XML data. If you have an application that only works with XML data, this seems like an obvious choice. Finally, you can embed SQL within an XQuery statement. This allows you to filter out some of the XML data using relational predicates on the columns that surround it in the schema. However, if you need to perform data analysis queries with grouping and aggregations, you may prefer SQL/XML.
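As one concrete illustration of mixing relational and XML predicates in a single statement (a sketch, again assuming the COMMUTERS table), the XMLEXISTS predicate lets SQL filter rows on a condition inside the document:

SELECT ID
  FROM COMMUTERS
 WHERE ID < 100
   AND XMLEXISTS('$c//CanadianAddress[City = "Toronto"]'
                 PASSING CUSTOMER AS "c")

The ID < 100 predicate is evaluated relationally, while XMLEXISTS evaluates an XQuery path against each row's XML; only rows satisfying both are returned.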

Updating and Deleting pureXML Columns

To update or delete entire XML documents or fragments in a pureXML column, you can use traditional UPDATE and DELETE SQL statements. If you wanted to update an entire XML document in a pureXML column, you could enter an UPDATE statement similar to the following (Example 8-17):

UPDATE COMMUTERS
  SET CUSTOMER = XMLPARSE(DOCUMENT
  '<?xml version="1.0" encoding="UTF-8"?>
  <customerinfo xmlns="..."
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="... C:\Temp\Example\Customer.xsd">
    <CanadianAddress>
      <Address>434 Rory Road</Address>
      <City>Toronto</City>
      <Province>Ontario</Province>
      <PostalCode>M5L1C7</PostalCode>
      <Country>Canada</Country>
    </CanadianAddress>
  </customerinfo>')
WHERE ID = 4;

Example 8-17 Updating an XML column

As you might expect, you delete rows containing XML documents in the same manner as regular rows, using the DELETE command (Example 8-18):

DELETE FROM COMMUTERS WHERE ID = 4

Example 8-18 Deleting a row with XML columns

If you wanted to completely remove an XML document or fragment from a pureXML column without deleting the row, you would use the UPDATE command and set the value to NULL (Example 8-19):

UPDATE COMMUTERS SET CUSTOMER = NULL WHERE ID = 3

Example 8-19 Setting an XML value to NULL

At the time DB2 9 became generally available, the XQuery standard really only defined what its name implies: query. There was no standardized method for updating or deleting fragments of an XML document. As you've seen in the

previous examples, you can manipulate data in existing pureXML columns, but only when you work on the entire document or fragment in a single operation. As XML proliferates into the mainstream, one should expect more and more operations to occur transactionally at child levels of the hierarchy. For example, a commuter in the working example could move to a new location in the same city; this would likely require that the XML data be updated, but not all elements (you don't need to update <Country> or <Province>). If you consider these types of operations across a large number of documents (within a single column or spread across multiple pureXML columns), the overhead of having to work with the entire document or fragment could be quite high.

Because there was no defined standard for updating parts of an XML document or fragment when DB2 9 became generally available, a decision was made to provide the DB2XMLFUNCTIONS.XMLUPDATE stored procedure to offer the capability to update portions of an XML document or fragment in the interim. Although the source for this routine is provided for you, you are required to build and install this stored procedure. (It's not considered part of the pureXML support; rather, it is a work-around method until the XQuery standard evolves to support such activity.)

Note
This space should be watched closely. As the standard evolves, the reliance on this stored procedure may no longer be required.

Indexing pureXML Columns

Indexing support is available for data stored in XML columns. The use of indexes over XML data can improve the efficiency of queries issued against XML documents or fragments stored in a DB2 9 pureXML column. As with a relational index, an index over XML data indexes the contents of the column. They differ, however, in that a relational index indexes an entire column, while an index over XML data indexes parts of the column. (It can also index the entire column.) You can define single or multiple XML indexes on a single pureXML column in the same manner as a relational index. For DB2 XML indexes, you indicate which parts of the XML document stored in the column should be indexed. DBAs can specify the parts of the XML document or fragment they want indexed using an XML pattern expression (which is essentially a limited XPath expression).

Example 8-20 illustrates how an index is created on an XML column:

CREATE INDEX XMLINDEX ON DEPT(DEPTDOC)
  GENERATE KEY USING XMLPATTERN '/dept/name' AS SQL VARCHAR(30);

Example 8-20 XML index creation

DB2 pureXML indexes are very powerful because they can index any element, attribute, or text within an XML document (or combination thereof), or the entire document itself (although you are not forced to do so). In addition, DB2 pureXML indexes can index repeating elements. Figure 8-3 illustrated the complexities that arise when shredding XML data to relational format when the schema changes; the example was the addition of a new phone number to an employee's record. In that example, the XML document evolved such that it had multiple <phone> elements. DB2 9 can index these multiple elements. DB2 pureXML indexes are very powerful with respect to this capability when compared to other database offerings and their XML support.

The DDL used to create an XML index in DB2 9 is shown below (Figure 8-20):

CREATE [UNIQUE] INDEX index-name ON table-name (xml-column-name)
  GENERATE KEY USING XMLPATTERN xmlpattern
  AS SQL { VARCHAR(integer) | VARCHAR HASHED | DOUBLE | DATE | TIMESTAMP }

where xmlpattern is a path built from child (/) and descendant-or-self (//) steps over element tags, with * as a wildcard.

Figure 8-20 XML index creation syntax

In the previous figure you can see that the actual search path differs from a traditional relational index. When you index a pureXML column, you index an XMLPATTERN that points to the location of the information you want, and then optionally modify the pointer such that it returns text, attributes, elements, and so on.

Note
XMLPATTERN is a notation that's based on XPath, only it has no predicates. An XMLPATTERN expression is simply formulated with child axis (/) and descendant-or-self axis (//) operations.

You'll also note the AS clause in the previous DDL, which specifies the data type to which indexed values are converted before they are stored. Values are converted to the index XML data type that corresponds to the specified index SQL data type, which helps the index manager perform fast and efficient searches on the XML data. More specifically, the AS clause is required because the DB2 9 engine was built for flexibility, and one of those flexible options is not requiring an associated XML Schema Definition document in order to store your XML documents. Without an XML Schema Definition document there would be no way for DB2 9 to know what data type to use in the index for a specified XMLPATTERN expression. For example, you use AS SQL VARCHAR(X) to index node values of a known maximum length, and AS SQL DATE or AS SQL TIMESTAMP for date-based nodes.

Note
The AS SQL VARCHAR HASHED option is typically used when you don't know the length of the XML data, or for nodes whose lengths change frequently. This option hashes the string values of your nodes. That may seem optimal, but these indexes won't support range predicate queries, just equality ones.

If you're indexing numeric data, use the AS SQL DOUBLE clause. For simplicity, the DB2 9 technology offers this single numeric data type for indexing XML numeric-based data. The reason for this is simple: instead of weighing down DBAs with the complexity of choosing among multiple numeric-based data types, it was deemed a better option to cast all numeric-based data into a DOUBLE data type. The consequence is that the DBA could lose precision with this data type, which creates the side effect of including some elements that wouldn't otherwise be in the index if it were able to accommodate a more precise numeric value. The point here is that you'll always get the data you're looking for (and it'll be easier to define the structure to get it), though the index may contain more entries than needed. Your queries will filter out these extra results anyway and still get the data fast.

Consider the following XML fragment that describes books for an online retailer (Example 8-21):

<book>
  <authors>
    <author id="74">George Baklarz</author>
    <author id="85">Paul Zikopoulos</author>
    <author id="15">Roman Melnyk</author>
  </authors>
  <title>The Rise and Fall of White Knuckle Airlines</title>
  <price>35</price>
  <keywords>
    <keyword>business</keyword>
    <keyword>success</keyword>
    <keyword>failures</keyword>
  </keywords>
</book>

Example 8-21 XML book descriptor

The following table shows some examples of the information that would be indexed by different XMLPATTERN expressions (Figure 8-21):

Figure 8-21 XMLPATTERN examples
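Using the book descriptor above, here are two hedged index sketches, assuming a hypothetical BOOKS table with an XML column named DETAILS (the table, column, and index names are illustrative only):

CREATE INDEX IDX_BOOK_PRICE ON BOOKS(DETAILS)
  GENERATE KEY USING XMLPATTERN '/book/price'
  AS SQL DOUBLE;

CREATE INDEX IDX_BOOK_KEYWORD ON BOOKS(DETAILS)
  GENERATE KEY USING XMLPATTERN '/book/keywords/keyword'
  AS SQL VARCHAR(30);

The second index covers a repeating element: every <keyword> in every document contributes its own index entry, which is exactly what a shredded relational design struggles to accommodate.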

If you wanted to index the commuter pool (Example 8-4) based on Postal Codes, you would build the XMLPATTERN expression shown in Example 8-22:

customerinfo/CanadianAddress[2]/PostalCode

Example 8-22 Indexing an XML document based on PostalCode

Note
The [2] notation in this XMLPATTERN expression indicates the second occurrence of the element, so this notation would index the Postal Code of the second Canadian address in the document.

While the previous XMLPATTERN expression indexes the <PostalCode> element's second occurrence, you could index the actual text of every <PostalCode> using the text() function, as follows (Example 8-23):

customerinfo/CanadianAddress/PostalCode/text()

Example 8-23 Indexing all PostalCode occurrences

As you might expect, it can become rather complex to generate an XML index for a larger document. For this reason, it's often easier to create XML indexes in DB2 9 using the Control Center or the Developer Workbench. These tools provide graphical tooling to inspect the hierarchical structure of an XML document (on the file system or within a DB2 pureXML column) and build the index graphically. For example, the following steps illustrate the process of building the customerinfo/CanadianAddress[2]/PostalCode index on the working example using the Control Center.

1. Start the Control Center by selecting it from the Start menu or entering db2cc from the CLP.

2. Select the Indexes folder, right-click, and select Create Index (Figure 8-22). The Create Index wizard opens.

Figure 8-22 Using the Control Center to create an index

3. Select the table you want to index by specifying its schema and name, select Yes in the XML columns options box, and click Next (Figure 8-23).

Figure 8-23 Specifying the table schema and name

When you select a table to index that has an XML column, the XML column options box automatically appears. If you select the Yes radio button, it tells the Control Center that you want to build an XML-based index.

4. Select the column you want to create the XML index on (only columns of type XML are shown in the Available XML columns box), move it to the Selected XML column box by clicking the arrow button, then click Next (Figure 8-24).

Figure 8-24 Selecting the XML column

5. Click Open Document and import the XML document you want indexed, either by selecting it from a column where it's already been inserted (the Use an XML document from the column on which the index is to be built radio button) or from the file system (the Use the following local XML instance document radio button), then click OK. The Document from the selected XML column box is loaded with the selected XML document or fragment (Figure 8-25).

Figure 8-25 Viewing document data to assist in index creation

6. Select the PostalCode entry in the second CanadianAddress element, click Add Index, select Selected element only, and click OK. The Indexes field is populated with the XMLPATTERN for this new index (Figure 8-26).

Figure 8-26 Selecting the PostalCode element

You can define more than one index at a time when specifying the XMLPATTERN for an XML index. For example, if you wanted to create a second XML index that used the text() function, you would click Add Index and select the Text children ./text() option from the Add Index window.

DB2 automatically generates the names of the XML indexes based on the elements you select. You can override this by double-clicking the Index name field and entering your own name. Since you can't change the name of an XML index after it's created, you should ensure that you specify this option at index creation time if the default names generated by DB2 aren't suitable for your environment. XML indexes, for the most part, cannot be altered after they are created.

In addition, if you are dealing with larger XML documents, you can search through them by clicking Find node (Figure 8-27).

Figure 8-27 Finding a node type

When you're finished specifying all the XML indexes you want to create on your pureXML column, click Finish. You should be able to see your indexes in the Control Center's Index Object View (Figure 8-28).

Figure 8-28 Viewing indexes on the COMMUTERS table

Creating XML indexes is much easier using the Control Center than manually using the CLP, unless you're really familiar with the XML document or fragment you want to index. Even if you know the structure of the XML index, the Control Center's XML index creation facility gives you the opportunity to create multiple indexes very quickly.

The indexing features associated with pureXML are also tightly integrated into popular IDEs. Figure 8-29 shows the facility in Visual Studio 2005 that can be used to generate the exact same index without ever leaving the development environment:

Figure 8-29 Creating an index through Visual Studio 2005

Full-text XML Indexing

The XMLPATTERN index is very useful for searching subsets of an XML document or fragment (though, as previously stated, it can be set to fully index the column's contents as well). In DB2 9, the DB2 Net Search Extender (DB2 NSE) is available free for all editions (it was a chargeable component in DB2 8). In DB2 9, the DB2 NSE has been enhanced for pureXML and offers the ability to create a fully XML-aware index over an entire XML document, which aids free-form linguistic searching.

You can also use the DB2 NSE to index partial documents, but it's likely that you'll be indexing text within large fragments for free-form search; for example, a book chapter in an entire book delivered as an XML file. An NSE index on the COMMUTERS table could look like the following (Example 8-24):

CREATE INDEX IDX_COMMUTERS FOR TEXT ON COMMUTERS(CUSTOMER)

Example 8-24 Text index on the COMMUTERS table

The DB2 NSE provides a set of easy-to-use functions that are simply added to SQL statements to direct the DB2 run-time engine to invoke such an index. In addition, NSE provides complex search criteria such as synonym matching, stemming, and so on. For example, you may want your query to use an advanced text search index built by the DB2 NSE that recognizes the various forms of Whitby used in a postal address (common forms are Whitby, Wtby, and Wby). The SQL for such a request could look like the following (Example 8-25):

SELECT * FROM COMMUTERS
  WHERE CONTAINS(CUSTOMER,
    'SECTIONS("/CanadianAddress/City")
     FUZZY FORM OF 42 "Wtby" | STEMMED FORM OF "Whitby"') = 1

Example 8-25 NSE search example

Management of the NSE is integrated into the Control Center, which makes it even easier to use this free capability (Figure 8-30):

Figure 8-30 Creating an NSE index through the Control Center
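Before an NSE text index such as the one in Example 8-24 can be created or used, the NSE instance services must be started and the database enabled for text. A hedged sketch of the usual db2text command sequence, assuming the MYXML database from Example 8-2:

db2text START
db2text ENABLE DATABASE FOR TEXT CONNECT TO MYXML
db2text CREATE INDEX IDX_COMMUTERS FOR TEXT ON COMMUTERS(CUSTOMER) CONNECT TO MYXML
db2text UPDATE INDEX IDX_COMMUTERS FOR TEXT CONNECT TO MYXML

The UPDATE INDEX step performs the initial population of the text index; after that, the index can be refreshed on demand or on a schedule.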

XML Schema Repository (XSR)

DB2 9 doesn't just have industry-leading support for XML; it also has a rich and flexible XML Schema Repository (XSR) that can be used to store XML Schema Definition (XSD) documents for subsequent data validation on insertion into a DB2 9 database. When storing data in DB2 9, all XML documents must be well-formed. This is much different from validation. Validation says that an XML document conforms to a strongly typed document (the XSD document), while an XML document that is merely well-formed may not be validated, but it conforms to the W3C rules for well-formedness (matching opening and closing tags, a single root node, and more).

Basically, you need to know that DB2 can only store well-formed XML documents (or fragments of XML, for that matter) in a pureXML column; however, whether that XML data is valid according to a specific XML Schema is totally up to you. This provides immense flexibility to store strongly typed (validated), typed (not validated by an XML Schema), and schema-less documents: a key principle for any XML data server. What's more, the option to have a strongly typed, typed, or schema-less document stored in a pureXML column can be applied on a row-by-row basis within the same table (another mostly unique feature compared to other relational vendors' technology).

Finally, the XSR in DB2 9 supports what is referred to as schema evolution. The schema evolution concept recognizes that not all participants in an XML message exchange may be at the same XML Schema version for your XML applications. For example, consider a brokerage clearing house that clears transactions for its clients. Perhaps the Canadian trading partner is at Version 1.1 of your XSD document while the U.S. trading house is at Version 1.2. DB2 gives you the ability to support schema evolution by allowing you to store XSD documents that are orthogonal and intersect. In other words, you can store multiple iterations of the same schema and apply them to each row. Again, this is unique in the industry: other relational vendors don't support this concept, which severely limits the flexibility (the whole point of XML) of an XML solution and creates a severe burden on the DBA to excessively manage tables and policies to support the XML application.

Before you can use an XML Schema Definition document to validate your XML data, you need to register it for use with DB2 in the XML Schema Repository (XSR). The ability to do this is part of the DB2 9 engine, and there are multiple interfaces to accomplish this task. For example, you can use the Control Center, the DB2 Developer Workbench, Visual Studio .NET, the DB2 command line processor (CLP), a stored procedure, and more, as shown in Figure 8-31.

Figure 8-31 Registering an XSD document in the DB2 9 XSR

Obviously, it stands to reason that you can't register an XSD document if you don't have one. In this chapter, we're trying to give you the basics of the pureXML support in DB2 9. Typically, a DBA isn't the person writing the XSD document. It's safe to say that there are literally hundreds of tools you can use to create an XSD document. Since an XSD document is merely an XML document, it doesn't matter how it was created; it can still be stored in DB2.

DB2 9 has some integration features that make some toolsets easier to work with than others. For example, from the DB2 Developer Workbench, Oxygen XML Editor, Visual Studio (and others), you can use the integrated XML Schema Definition document editors and register your schemas right from the design template. An example of using the native XSD editor in Visual Studio (which has nothing to do with DB2 at all) and registering the schema in the DB2 9 XSR with a single click is shown in Figure 8-32.

Figure 8-32 Using native tooling to compose XML Schema Definition documents

So why would you want to store your XSD documents in the DB2 9 XSR, beyond the obvious benefits of a simplified topology? When you store your XSD documents in the XSR, the performance of validation can improve significantly. Generally, validation is a heavy process and can slow down application performance, so anything you can do to make your application go faster is generally a good thing. The DB2 9 XSR services help to improve the performance of validation because you don't have to locate these validation documents across a network, file system, or other storage configuration to perform validation. In addition, when you register an XSD document in the DB2 9 XSR, that document is stored in a parsed-out format within the catalog tables to assist in even faster validation. During the validation process of an XML document, DB2 9 annotates the nodes in the XML hierarchy with data rule information that resides in the XML Schema Definition document.

In DB2 9, you can only validate XML documents against an XSD document. Support for document type definition (DTD) documents in DB2 9 is for entity resolution, not validation. Generally, today's XML applications have evolved to use XSD documents exclusively for validation, since XSD documents are much easier to work with (they are themselves XML documents) and have gained industry and standards acceptance.

XML Schema Definition Document Registration Example

It's pretty easy to register XSD documents using some of the tooling that we've shown you thus far in this chapter. However, if you have many XSD documents to register, you may want to do it programmatically. Let's assume you have a table created by the following data definition language (DDL):

create table certtable (test xml)

To register an XML Schema Definition document with the URI http://ibmpress.certprepbooks.com for the certtest.xsd file, whose schema identifier certtest is part of the database schema books, using the DB2 CLP, you would enter the following:

register xmlschema 'http://ibmpress.certprepbooks.com'
    from 'file:///c:/certtest.xsd' as books.certtest complete

After the XSD document is registered with DB2, you can validate your XML data as it is inserted by using the XMLVALIDATE function, as shown below:

insert into certtable(test) values
    (xmlvalidate(? according to xmlschema uri 'http://ibmpress.certprepbooks.com'))

Of course, if you wanted to insert a document without validation, you could use a command similar to this:

insert into certtable(test) values (?)

NOTE: You can use any of the DB2 toolsets, or integration points, to work through this example in a GUI. We're just trying to illustrate different methods of working with your XML artifacts throughout this section. In fact, you can also perform any of these functions from a Java or .NET application, and more.

DB2 can also try to deduce the XSD to use for validation based on the XML document itself, without you explicitly specifying it in the XMLVALIDATE function, by guessing from the document's contents.

To attempt to validate your XML document without specifically pointing to an XML Schema Definition, you could enter a command similar to the following:

insert into certtable(test) values (xmlvalidate(?))

As you can imagine, performing XML validation consumes CPU cycles on your data server. If you're managing a high-volume transactional data server, you shouldn't perform data validation unless it's really needed.

Since the XSR is really just a section of the DB2 storage engine, you can interact with it and view the registered XML Schema Definition documents in a natural way. For example, the following simple query would return a list of all registered XSD documents:

select * from syscat.xsrobjects

You could find out which XML Schema Definition document was used to validate a specific document using a query similar to the following:

select xmlxsrobjectid(test) from certtable

Summary

In this chapter, we have examined the new XML capabilities within DB2, including the new XML data type, the pureXML implementation, and the SQL and XML features supporting this new data type. DB2 has supported XML structures for a number of releases; those earlier functions allowed a developer to store and retrieve XML information from a DB2 database. While the DB2 XML Extender was acceptable for XML manipulation, it had a number of limitations that reduced its usability and performance. The new pureXML support in DB2 adds significant new capability to the database and makes the manipulation of XML much more efficient and powerful.



Rebecca Bond, Kevin Yeung-Kuen See, Carmen Ka Man Wong, Yuk-Kuen Henry Chan

Understanding DB2 9 Security

Understanding DB2 9 Security is the only comprehensive guide to securing DB2 and leveraging the powerful new security features of DB2 9. Direct from a DB2 Security deployment expert and the IBM DB2 development team, this book gives DBAs and their managers a wealth of security information that is available nowhere else. It presents real-world implementation scenarios, step-by-step examples, and expert guidance on both the technical and human sides of DB2 security.

This book's material is organized to support you through every step of securing DB2 in Windows, Linux, or UNIX environments. You'll start by exploring the regulatory and business issues driving your security efforts, and then master the technological and managerial knowledge crucial to effective implementation. Next, the authors offer practical guidance on post-implementation auditing, and show how to systematically maintain security on an ongoing basis.

Coverage includes
Establishing effective security processes, teams, plans, and policies
Implementing identification and authentication controls, your first lines of defense
DB2 in Windows environments: managing the unique risks, leveraging the unique opportunities
Using the new Label Based Access Control (LBAC) of DB2 9 to gain finer-grained control over data protection
Encrypting DB2 connections, data in flight, and data on disk: step-by-step guidance
Auditing and intrusion detection: crucial technical implementation details
Using SSH to secure machine-to-machine communication in DB2 9 multi-partitioned environments
Staying current with the latest DB2 security patches and fixes

Hardcover, 432 pages

Table of Contents
Foreword
Preface
Acknowledgments
About the Authors
1. The Regulatory Environment
2. DB2 Security: The Starting Point
3. Understanding Identification and Authentication: The First Line of Defense
4. Securing DB2 on Windows
5. Authorization: Authority and Privileges
6. Label Based Access Control
7. Encryption (Cryptography) in DB2
8. Ready, Set, Implement?
9. Database Auditing and Intrusion Detection
10. SSH for Data Partitioning on UNIX Platforms
11. Database Security: Keeping It Current
12. Final Thoughts: Security and the Human Factor
Appendixes
A. Independent Security Packages
B. Kerberos
C. DB2 Audit Scope Record Layouts
D. DB2 Audit Additional Documentation
E. Security Considerations for DB2
F. Glossary of Authorization IDs
G. LBAC-Related SYSCAT Views
H. Security Plug-In Return Codes
I. Detailed Implementation for the Case Study in Chapter 3
Index

CHAPTER 11

Database Security: Keeping It Current

If DB2 database security were your boss, she would constantly be asking, "What have you done for me lately?"

FIRST WORDS

Maintenance is a bore. If you own a home, you know that without preventive and necessary maintenance, the value of your home will decrease year after year. So, even though no one wants to do maintenance, it is a necessary duty to protect and preserve what you have. Would you really invest a million dollars in a home and never bother to keep it clean? It's the same concept with DB2. Database security can be a challenge to set up, but once that work is completed, the task of keeping the database secure is never-ending. This chapter deals with those requisite and perpetual needs and gives hints and tips to help you maintain the secure environment that everyone has worked so hard to foster.

DILIGENCE

With every software product, there is the potential for discovery of previously unknown security vulnerabilities. These types of discoveries can be identified in several different ways. For example, IBM may identify vulnerabilities during additional development or testing. Outside security firms or persons who work with the product also may identify vulnerabilities. In a worst-case scenario, if a new type of attack or threat were perpetrated or proposed, the act itself would also point toward a new vulnerability that would result in a mitigation effort.

Although vulnerabilities native to the DB2 product have, so far, been few, vigilance is always necessary, especially because the effort to gain inappropriate database access is so tempting. Reading the numerous high-profile headlines documenting recent security breaches that resulted in lost or compromised personal information simply underscores how attractive the data housed in databases is to thieves, hackers, and disgruntled employees.

If a vulnerability involving or threatening DB2 is identified that requires action or mitigation, IBM announces the vulnerability and provides a solution, either in the form of advice or through the release of a Fixpack. As a rule of good practice, it is important for the DB2 DBA to be aware of the issuance of an IBM DB2 alert or the release of a new Fixpack, regardless of whether the issue is security related.

DB2 User Community Resources

Fortunately, IBM provides several ways to stay aware of critical information about DB2. To be proactive and make sure information is received as soon as it is released, sign up for e-mail alerts at www-306.ibm.com/software/support/einfo.html (see Figure 11.1). After you sign up, if new vulnerabilities are discovered, specifics will be sent directly to you via alert e-mails so that you can take appropriate action immediately.

Figure 11.1 Subscribing to DB2 alerts

The DB2 support page provides a wealth of information and can be used to research any issue, including those related to security. It is also a portal for downloading Fixpacks. You can reach the main DB2 support page at www-306.ibm.com/software/data/db2/udb/support/.

The DB2 9 Information Center is another valuable tool for researching security settings (in addition to being a powerful tool for understanding DB2 in general). You can access it via boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.db2.udb.doc/welcome.htm.

Another great source of information is provided via RSS (Really Simple Syndication) news feeds. Using a news reader tool (many are freely available), you can keep current via pushed DB2 information. Although this is not necessarily limited to security information, it is a valuable way to keep current on product information. To subscribe, navigate your Internet browser to www-306.ibm.com/software/data/db2/udb/support/. Choose the RSS option (news feed of new content), as shown in Figure 11.2, and follow the instructions to subscribe.

Figure 11.2 Subscribing to the DB2 RSS news feed

A free resource that posts articles of a security nature is DB2 Magazine. This magazine is a premier resource for information specific to DB2 and provides informative articles written by individual contributors and by industry-recognized experts.

In addition to the IBM resources, DBAs worldwide routinely review the DB2-L list maintained by the International DB2 Users Group (IDUG). This list is not exclusive to security issues, but if

you are a subscriber (a free registration is required), you can post questions regarding DB2 security or any other DB2 topic of interest on the IDUG website.

Security Information on the Web

Beyond the normal DB2 informational resources, many forums and websites can prove to be extremely useful for learning about security and for discovering how to mitigate newly found vulnerabilities. Any search on a major search engine can provide a wealth of information.

FIXPACKS

Fixpacks hold the code that updates DB2 installations. IBM creates Fixpacks to update product features and make any necessary repairs to product code, whether security related or not. Keeping current with DB2 Fixpacks is one of the best ways to guard against security vulnerabilities, and it provides the extra benefit of making sure your DB2 installation is always at the most current product level. If a security vulnerability is discovered, IBM will recommend a product update via a Fixpack.

Fixpacks are tied to the product version and operating system. For example, a Fixpack for Windows operating systems should not be applied to a DB2 installation on a UNIX system, and vice versa. Each Fixpack includes all necessary documentation, including information on prerequisites, and typically includes Release Notes, a Readme file, and a detailed list of all items fixed through the application of the Fixpack. Each of the items fixed or upgraded via the Fixpack installation is known as an APAR (Authorized Program Analysis Report). Reviewing the APAR list will provide information on the items fixed in the Fixpack, and if security vulnerabilities are being mitigated via the Fixpack, they will be mentioned in this list.
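Before deciding whether a newly announced Fixpack applies to you, you first need to know the level a system is currently running. As a minimal sketch (output details vary by release, and the SYSIBMADM administrative view assumes DB2 9 on Linux, UNIX, or Windows), the current level can be checked from the operating system prompt or through SQL:

db2level
    (reports the DB2 copy's version, service level, build level, and Fixpack number)

select inst_name, release_num, service_level, fixpack_num
  from sysibmadm.env_inst_info;
    -- the same level information, exposed through an administrative view

Comparing this output against the Fixpack and APAR lists on the DB2 support page tells you at a glance whether a published security fix is already in place on a given server.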

One word of caution is necessary regarding Fixpacks. Applying Fixpacks in production environments without proper testing may result in unwanted problems. It is always best to apply Fixpacks in development and test environments first, with an appropriate testing cycle to ensure that the new DB2 level does not conflict with the current production software. Most software vendors offer certification on certain code levels of DB2, and that certification may lag behind the Fixpack release, making it unrealistic to apply a newly released Fixpack immediately. The Fixpack go/no-go decision becomes even more difficult when security vulnerabilities have been mitigated in a Fixpack release, because the risk of not applying the Fixpack must be considered. Keep in mind that once a security vulnerability has a fix and that fix has been published (as in the case of a Fixpack release or an IBM alert), it is only a matter of time before some exploit of that vulnerability is introduced into the wild.

KEEPING CURRENT GOES BEYOND DB2

Although the focus of this book is DB2 security, remember that few environments are simple these days. It is likely that any mid-to-large enterprise environment has more than one database product and multiple software applications. Throw operating systems into the mix, and there is sufficient complexity to keep any security junkie happy. Even if your role is only that of DB2 DBA, keeping an eye on the rest of the environment should be a consideration if possible. If you want to be a hero, saving the company from a security lapse should accomplish that goal, whether the save is DB2 related or not.

Consider the entire architecture, not just the databases, when thinking about current and timely application of upgrades and fixes. Third-party applications and operating systems typically publish upgrade and bug-fix information on their websites prior to public product or fix releases. The task of checking these resources should be formally assigned to someone in the enterprise. Subscriptions (e-mail or postal mail) to product security release information, available as a free service from most vendors, are one way to ensure that information on vulnerabilities and their appropriate fixes is received as it becomes available. Consider subscribing to e-mail notifications for all products used at your organization, even if they are not your primary responsibility. The information may be redundant because someone else in the organization may have received it, but the reality is that with heavy workloads being the norm in many shops, not all e-mails get someone's attention. Make it your job to see that fixable vulnerabilities do not become hacks, regardless of whether the vulnerability is directly related to your daily tasks. It is important to remember that the enterprise databases are actually the goal of most hacks, regardless of whether those hacks are initiated at the database level or via another application or method.

MIXED DATABASE ENVIRONMENTS

If there is more than one database product in an enterprise, the task of database security takes on several additional facets. Not only will the DBA need to keep current on DB2 security and all the supporting technical architecture, but he will also potentially have to be a security expert on other database products. Because each database product brings new security challenges, managing a mixed database environment requires a heightened awareness of all the players, their pluses and minuses in the realm of security, and their willingness to provide patches quickly should a new vulnerability be discovered. Much like breaking the string holding a necklace of pearls, it takes only a small amount of damage to cause some serious issues. Your database environment is most likely only as secure as your least-secure database product.

If you are a DBA tasked with managing databases in a mixed environment and want to maintain the strongest possible security architecture, consider these points:

Consolidate database products wherever possible. Do the analysis necessary to determine the smallest common denominator of database products that will provide functionality for all enterprise applications. If you support three database products, eliminating one will reduce your exposure by a third, so this effort can be highly effective from a security standpoint.

Eliminate databases that are not needed. Believe it or not, many companies actually keep databases around that are not being used. Attempt to discover and eliminate these. Do an inventory of databases and their owners, and make those owners justify keeping their databases around. Although this is a good idea even if you support only one database product, it is especially useful in mixed database environments, because eliminating unused databases may also eliminate the need for multiple database product support.

Make a strong case for standardizing on one database product. Standardizing on one database product may not prevent the need for extra product support, but not standardizing encourages extra product support. Do the research to determine the product that offers the best security for your organization and present a strong case to management.

VULNERABILITY ASSESSMENTS

As part of an active security awareness exercise, consider performing vulnerability assessments that include all managed databases on a regular basis. Alternatively, you can purchase services or applications that provide information and detail for areas that need security improvements.

When performing a vulnerability assessment, you are attempting to look at your technical infrastructure in the same way that a hacker or disgruntled employee would view it. You are attempting to find flaws in the security defense, not just in the databases, but throughout the enterprise. Because DBAs are so involved in the day-to-day operations, it is hard for them to notice minor security flaws the way a hacker would. For this reason, it is usually best to either use purchased tools for this task or hire specialists to perform the vulnerability assessments.

INTRUSION DETECTION

Perimeter-based intrusion detection is typically the most widely deployed technological enterprise solution for discovering attacks. Because managing it is not typically a DBA responsibility, it might not be immediately clear why this topic would be included in a chapter on keeping DB2 security current. If you think of the data in the database as the target of an intrusion attempt, however, it becomes clear that the DBA has a major role both in being aware of attacks and in preventing future attacks (keeping current).

Perimeter-based security focuses on external attacks and is a strong component of any security architecture. Unfortunately, if this is the only approach taken, the possibility of an internal attack might be overlooked. If recent news reports are any indication, internal attacks are actually more prevalent than external ones. After all, what better way to download millions of credit card numbers than to secure a position at a financial institution with legitimate access to the information? All it takes is a few keystrokes of a SQL query to obtain the information if the employee has trusted access to the data.

When thinking about intrusion detection from the DBA standpoint, first realize that the discovery of an attack comes after the attack. If an attack is underway, by definition, attack prevention has failed. From that viewpoint, the goal shifts to terminating the attack as close to its initiation as possible. The good news is that DB2 offers several ways to aid the DBA in discovering intrusion attempts from both internal and external sources. Of course, DB2 auditing (discussed in Chapter 9, "Database Auditing and Intrusion Detection") easily lends itself to intrusion detection. The other mechanisms discussed in Chapter 9, such as triggers and stored procedures, can also be used. If you haven't already done so, read that material when thinking about setting up intrusion detection mechanisms.

What might not be as obvious is that you can use DB2 snapshots to aid in intrusion detection, too. Although DB2 snapshots are typically used with a goal of performance improvement, they provide useful information that, with analysis, can aid in the security effort. One issue to consider when thinking about intrusion detection is that an intrusion will most likely present as an abnormality. Discovering abnormalities is possible only when you understand the norms, so just issuing snapshots, hoping for some needle in the haystack, is not an

effective approach. Like most security processes, this one requires some analysis, maintenance, review, and change to keep it current. If there is a table in the database that holds credit card numbers and addresses, for example, it might be a good table to study via snapshots. The TABLE snapshot may be highly valuable when used as a way to discover values outside of the normal ranges for table reads. If on Monday through Saturday the number of rows read from the table is typically 8,000, and on Sunday the number read is typically 1,000, you have the basis for norms. If those numbers begin to increase for reasons not related to business growth, you might have an abnormality that you should research.

Another example is using snapshots to gather all dynamic SQL (although you have to be aware of a potential performance hit) in conjunction with setting the STATEMENT monitor switch. If the DBA knows, for instance, that a sensitive table is accessed only via a stored procedure launched by the application, but suddenly sees that an unrecognized or poorly formed SQL statement has been running against that table, discovery action could be initiated. Obviously, the DB2 audit facility provides functionality similar to the information that can be gathered using snapshots, but because not all corporations run a DB2 audit, or do not fully utilize it, snapshots can provide some measure of intrusion detection when properly thought out and initiated. To learn more about all the functionality of DB2 snapshots, including how to set the necessary monitor switches for information gathering, see the DB2 Information Center or read about the topic in the DB2 Command Reference.

PREPARE FOR A BREACH

If a breach occurs despite all your valiant efforts, don't panic. Undoubtedly, your heart rate and blood pressure will rise, but giving in to despair will not enable the type of response that is necessary. As with most situations in life, if you are prepared for any eventuality, it is easier to achieve a positive outcome. Prepare yourself and your organization now for the possibility of a breach.

The best way to prepare for dealing with a hack is to practice a response. A highly effective way to ensure preparedness is to simulate a response to a breach of highly sensitive data. Before beginning this type of trial run, it is a good idea to create a checklist of items that need to be completed in the event of a breach. During the exercise, someone should function as an observer, noting any items that are not on the checklist (but should be), items that should be improved, items that should be removed, and items that are effective as written.

In preparing for a response to a breach, preparation topics often overlooked include the following:

Keep a current list of all individuals who must be contacted in the event of a breach. If a breach occurs, who would you call? Do you have current contact information for those individuals? Should you call each person on the list or delegate that task to others? Who should be called first, second, and so on? How is the list to be stored? If electronically, what happens if the system is so compromised that the contact list cannot be accessed? Is the list maintained in a current state? Is the list reviewed regularly? Is there a secondary call list for individuals who might not need to be initially involved, but whose expertise will be needed to validate systems after the initial triage? Also, consider that the corporation's attorneys, law enforcement, and regulatory bodies should be notified in most cases.

What is the scope of the breach? A breach could be minor or a worst-case scenario. In recent news, we have seen occasions where thousands of credit card numbers were stolen and cases where only minor intrusions occurred. Until the scope of the breach is determined, it is difficult to formulate an action plan. Determining the scope of the breach should be a top priority.

Consider who should be involved in the discovery process. More than a small number of individuals may actually be detrimental to the process. For example, a breach might have caused the databases to go, or be taken, offline. To perform discovery, the DBAs are using terminal access. If there is limited availability of terminal access, individuals who do not have terminal access may actually deter the work of those who do. Keep this group of individuals as small as possible to effectively do the work required. This is one case where more is not better.

Should the databases be taken offline? If the databases are still online after a breach, this could be a tough decision to make. If data was stolen, taking the database offline could prevent further intrusions, but such an action could also put a company out of business. A good approach is to meet with management prior to the simulation and formulate a list of guidelines that would best serve your organization. In most organizations, attorneys should also be involved in these types of discussions.

PREPARING FOR THE FUTURE

It would be great if each of us knew what security vulnerabilities were coming so that we could be prepared beforehand. Because that is not possible, however, a good plan is to make ourselves as knowledgeable about past and present threats as possible while considering how these threats might morph into even stronger future threats.

When implementing a new database architecture, some normal operations, such as contingency planning, can also be a valuable exercise if applied to potential future threats. Think about contingencies to mitigate today's concerns, as well as potential concerns five years forward. As an example of thinking forward for security, when creating a new database environment, the DB2 DBA may implement Label Based Access Control (LBAC) even though there is no formal requirement to do so. Although extra work might be required, the DBA knows that LBAC is inherently more secure than the official requirement and that database security will benefit now and in the future. The cost/benefit ratio might not be evident now but might be very evident at some point in the future.

Consider planning a formal time, on a quarterly or more frequent basis, to evaluate security as it applies to DB2 and the technical architecture in place in your organization. Approach your personal security education as a lifelong learning experience. Ask questions, read newly released books and articles, study the news, and remain on guard so that the future doesn't catch you unprepared to protect your databases.

LAST WORDS

Security is not just a function of the original setup and installation. It is an ongoing task. It is important that those tasked with database security keep a focus on current security, an understanding of the successes and failures of the past, and an ability to look toward the future. This involves continuous diligence, a curiosity that inspires a commitment to continuous learning cycles, and an approach that includes keeping up with current events and news about security vulnerabilities.


Zamil Janmohamed, Clara Liu, Drew Bradstock, Raul F. Chong, Michael Gao, Fraser McArthur, Paul Yip

DB2 SQL PL
Essential Guide for DB2 UDB on Linux, UNIX, Windows, i5/OS, and z/OS, Second Edition

DB2 SQL PL, Second Edition shows developers how to take advantage of every facet of the SQL PL language and development environment. The authors offer up-to-the-minute coverage, best practices, and tips for building basic SQL procedures, writing flow-of-control statements, creating cursors, handling conditions, and much more. Along the way, they illuminate advanced features ranging from stored procedures and triggers to user-defined functions. The only book to combine practical SQL PL tutorials and a detailed syntax reference, DB2 SQL PL, Second Edition draws on the authors' unparalleled expertise with SQL PL in real business environments.

Coverage includes
Using SQL PL to improve manageability and performance, while clearly separating DBA and development roles
Writing more efficient stored procedures, triggers, user-defined functions (UDFs), and dynamic compound SQL
Identifying SQL PL performance bottlenecks and resolving them
Leveraging new language enhancements for Windows, UNIX, and Linux: improved table function support, session-based locking, nested save points, new prepare options, and more
Using new features for iSeries V5R3: built-in string and date/time manipulation functions, SEQUENCE objects, and more
Utilizing zSeries Version 8's integrated stored procedure debugging and improved SQL condition support
Mastering DB2 Development Center, the unified development environment for creating DB2 stored procedures

Whether you're developing new SQL PL applications, migrating or tuning existing applications, or administering DB2, you'll find this book indispensable.

Hardcover w/CD-ROM, 576 pages

Table of Contents
1. Introduction
2. Basic SQL Procedure Structure
3. Overview of SQL PL Language Elements
4. Using Flow of Control Statements
5. Understanding and Using Cursors and Result Sets
6. Condition Handling
7. Working with Dynamic SQL
8. Nested SQL Procedures
9. User-Defined Functions and Triggers
10. Leveraging DB2 Application Development Features
11. Deploying SQL Procedures, Functions, and Triggers
12. Performance Tuning
13. Best Practices
Appendix A. Getting Started with DB2
Appendix B. Inline SQL PL for DB2 UDB for Linux, UNIX, and Windows
Appendix C. Building from the Command Line
Appendix D. Using the DB2 Development Center
Appendix E. Security Considerations in SQL Procedures
Appendix F. DDL
Appendix G. Additional Resources
Appendix H. Sample Application Code
Index

CHAPTER 3

Overview of SQL PL Language Elements

In this chapter, you will learn
- DB2 data types and the ranges of their values
- How to work with large objects
- How to choose proper data types
- How to work with user-defined data types (UDTs)
- How to manipulate date, time, and string data
- How to use generated columns
- How to work with SEQUENCE objects and IDENTITY columns

Now that you have learned the basic DB2 SQL procedure structure, it is time for an overview of the DB2 SQL PL language elements and their usage before discussing any of the more advanced features of the language. Many of the decisions covered in this chapter, such as the choice of proper data types and the use of SEQUENCE objects, IDENTITY columns, and generated columns, are tasks generally performed during database setup and table creation. The data types chosen for parameters and local variables in SQL procedures, user-defined functions (UDFs), and triggers, which are covered extensively in the rest of the book, mostly need to match the column definitions in your underlying tables.

DB2 Data Types

A data type tells you what kind of data can be saved in a column or in a variable, and how large the value may be. There are two categories of data types in DB2:
- Built-in data types
- User-defined data types

Valid DB2 Built-In Data Types and Their Value Ranges

The built-in data types are provided with DB2. DB2 supports a wide range of data types for your business needs. A summary of the DB2 built-in data types is shown in Figure 3.1.

Figure 3.1 DB2 built-in data types (summarized):
  Numeric
    Integer: SMALLINT, INTEGER, BIGINT (LUW and iSeries only)
    Decimal: DECIMAL, NUMERIC
    Floating point: REAL, DOUBLE
  String
    Character: CHAR, VARCHAR, CLOB
    Graphic (double-byte): GRAPHIC, VARGRAPHIC, DBCLOB
    Binary: BINARY (iSeries only), VARBINARY (iSeries only), BLOB
  Datetime: DATE, TIME, TIMESTAMP
  Row identifier: ROWID (iSeries and zSeries only)
  External: DATALINK (LUW and iSeries only)

NOTE: The LONG VARCHAR and LONG VARGRAPHIC data types are supported in DB2 for LUW for backward compatibility only. They are deprecated, which means that these data types will not be supported in the future. Use VARCHAR and VARGRAPHIC instead.
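As a point of reference for the sections that follow, the fragment below (with hypothetical names, using LUW syntax) declares local variables with several of the built-in types from Figure 3.1:

CREATE PROCEDURE type_demo ()
    LANGUAGE SQL
BEGIN
    DECLARE v_counter INTEGER DEFAULT 0;       -- exact numeric
    DECLARE v_price   DECIMAL(9,2) DEFAULT 0;  -- exact numeric with decimals
    DECLARE v_ratio   DOUBLE;                  -- approximate numeric
    DECLARE v_name    VARCHAR(30);             -- variable-length string
    DECLARE v_created TIMESTAMP;               -- datetime

    SET v_created = CURRENT TIMESTAMP;  -- special register, covered later
    SET v_ratio = v_counter;            -- numeric types cast implicitly
END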

DB2 for iSeries and zSeries supports the ROWID data type. A ROWID is a data type that uniquely identifies a row. A query that uses ROWID navigates directly to the row, because the column implicitly contains the location of the row. When a row is inserted into a table, DB2 generates a value for the ROWID column unless one is supplied. If a value is supplied, it must be a value that was previously generated. The value of a ROWID cannot be updated and does not change, even after table space reorganizations. There can be only one ROWID column in a table.

There are six numeric data types in DB2. Their precisions and value ranges are listed in Table 3.1.

Table 3.1 DB2 Built-In Numeric Data Types

Data Type        Precision (Digits)      Data Value Range
SMALLINT         5                       -32,768 to 32,767
INTEGER          10                      -2,147,483,648 to 2,147,483,647
BIGINT           19                      -9,223,372,036,854,775,808 to
                                         9,223,372,036,854,775,807
DECIMAL/NUMERIC  LUW, zSeries: 31        LUW, zSeries: any value with 31 digits
                 iSeries: 63             or less; iSeries: any value with 63
                                         digits or less
REAL             LUW, iSeries: 24        LUW: smallest REAL value -3.402E+38,
                 zSeries: 21               largest +3.402E+38, smallest positive
                                           +1.175E-37, largest negative -1.175E-37
                                         iSeries: smallest REAL value -3.4E+38,
                                           largest +3.4E+38, smallest positive
                                           +1.18E-38, largest negative -1.18E-38
                                         zSeries: smallest REAL value -7.2E+75,
                                           largest +7.2E+75, smallest positive
                                           +5.4E-79, largest negative -5.4E-79
DOUBLE           53                      LUW: smallest DOUBLE value -1.79769E+308,
                                           largest +1.79769E+308, smallest positive
                                           +2.225E-307, largest negative -2.225E-307
                                         iSeries: smallest DOUBLE value -1.79E+308,
                                           largest +1.79E+308, smallest positive
                                           +2.23E-308, largest negative -2.23E-308
                                         zSeries: smallest DOUBLE value -7.2E+75,
                                           largest +7.2E+75, smallest positive
                                           +5.4E-79, largest negative -5.4E-79

DB2 supports both single-byte and double-byte character strings. DB2 uses 2 bytes to represent each character in double-byte strings. The maximum lengths are listed in Table 3.2.

Table 3.2 DB2 Built-In String Data Types

Data Type                   Maximum Length
CHAR                        LUW: 254 bytes; iSeries: 32,766 bytes; zSeries: 255 bytes
VARCHAR                     LUW: 32,672 bytes; iSeries: 32,740 bytes; zSeries: 32,704 bytes
LONG VARCHAR (LUW only)     32,700 bytes
CLOB                        2,147,483,647 bytes
GRAPHIC                     LUW: 127 characters; iSeries: 16,383 characters; zSeries: 127 characters
VARGRAPHIC                  LUW: 16,336 characters; iSeries: 16,370 characters; zSeries: 16,352 characters
DBCLOB                      1,073,741,823 characters
BINARY (iSeries only)       32,766 bytes
VARBINARY (iSeries only)    32,740 bytes
BLOB                        2,147,483,647 bytes

You can also specify a subtype for string data types. For example, CHAR and VARCHAR columns can be defined as FOR BIT DATA to store binary data. On iSeries, other subtypes can be specified, such as FOR SBCS DATA, FOR DBCS DATA, and CCSID. On zSeries, the other subtypes that can be specified are FOR SBCS DATA and FOR MIXED DATA.

The DB2 date and time data types are DATE, TIME, and TIMESTAMP. The TIMESTAMP data type consists of both a date part and a time part, while the DATE and TIME data types deal only with the date and the time component, respectively. Their limits are listed in Table 3.3.

Table 3.3 DB2 Built-In Date and Time Data Types

Description                 Limit
Smallest DATE value         0001-01-01
Largest DATE value          9999-12-31
Smallest TIME value         00:00:00
Largest TIME value          24:00:00
Smallest TIMESTAMP value    0001-01-01-00.00.00.000000
Largest TIMESTAMP value     9999-12-31-24.00.00.000000

The last data type in Figure 3.1, DATALINK, is used to work with files stored outside the database. It is not covered in this book.

Large Objects

Large Object (LOB) data types are used to store data greater than 32KB, such as long XML documents, audio files, or pictures (up to 2GB). Three kinds of LOB data types are provided by DB2:
- Binary Large Objects (BLOBs)
- Single-byte Character Large Objects (CLOBs)
- Double-byte Character Large Objects (DBCLOBs)

You will need to take some performance considerations into account when dealing with LOBs. Refer to Chapter 12, "Performance Tuning," for more details. LOBs can be used as parameters and local variables of SQL procedures. Figure 3.2 demonstrates a very simple usage of LOBs; the procedure returns a CLOB to the stored procedure caller.

CREATE PROCEDURE staffresume ( IN p_empno CHAR(6)
                             , OUT p_resume CLOB(1M) )
    LANGUAGE SQL
    SPECIFIC staffresume          -- applies to LUW and iSeries
    -- WLM ENVIRONMENT <env>      -- applies to zSeries
BEGIN
    SELECT resume
      INTO p_resume
      FROM emp_resume
     WHERE empno = p_empno
       AND resume_format = 'ascii';

    INSERT INTO emp_resume ( empno
                           , resume_format

                           , resume )
    VALUES ( p_empno
           , 'backupcopy'
           , p_resume );
END

Figure 3.2 SQL procedure STAFFRESUME.

Choosing Proper Data Types

Choosing the correct data type is a simple and yet important task. Specifying the wrong data type may result in not only wasted disk space but also poor performance. To choose the correct data type, you need to fully understand your data, its possible values, and its usage. Table 3.4 offers a checklist for data type selection.

Table 3.4 Simple Data Type Checklist

Question                                                  Data Type
Is the string data variable in length?                    VARCHAR
If the string data is variable in length,
what is the maximum length?                               VARCHAR
Do you need to sort (order) the data?                     CHAR, VARCHAR, NUMERIC
Is the data going to be used in arithmetic operations?    DECIMAL, NUMERIC, REAL, DOUBLE,
                                                          BIGINT, INTEGER, SMALLINT
Does the data element contain decimals?                   DECIMAL, NUMERIC, REAL, DOUBLE
Is the data fixed in length?                              CHAR
Does the data have a specific meaning
(beyond DB2 base data types)?                             USER-DEFINED TYPE
Is the data larger than what a character string can
store, or do you need to store non-traditional data?      CLOB, BLOB, DBCLOB

TIP: Unnecessary casting can cost performance. Try to define the variables in your SQL procedures with the same data types as the underlying table columns.

REAL, FLOAT, and DOUBLE are imprecise data types in which rounding may occur. You should not use these data types for storing precise data, such as primary key values or currency data.
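To make the tip about matching data types concrete, here is a minimal sketch. The employee table, its columns, and the procedure name are all hypothetical; the point is simply that each parameter mirrors the data type of the column it touches, and DECIMAL rather than DOUBLE is used for currency, so no implicit casting or rounding occurs:

CREATE TABLE employee ( empno  CHAR(6) NOT NULL
                      , name   VARCHAR(30)
                      , salary DECIMAL(9,2) )   -- exact type for currency

CREATE PROCEDURE upd_salary ( IN p_empno  CHAR(6)        -- matches employee.empno
                            , IN p_salary DECIMAL(9,2) ) -- matches employee.salary
    LANGUAGE SQL
BEGIN
    UPDATE employee
       SET salary = p_salary
     WHERE empno = p_empno;
END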

Working with User-Defined Distinct Types

User-defined distinct types are simple user-defined data types (UDTs) that are defined on existing DB2 data types. DB2 also supports other kinds of UDTs, which are beyond the scope of this book; in this book, UDT refers only to the user-defined distinct type. UDTs can be used to give your data semantic meaning. The syntax for creating UDTs is shown in Figure 3.3.

>>-CREATE DISTINCT TYPE--distinct-type-name--AS----------------->
>--source-data-type--WITH COMPARISONS--------------------------><

Figure 3.3 CREATE DISTINCT TYPE syntax.

The source-data-type can be any DB2 built-in data type discussed in this chapter. The WITH COMPARISONS clause allows you to use the system-provided operators for the source data type on your UDTs. The WITH COMPARISONS clause is not allowed with the BLOB, CLOB, DBCLOB, LONG VARCHAR, LONG VARGRAPHIC, or DATALINK source data types.

NOTE: In DB2 UDB for iSeries and zSeries, the WITH COMPARISONS clause is optional. Comparison operator functions will be created for all allowed source data types except DATALINK.

You can use UDTs to enforce your business rules and prevent different data from being used improperly, because DB2 SQL PL enforces strong data typing. Strong data typing requires explicit casting when comparing different data types, because the data types are not implicitly cast.

To show you an example, suppose you define the two following variables:

DECLARE v_in_mile DOUBLE;
DECLARE v_in_kilometer DOUBLE;

Nothing will prevent you from performing incorrect operations such as

IF (v_in_mile > v_in_kilometer)

This operation is meaningless, because you cannot compare miles with kilometers without converting one of them first. But DB2 is unable to tell this. To DB2, both variables are floating-point numbers, and it is perfectly normal to add them or directly compare them. UDTs can be used to prevent such mistakes.

You can create two new data types, miles and kilometers:

CREATE DISTINCT TYPE miles AS DOUBLE WITH COMPARISONS;
CREATE DISTINCT TYPE kilometers AS DOUBLE WITH COMPARISONS;

Then you can declare your variables using the UDTs instead:

DECLARE v_in_mile miles;
DECLARE v_in_kilometer kilometers;

Now you will receive an SQL error

SQL0401N The data types of the operands for the operation ">" are not compatible. LINE NUMBER=7. SQLSTATE=42818

if you try to execute the same statement:

IF (v_in_mile > v_in_kilometer)

If this error is somewhat expected, you might be surprised to learn that the following statement will also result in the same SQL error:

IF (v_in_mile > 30.0)

What is happening here? The answer is that DB2 requires you to explicitly cast both the DOUBLE and kilometers data types to the miles data type. When you create a user-defined distinct data type, DB2 generates two casting functions for you: one to cast from the UDT to the source data type, and another to cast back. In this example, the miles UDT comes with these two functions:

MILES (DOUBLE)
DOUBLE (MILES)

Similarly, you have these two functions for the kilometers UDT:

KILOMETERS (DOUBLE)
DOUBLE (KILOMETERS)

In order for the two failing statements to work, they need to be rewritten using the casting functions:

IF (v_in_mile > MILES(DOUBLE(v_in_kilometer)/1.6))
IF (v_in_mile > miles(30.0))

You have to cast v_in_kilometer twice because there is no casting function between miles and kilometers unless you create one manually. The division by 1.6 converts kilometers into miles.

Data Manipulation

DB2 provides much built-in support for data manipulation. Because of the complexity involved with manipulating date, time, and string data, it is particularly important to understand how to use the system-provided features for these data types.

Working with Dates and Times

Dates and times are the data types that differ the most among database management systems (DBMSs). This section shows you examples of some basic date and time manipulations.

You can get the current date, time, and timestamp by using the appropriate DB2 special registers:

SELECT CURRENT DATE FROM SYSIBM.SYSDUMMY1;
SELECT CURRENT TIME FROM SYSIBM.SYSDUMMY1;
SELECT CURRENT TIMESTAMP FROM SYSIBM.SYSDUMMY1;

CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP are three DB2 special registers. Another useful DB2 special register for date and time operations is CURRENT TIMEZONE. You can use it to adjust CURRENT TIME or CURRENT TIMESTAMP to GMT/CUT. All you need to do is subtract the CURRENT TIMEZONE register from CURRENT TIME or CURRENT TIMESTAMP:

SELECT CURRENT TIME - CURRENT TIMEZONE FROM SYSIBM.SYSDUMMY1;
SELECT CURRENT TIMESTAMP - CURRENT TIMEZONE FROM SYSIBM.SYSDUMMY1;

Given a date, time, or timestamp, you can extract (where applicable) the year, month, day, hour, minutes, seconds, and microseconds portions independently using the appropriate function:

SELECT YEAR (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;
SELECT MONTH (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;
SELECT DAY (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;
SELECT HOUR (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;
SELECT MINUTE (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;
SELECT SECOND (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;
SELECT MICROSECOND (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;

You can also extract the date and time independently from a timestamp:

SELECT DATE (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;
SELECT TIME (CURRENT TIMESTAMP) FROM SYSIBM.SYSDUMMY1;

Date and time calculations are very straightforward:

SELECT CURRENT DATE + 1 YEAR FROM SYSIBM.SYSDUMMY1;
SELECT CURRENT DATE + 3 YEARS + 2 MONTHS + 15 DAYS FROM SYSIBM.SYSDUMMY1;
SELECT CURRENT TIME + 5 HOURS - 3 MINUTES + 10 SECONDS FROM SYSIBM.SYSDUMMY1;

DB2 also provides many date and time functions for easy manipulation of date and time data. For a complete list, refer to the SQL Reference for your platform. A few date and time functions are used here as examples to show how you can work with date and time data in DB2. To calculate how many days there are between two dates, you can subtract dates as in the following:

SELECT DAYS (CURRENT DATE) - DAYS (DATE('2002-01-01')) FROM SYSIBM.SYSDUMMY1;

If you want to concatenate date or time values with other text, you need to convert the value into a character string first. To do this, you can simply use the CHAR function:

SELECT CHAR(CURRENT DATE) FROM SYSIBM.SYSDUMMY1;
SELECT CHAR(CURRENT TIME) FROM SYSIBM.SYSDUMMY1;
SELECT CHAR(CURRENT TIME + 12 HOURS) FROM SYSIBM.SYSDUMMY1;

To convert a character string to a date or time value, you can use:

SELECT TIMESTAMP ('2002-10-20-12.00.00') FROM SYSIBM.SYSDUMMY1;
SELECT TIMESTAMP ('2002-10-20 12:00:00') FROM SYSIBM.SYSDUMMY1;  -- For LUW, zSeries
--SELECT TIMESTAMP '2002-10-20 12:00:00' FROM SYSIBM.SYSDUMMY1;  -- For iSeries
SELECT DATE ('2002-10-20') FROM SYSIBM.SYSDUMMY1;
SELECT DATE ('10/20/2002') FROM SYSIBM.SYSDUMMY1;
SELECT TIME ('12:00:00') FROM SYSIBM.SYSDUMMY1;
SELECT TIME ('12.00.00') FROM SYSIBM.SYSDUMMY1;

Working with Strings

String manipulation is relatively easy compared with manipulating dates and timestamps. Again, DB2 built-in functions are heavily used. A few of them are shown in this section to illustrate how DB2 string operations work; for a complete list, refer to the SQL Reference for your platform.

You can use either the CONCAT function or the || operator for string concatenation. The following two statements are exactly the same:

SELECT CONCAT('ABC', 'DEF') FROM SYSIBM.SYSDUMMY1;
SELECT 'ABC' || 'DEF' FROM SYSIBM.SYSDUMMY1;

However, when you have more than two strings to concatenate, the || operator is much easier to use.

You might have to use the UPPER or LOWER function in string comparisons if you want the comparison to be case-insensitive, because DB2 string comparison is case-sensitive.

COALESCE is another frequently used function. It returns the first argument that is not null. In your application, if you have the following query

SELECT COALESCE(c1, c2, 'ABC') FROM t1;

assuming the c1 and c2 columns of table T1 are both nullable character strings, you will receive the value of c1 if it is not null. If c1 is null, you will receive the value of c2 if it is not null. If both c1 and c2 contain null values, you will receive the string 'ABC' instead.

Working with Generated Columns

LUW allows a column to be declared as a generated column. A generated column derives the value for each row from an expression and is used to embed your business logic into the table definition. The syntax for generated columns is shown in Figure 3.4. Generated columns have to be defined with either the CREATE TABLE or ALTER TABLE statement.

---column-name--+-------------+------------------------------------->
                '--data-type--'
>--GENERATED--+-ALWAYS-----+--AS--(--generation-expression--)-------><
              '-BY DEFAULT-'

Figure 3.4 Generated column syntax for LUW.

Values are generated for the column when a row is inserted into the table. Two options are supported, namely GENERATED ALWAYS and GENERATED BY DEFAULT. For a GENERATED ALWAYS column, DB2 has full control over the values generated. An error will be raised if an explicit value is specified. On the other hand, with the GENERATED BY DEFAULT option, DB2 only generates a value for the column when no value is specified at the time of insert. Figure 3.5 shows an example of a table using a generated column.

CREATE TABLE payroll ( employee_id INT NOT NULL
                     , base_salary DOUBLE
                     , bonus DOUBLE
                     , commission DOUBLE
                     , total_pay DOUBLE GENERATED ALWAYS AS
                         (base_salary*(1+bonus) + commission) )

Figure 3.5 An example of generated columns using a simple expression for LUW.

In this example, there is a table named payroll. Three columns are related to an employee's total pay, namely base_salary, bonus, and commission. The base_salary and commission are in dollars, and the bonus is a percentage of the base_salary. The total_pay is calculated from these three numbers. The benefit of using a generated column here is that the calculation is performed before query time and the calculated value is saved in the column. If your application has to use the value frequently, using the generated column will obviously improve performance.

NOTE: An alternative to using generated columns (applicable to all platforms) is presented in Figures 3.21 and 3.22 toward the end of this chapter.

To insert a record into the payroll table, you can use the DEFAULT keyword, as in

INSERT INTO payroll VALUES (1, 100, 0.1, 20, DEFAULT);

You could also leave the generated column out of the column list:

INSERT INTO payroll (employee_id, base_salary, bonus, commission)
VALUES (1, 100, 0.1, 20);

Both will generate the same result. Because the column is defined as GENERATED ALWAYS, you cannot supply a real value for the total_pay column. If not all columns are specified in the INSERT statement, DB2 will automatically substitute default values according to the column definitions. It is a good practice to specify all the columns defined in the table and their associated values; this allows you to easily identify a mismatched or missing column name or value. Notice how the reserved word DEFAULT is used so that DB2 will supply the value for the generated column.

NOTE: DEFAULT is a DB2 reserved word. It is mandatory if a GENERATED ALWAYS column name is specified in an INSERT statement. Specifying all values for the columns in the VALUES clause of an INSERT statement is good practice because it gives a clear view of what values are being inserted.

The generation expression in Figure 3.5 is a very simple arithmetic formula. More logic can be built into it by using a CASE expression. The CASE statement will be discussed in detail in Chapter 4, "Using Flow of Control Statements." For now, it is sufficient to know that a CASE expression checks conditions and chooses which result to use depending on the outcome. In the next example, the company has decided that each employee will be either a bonus employee or a commission employee, but not both. A bonus employee receives a base salary and a bonus. A

commission employee receives a base salary and a commission. A more complex table definition is shown in Figure 3.6.

CREATE TABLE payroll2 ( employee_id INT NOT NULL
                      , employee_type CHAR(1) NOT NULL
                      , base_salary DOUBLE
                      , bonus DOUBLE
                      , commission DOUBLE
                      , total_pay DOUBLE GENERATED ALWAYS AS
                          ( CASE employee_type
                              WHEN 'B' THEN base_salary*(1+bonus)
                              WHEN 'C' THEN (base_salary + commission)
                              ELSE 0
                            END ) )

Figure 3.6 An example of generated columns using a CASE expression for LUW.

When the total pay is calculated, the employee type is checked first. If the type is B, indicating a bonus employee, the total pay is the base salary plus the bonus. If the type is C, indicating a commission employee, the total pay is calculated by adding the base salary and the commission. If a wrong employee type is entered, the total pay is set to 0, indicating a problem.

Working with Identity Columns and Sequence Objects

Numeric generation is a very common requirement for many types of applications, such as the generation of new employee numbers, purchase order numbers, ticket numbers, and so on. In a heavy online transaction processing (OLTP) environment with a high number of concurrent users, using database tables and user-defined programmatic increment methods usually degrades performance. The reason is that the database system has to lock a table row when a value is requested, to guarantee that no duplicate values are used. Locks are discussed in more detail in Chapter 5, "Understanding and Using Cursors and Result Sets."

Instead of relying on your own methods for generating unique IDs, you can make use of the facilities provided by DB2. DB2 provides two mechanisms to implement such sets of numbers: identity columns and sequence objects. As you explore the usage of identity columns and sequence objects, you will see that both of them achieve basically the same goal: automatically generating numeric values. Their behaviors can be tailored by using different options to meet specific application needs. Although they are created and used differently, DB2 treats both of them as sequences. An identity column is a system-defined sequence, and a sequence object is a user-defined sequence; a minimal sketch contrasting the two follows.
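In the sketch below, the orders tables and the order_seq name are hypothetical, and the identity column options are explored in far more detail in the pages that follow; the point is simply to show the two mechanisms side by side:

-- Identity column: the sequence is defined as part of the table.
CREATE TABLE orders_a ( order_id INTEGER NOT NULL
                          GENERATED ALWAYS AS IDENTITY
                          ( START WITH 1, INCREMENT BY 1 )
                      , item VARCHAR(30) );

INSERT INTO orders_a (item) VALUES ('widget');   -- order_id generated by DB2

-- Sequence object: the sequence lives on its own and is referenced explicitly.
CREATE SEQUENCE order_seq AS INTEGER
    START WITH 1 INCREMENT BY 1 NO CYCLE CACHE 20;

CREATE TABLE orders_b ( order_id INTEGER NOT NULL
                      , item VARCHAR(30) );

INSERT INTO orders_b (order_id, item)
    VALUES (NEXT VALUE FOR order_seq, 'widget');

One practical difference: a sequence object can feed generated numbers to several tables, or be used outside an INSERT altogether, while an identity column is bound to exactly one column of one table.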

A few SQL procedure examples will be used to demonstrate how to work with automatic numbering in DB2. To better illustrate the usage, some of the procedures use DB2 SQL PL features covered in the following chapters.

Identity Column

An identity column is a numeric column defined in a table for which the column values can be generated automatically by DB2. The definition of an identity column is specified at table creation time. Existing tables cannot be altered to add or drop an identity column. Figure 3.7 shows the syntax of the identity column clause used in a CREATE TABLE statement. Only one column in a table can be defined as an identity column.

---column-name--+-------------+-------------------------------------->
                '--data-type--'
>--GENERATED--+-ALWAYS-----+--AS IDENTITY--+---------------------------+--><
              '-BY DEFAULT-'               '-(--identity-attributes--)-'

identity-attributes:
  START WITH numeric-constant
  INCREMENT BY numeric-constant          (default: 1)
  NO MINVALUE | MINVALUE numeric-constant
  NO MAXVALUE | MAXVALUE numeric-constant
  NO CYCLE | CYCLE                       (default: NO CYCLE)
  CACHE integer-constant | NO CACHE      (default: CACHE 20)
  NO ORDER | ORDER                       (default: NO ORDER)

Figure 3.7 Syntax of the identity column clause.

Data types for identity columns can be any exact numeric data type with a scale of zero, such as SMALLINT, INTEGER, BIGINT, or DECIMAL. Single- and double-precision floating-point data types

are considered approximate numeric data types, and they cannot be used as identity columns.

NOTE: In zSeries, any column defined with the ROWID data type defaults to GENERATED ALWAYS.

Within the IDENTITY clause, you can set a number of options to customize the behavior of an identity column. Before discussing these options, let's look at Figure 3.8 to see how a table can be created with an identity column.

CREATE TABLE service_rq
  ( rqid SMALLINT NOT NULL
      CONSTRAINT rqid_pk PRIMARY KEY                                       -- (1)
  , status VARCHAR(10) NOT NULL WITH DEFAULT 'NEW'
      CHECK ( status IN ( 'NEW', 'ASSIGNED', 'PENDING', 'CANCELLED' ) )    -- (2)
  , rq_desktop CHAR(1) NOT NULL WITH DEFAULT 'N'
      CHECK ( rq_desktop IN ( 'Y', 'N' ) )                                 -- (3)
  , rq_ipaddress CHAR(1) NOT NULL WITH DEFAULT 'N'
      CHECK ( rq_ipaddress IN ( 'Y', 'N' ) )                               -- (4)
  , rq_unixid CHAR(1) NOT NULL WITH DEFAULT 'N'
      CHECK ( rq_unixid IN ( 'Y', 'N' ) )                                  -- (5)
  , staffid INTEGER NOT NULL
  , techid INTEGER
  , accum_rqnum INTEGER NOT NULL                                           -- (6)
      GENERATED ALWAYS AS IDENTITY ( START WITH 1, INCREMENT BY 1, CACHE 10 )
  , comment VARCHAR(100) )

Figure 3.8 Example of a table definition with an identity column.

Figure 3.8 is the definition of a table called service_rq, which will be used in a later sample. The service_rq table contains an identity column called accum_rqnum, shown in Line (6). Note that the GENERATED ALWAYS option is specified, and therefore DB2 will always generate a unique integer. The value of accum_rqnum will start at 1 and increment by 1.

From examining the other column definitions (2, 3, 4, and 5), you will see that some are defined with a CHECK constraint so that only the specified values are allowed as column values. A primary key is also defined for this table, as shown in Line (1).

NOTE
For LUW and iSeries, a unique index is automatically created when a primary key is defined, if one does not exist already. On zSeries, you need to explicitly create the unique index associated with a primary key declaration, for example:

CREATE UNIQUE INDEX rqid_pk ON service_rq (rqid);  -- zSeries only

Figures 3.9 and 3.10 show two different ways to insert a record into the service_rq table.

INSERT INTO service_rq
  ( rqid, rq_desktop, rq_ipaddress, rq_unixid, staffid, comment )
VALUES
  ( 1, 'Y', 'Y', 'Y', 10, 'First request for staff id 10' )

Figure 3.9 First method of inserting into a table with an identity column.

INSERT INTO service_rq
  ( rqid, status, rq_desktop, rq_ipaddress, rq_unixid, staffid, techid, accum_rqnum, comment )

VALUES
  ( 2
  , DEFAULT          -- (1)
  , 'Y'
  , 'Y'
  , 'Y'
  , 10
  , NULL
  , DEFAULT          -- (2)
  , 'Second request for staff id 10' )

Figure 3.10 Second method of inserting into a table with an identity column.

The use of the DEFAULT keyword in Figure 3.10 is the same as what was discussed in the generated columns section.

As shown in Figure 3.7, a few other options are available when defining the identity attribute. The START WITH option indicates the first value of the identity column and can be a positive or negative value. Identity values can be generated in ascending or descending order, controlled by the INCREMENT BY clause. The default behavior is to increment by 1 (and therefore ascending). The MINVALUE and MAXVALUE options allow you to specify the lower and upper limits of the generated values. These values must be within the limits of the data type. If the minimum or maximum limit has been reached, you can use CYCLE to recycle the generated values starting from the minimum or maximum value governed by the MINVALUE and MAXVALUE options.

The CACHE option can be used to provide better performance. Without caching (that is, with the NO CACHE option), DB2 will issue a database access request every time the next value is requested. Performance can be degraded if the insert rate on a table with an identity column is heavy. To minimize this synchronous effect, specify the CACHE option so that a block of values is obtained and stored in memory to serve subsequent identity value generation requests. When all the cached values in memory are used, the next block of values will be obtained. In the example shown in Figure 3.8, 10 values are generated and stored in the memory cache. When applications request a value, it is obtained from the cache rather than from the system tables stored on disk. If DB2 is stopped before all cached values are used, any unused cached values are discarded. After DB2 is restarted, the next block of values is generated and cached, introducing gaps between values. If your application does not allow value gaps, use the NO CACHE option instead of the default value of CACHE 20.

Generated Value Retrieval

It is often useful to be able to use the identity value previously generated by DB2 in subsequent application logic. The generated value can be obtained by executing the function IDENTITY_VAL_LOCAL within the same session as the INSERT statement; otherwise, NULL is returned. The function does not

take any parameters. Figure 3.11 demonstrates two different ways to use the IDENTITY_VAL_LOCAL function.

CREATE PROCEDURE addnewrq ( IN  p_rqid        SMALLINT
                          , IN  p_staffid     INTEGER
                          , IN  p_comment     VARCHAR(100)
                          , OUT p_accum_rqnum INTEGER )
LANGUAGE SQL
SPECIFIC addnewrq                 -- applies to LUW and iSeries
-- WLM ENVIRONMENT <env>          -- applies to zSeries
BEGIN
  INSERT INTO service_rq
    ( rqid, status, rq_desktop, rq_ipaddress, rq_unixid, staffid, techid, accum_rqnum, comment )
  VALUES
    ( p_rqid, DEFAULT, 'Y', 'Y', 'Y', p_staffid, NULL, DEFAULT, p_comment );

  SELECT                          -- (1)
    identity_val_local()
  INTO p_accum_rqnum
  FROM sysibm.sysdummy1;

  VALUES                          -- (2)
    identity_val_local()
  INTO p_accum_rqnum;
END

Figure 3.11 Example of using IDENTITY_VAL_LOCAL.

In Figure 3.11, procedure addnewrq uses two ways to obtain the value just inserted into service_rq. On Line (1), it uses the SYSIBM.SYSDUMMY1 table. Another method is to instead use the VALUES clause shown in Line (2). If you call the procedure multiple times with the same rqid value, as in

CALL addnewrq( 3, 1050, 'New Request', ? )

you receive an error with an SQLSTATE indicating that a unique constraint was violated, because the rqid column is defined as a primary key in Figure 3.8. Note that the result of IDENTITY_VAL_LOCAL keeps increasing even though the INSERT statement fails. This indicates that once an identity value is assigned by DB2, it will not be reused, regardless of the success or failure of the previous INSERT statement.

Notice that the example in Figure 3.11 involves only a single-row insert. If the statement inserts multiple rows prior to the execution of IDENTITY_VAL_LOCAL, it will not return the last value generated; it will return NULL.

NOTE
In DB2 UDB for iSeries, the IDENTITY_VAL_LOCAL function does not return a null value after multi-row inserts. It will return the last value that was generated.

Consider the example in Figure 3.12.

CREATE PROCEDURE insert_multirow ( OUT p_id_generated INTEGER )
LANGUAGE SQL
SPECIFIC insrt_multirow           -- applies to LUW and iSeries
BEGIN
  INSERT INTO service_rq          -- (1)
    ( rqid, staffid, accum_rqnum, comment )
  VALUES
    ( 30000, 1050, DEFAULT, 'INSERT1')   -- (2)
  , ( 30001, 1050, DEFAULT, 'INSERT2')   -- (3)
  ;

  VALUES                          -- (4)
    identity_val_local()
  INTO p_id_generated;

  -- For cleanup purposes
  DELETE FROM service_rq WHERE rqid = 30000 OR rqid = 30001;
END

Figure 3.12 Example of a multi-row insert before IDENTITY_VAL_LOCAL for LUW and iSeries.

Two sets of values, on Lines (2) and (3), are inserted with a single INSERT statement, separated by a comma. The output parameter p_id_generated is assigned the result of the IDENTITY_VAL_LOCAL function at Line (4). Successfully calling insert_multirow will give you the following on LUW:

P_ID_GENERATED: NULL
"INSERT_MULTIROW" RETURN_STATUS: "0"

On iSeries, the result would be something similar to

Output Parameter #1 = 5
Statement ran successfully (761 ms)

NOTE
In DB2 for zSeries, the example shown in Figure 3.12 will not work because inserting two rows with one INSERT statement as shown is not supported. To insert multiple rows with one statement, use the INSERT INTO <table1> (<columns>) SELECT <columns> FROM <table2> statement; however, the IDENTITY_VAL_LOCAL function is not supported with this statement. Another alternative for inserting multiple rows in zSeries is using dynamic SQL and host variable arrays.

Change of Identity Column Characteristics

Because an identity column is part of a table definition, to reset or change a characteristic of an identity column you need to issue an ALTER TABLE statement, as shown in Figure 3.13.

ALTER TABLE table-name
  ALTER [ COLUMN ] column-name
      SET INCREMENT BY numeric-constant
    | SET { NO MINVALUE | MINVALUE numeric-constant }
    | SET { NO MAXVALUE | MAXVALUE numeric-constant }
    | SET { NO CYCLE | CYCLE }
    | SET { NO CACHE | CACHE integer-constant }

    | SET { NO ORDER | ORDER }
    | RESTART [ WITH numeric-constant ]

Figure 3.13 Syntax of altering identity column characteristics.

Except for the RESTART option (which has not yet been introduced), the options listed in Figure 3.13 behave exactly as described earlier in this chapter. If you want the identity column to be restarted at a specific value at any time, you will find the RESTART option very useful. Simply alter the table, provide the RESTART WITH clause, and explicitly specify a numeric constant.

Sequence Object

A sequence is a database object that allows automatic generation of values. Unlike an identity column, which is bound to a specific table, a sequence is a global, stand-alone object that can be used by any table in the same database. The same sequence object can be used for one or more tables. Figure 3.14 lists the syntax for creating a sequence object.

CREATE SEQUENCE sequence-name
  [ AS data-type ]                              -- default: AS INTEGER
  [ START WITH numeric-constant ]
  [ INCREMENT BY numeric-constant ]             -- default: INCREMENT BY 1
  [ NO MINVALUE | MINVALUE numeric-constant ]
  [ NO MAXVALUE | MAXVALUE numeric-constant ]
  [ NO CYCLE | CYCLE ]                          -- default: NO CYCLE
  [ CACHE integer-constant | NO CACHE ]         -- default: CACHE 20
  [ NO ORDER | ORDER ]                          -- default: NO ORDER

Figure 3.14 Syntax of the CREATE SEQUENCE statement.

As with identity columns, any exact numeric data type with a scale of zero can be used for the sequence value. These include SMALLINT, INTEGER, BIGINT, and DECIMAL. In addition, any user-defined distinct type based on one of these data types can hold sequence values. This extends the usage of user-defined distinct types in an application.

You may have noticed that the options supported for sequence objects are the same as the ones for identity columns. Refer to the previous subsection for their descriptions.

CREATE SEQUENCE staff_seq AS INTEGER
  START WITH 360
  INCREMENT BY 10
  NO MAXVALUE
  NO CYCLE
  NO CACHE

Figure 3.15 Example of sequence staff_seq.

CREATE SEQUENCE service_rq_seq AS SMALLINT
  START WITH 1
  INCREMENT BY 1
  MAXVALUE 5000
  NO CYCLE
  CACHE 50

Figure 3.16 Example of sequence service_rq_seq.

Figures 3.15 and 3.16 show the creation of two sequence objects. For example, the sequence staff_seq is used to provide a numeric ID for each staff member. It is declared as an INTEGER, starts at 360, is incremented by 10, and has no maximum value explicitly specified; it is implicitly bound by the limit of the data type. In this example, values generated are within the limit of the INTEGER data type. The NO CYCLE option indicates that if the maximum value is reached, an error will be returned with an SQLSTATE indicating that the values for the sequence have been exhausted.

The second sequence object, shown in Figure 3.16, is defined as SMALLINT and is used to generate ticket numbers for service requests. This sequence object will start at 1 and increment by 1. Because NO CYCLE is specified, the maximum value generated will be 5000. The CACHE 50 option indicates that DB2 will acquire and cache 50 values at a time for application use. As with identity columns, if DB2 is stopped while sequence values were cached, gaps in the sequence values may result.
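As a minimal hypothetical sketch of how service_rq_seq might be used (the NEXT VALUE expression is covered under Generated Value Retrieval below), the sequence could supply the ticket number for a new service request:

INSERT INTO service_rq ( rqid, staffid, comment )
VALUES ( NEXT VALUE FOR service_rq_seq, 1050, 'Ticket number drawn from a sequence' )

Because the sequence is a stand-alone object, the same pattern works for inserts into any other table in the database.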

Change of Sequence Object Characteristics

At any time, you can either drop and re-create the sequence object or alter the sequence to change its behavior. Figures 3.17 and 3.18 show the syntax of the ALTER SEQUENCE and DROP SEQUENCE statements, respectively.

ALTER SEQUENCE sequence-name
    RESTART [ WITH numeric-constant ]
  | INCREMENT BY numeric-constant
  | { MINVALUE numeric-constant | NO MINVALUE }
  | { MAXVALUE numeric-constant | NO MAXVALUE }
  | { CYCLE | NO CYCLE }
  | { CACHE integer-constant | NO CACHE }
  | { ORDER | NO ORDER }

Figure 3.17 Syntax of the ALTER SEQUENCE statement.

NOTE
In DB2 UDB for iSeries, the ALTER SEQUENCE statement also allows you to change the data type of the sequence, in addition to the options listed in the syntax diagram in Figure 3.17.

DROP SEQUENCE sequence-name [ RESTRICT ]

Figure 3.18 Syntax of the DROP SEQUENCE statement.

Privileges Required for Using Sequence Objects

Just like other database objects in DB2, manipulation of sequence objects is controlled by privileges. By default, only the sequence creator or a user with administrative authorities (such as SYSADM and DBADM on LUW) holds the ALTER and USAGE privileges on the object. If you want other users to be able to use the sequence, you need to issue the following:

GRANT USAGE ON SEQUENCE <sequence_object_name> TO PUBLIC
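For example, to let all users draw values from the staff_seq sequence created in Figure 3.15, and to let a hypothetical user dbuser1 alter it:

GRANT USAGE ON SEQUENCE staff_seq TO PUBLIC;
GRANT ALTER ON SEQUENCE staff_seq TO USER dbuser1;  -- dbuser1 is a made-up user ID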

The USAGE and ALTER privileges can be granted to PUBLIC or to any individual user or group.

Generated Value Retrieval

Two expressions, NEXT VALUE and PREVIOUS VALUE, are provided to generate and retrieve a sequence value. Figure 3.19 is an example of their usage. Two alternate expressions, NEXTVAL and PREVVAL, can be used interchangeably with NEXT VALUE and PREVIOUS VALUE, respectively, for backward compatibility reasons.

CREATE PROCEDURE seqexp ( OUT p_prevval1 INT
                        , OUT p_nextval1 INT
                        , OUT p_nextval2 INT
                        , OUT p_prevval2 INT )
LANGUAGE SQL
SPECIFIC seqexp                   -- applies to LUW and iSeries
-- WLM ENVIRONMENT <env>          -- applies to zSeries
BEGIN
  -- DECLARE host variables
  DECLARE v_prevstaffno INT;

  -- Procedure logic
  INSERT INTO staff ( id, name, dept, job, years, salary, comm )
  VALUES ( NEXT VALUE FOR staff_seq, 'Bush', 55, 'Mgr', 30, NULL, NULL );

  UPDATE staff
     SET id = ( NEXT VALUE FOR staff_seq )
   WHERE name = 'Bush';

  VALUES PREVIOUS VALUE FOR staff_seq INTO v_prevstaffno;   -- (1)

  DELETE FROM staff WHERE id = v_prevstaffno;               -- (2)

  VALUES                                                    -- (3)
    ( PREVIOUS VALUE FOR staff_seq

    , NEXT VALUE FOR staff_seq
    , NEXT VALUE FOR staff_seq
    , PREVIOUS VALUE FOR staff_seq )
  INTO p_prevval1, p_nextval1, p_nextval2, p_prevval2;
END

Figure 3.19 Usage of the NEXT VALUE and PREVIOUS VALUE expressions.

You can use the NEXT VALUE and PREVIOUS VALUE expressions in SELECT, VALUES, INSERT, and UPDATE statements. In Figure 3.19 on Line (2), the DELETE statement needs to reference the value just generated in its WHERE clause. Because NEXT VALUE and PREVIOUS VALUE cannot be used in a WHERE clause, you need to use two separate SQL statements. You can use a VALUES INTO statement to obtain and store the generated value in a variable, v_prevstaffno. The DELETE statement can then specify the variable in its WHERE clause.

The last VALUES statement, on Line (3), shows that if more than one sequence expression for a single sequence object is used in a statement, DB2 will execute NEXT VALUE and PREVIOUS VALUE only once each. In the example, assuming the value last generated is 500, the statement on Line (3) will have the result 500, 510, 510, 500. For more examples of how to use sequence objects in your stored procedures, refer to Chapter 10, Leveraging DB2 Application Development Features.

Platform Portability Considerations

Database triggers can be used to achieve the same results as generated columns. Triggers are discussed in greater detail in Chapter 9, User-Defined Functions and Triggers. This section presents an alternative to the generated column example shown in Figure 3.5. Figure 3.20 shows the alternate table creation script.

CREATE TABLE payroll
  ( employee_id INT NOT NULL
  , base_salary DOUBLE
  , bonus       DOUBLE
  , commission  DOUBLE
  , total_pay   DOUBLE
  );

Figure 3.20 Table creation for an alternative to a generated column.

The column total_pay needs to be generated based on the values in the base_salary, bonus, and commission columns. Figure 3.21 shows the two triggers required to support this.

CREATE TRIGGER bi_genpayroll
  NO CASCADE BEFORE INSERT ON payroll
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
    SET n.total_pay = n.base_salary * (1 + n.bonus) + n.commission;

CREATE TRIGGER bu_genpayroll
  NO CASCADE BEFORE UPDATE OF base_salary, bonus, commission ON payroll
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
    SET n.total_pay = n.base_salary * (1 + n.bonus) + n.commission;

Figure 3.21 Triggers for generated column logic.

Summary

In this chapter, DB2 data types were discussed. You learned about all the DB2 built-in data types and their valid values, enabling you to choose the right data type for your SQL procedure development. LOB data, date and time data, and string data and their manipulations were further demonstrated with examples. The DB2 user-defined distinct type (UDT) was also introduced; you can use UDTs to gain better control over the use of your data.

Generated columns (for LUW), identity columns, and sequence objects were also covered. The values of generated columns are calculated and generated for you automatically by DB2. Sequence objects and identity columns are used to implement automatically incrementing sequential numbers. Even though these features are normally defined at database setup time, you may still have to work with them in SQL procedures.


Scott Spangler
Jeffrey Kreulen

Mining the Talk
Unlocking the Business Value in Unstructured Information

In Mining the Talk, two leading-edge IBM researchers introduce a revolutionary new approach to unlocking the business value hidden in virtually any form of unstructured data, from word processing documents to websites, e-mails to instant messages. The authors review the business drivers that have made unstructured data so important and explain why conventional methods for working with it are inadequate. Then, writing for business professionals, not just data mining specialists, they walk step-by-step through exploring your unstructured data, understanding it, and analyzing it effectively. Next, you'll put IBM's techniques to work in five key areas: learning from your customer interactions; hearing the voices of customers when they're not talking to you; discovering the collective consciousness of your own organization; enhancing innovation; and spotting emerging trends. Whatever your organization, Mining the Talk offers you breakthrough opportunities to become more responsive, agile, and competitive.

- Identify your key information sources and what can be learned about them
- Discover the underlying structure inherent in your unstructured information
- Create flexible models that capture both domain knowledge and business objectives
- Create visual taxonomies: pictures of your data and its key interrelationships
- Combine structured and unstructured information to reveal hidden trends, patterns, and relationships
- Gain insights from informal talk by customers and employees
- Systematically leverage knowledge from technical literature, patents, and the Web
- Establish a sustainable process for creating continuing business value from unstructured data

paperback 240 pages

Table of Contents
Preface
Acknowledgements
Chapter 1: Introduction
Chapter 2: Mining Customer Interactions
Chapter 3: Mining the Voice of the Customer
Chapter 4: Mining the Voice of the Employee
Chapter 5: Mining to Improve Innovation
Chapter 6: Mining to See the Future
Chapter 7: Future Applications
Appendix: The IBM Unstructured Information Modeler Users Manual

5
Mining to Improve Innovation

At first glance, it might not seem like Mining the Talk has much to do with business innovation. We usually think of innovation as springing spontaneously from a brilliant idea coming from an individual. Although this is often the case, innovation also comes from collaboration. More and more companies are finding that to innovate effectively, they need to partner with other organizations that have enabling technology.1 Furthermore, even when innovations happen internally, bringing them to market often requires a detailed assessment of the existing intellectual property space, the research consensus surrounding related technology, and consumer receptiveness to the novel approach. Mining the Talk can help summarize all the information relevant to a new product introduction, enabling the business to make informed decisions about when a given innovation is market ready.

Up to this point, we have focused on mining informal communications. Primarily, these have been more or less conversational in tone, similar to the ordinary give and take of everyday speech. Now we embark on more formal territory. In this section, we describe methods for mining technical descriptions or research articles in a very specific subject area. This type of talk is more like a lecture than a dialogue. Even though far more detailed and precise than the more informal documents we dealt with earlier, the documents in this section are still reasonably short and speak about one topic only. So, many

of the same methods we used before will still apply here. Plus, we will add a few others that make sense in order to get the maximum value out of this very rich form of data.

Business Drivers of Innovation Mining

There is no single business driver of Innovation Mining that applies to all companies. Each company's innovation strategy is somewhat unique. Here are a few of the business drivers of Innovation Mining that we have come across:

- Having a set of research innovations that are sitting on the shelf because they lack a partner with the requisite enabling infrastructure that would allow them to succeed.
- The need for strategic market intelligence related to the technology landscape, in order to look for possible white space where new innovation may succeed.
- Wanting to understand whether a given new technology is actually new, or whether something similar has been done in the past by other businesses or research entities.
- Needing better sources of competitive intelligence: the desire to understand and counteract your competitors' innovation strategy.
- Locating who the experts are in a particular subject area, either within your own company or within the technical community at large.

These are just a few examples of how businesses can use readily available information to help enhance innovation, if only they can Mine the Talk.

Characteristics of Innovation Information

As was mentioned before, innovation information is not exactly like customer interaction or VoC information. These documents are more formal in nature, much more like a lecture than a conversation. Repositories of patents and research articles are publicly available data sources that can be mined for surprisingly detailed and comprehensive information on a wide array of technical subjects. To understand the process for mining such data, one must first understand how it is structured.

We will begin with patents, such as those applied for and granted by the United States Patent & Trademark Office (USPTO).

A patent for an invention is the grant of a property right to the inventor, issued by the Patent and Trademark Office. The term of a new patent is 20 years from the date on which the application for the patent was filed in the United States or, in special cases, from the date an earlier related application was filed, subject

to the payment of maintenance fees. U.S. patent grants are effective only within the U.S., U.S. territories, and U.S. possessions. The right conferred by the patent grant is, in the language of the statute and of the grant itself, the right to exclude others from making, using, offering for sale, or selling the invention in the United States or importing the invention into the United States. What is granted is not the right to make, use, offer for sale, sell, or import, but the right to exclude others from making, using, offering for sale, selling, or importing the invention.2

Patents are focused documents that talk in detail about one specific idea. They are a special kind of talk that is designed to describe a particular invention for the purpose of claiming exclusive rights to it. By law, the description must be complete enough that anyone skilled in the art of the technical subject area of the invention could reproduce the invention from the text and drawings in the patent.

A patent document has a somewhat structured format. Each patent document contains the same set of sections. These are as follows:

1. Title: The title of the patent.
2. Abstract: A concise statement of the technical disclosure, including that which is new in the art to which the invention pertains.
3. Claims: Define the invention and what aspects are legally enforceable. The specification must conclude with a claim particularly pointing out and distinctly claiming the subject matter that the applicant regards as his invention or discovery. The claim or claims must conform to the invention as set forth in the remainder of the specification, and the terms and phrases used in the claims must find clear support or antecedent basis in the description so that the meaning of the terms in the claims may be ascertainable by reference to the description. (See 37 CFR 1.58(a).)
4. Body: The main section of the patent that describes the invention in detail, along with any examples and figures.
5. References: Also known as prior art. Contains pointers to previous patents or publications that are related to the invention.

In addition, each patent has certain values associated with it:

1. Assignee(s): The owner of the patent.
2. Inventor(s): One who contributes to the conception of an invention. The patent law of the United States of America requires that the applicant in a patent application must be the inventor.

3. Main Class: Patents are classified in the U.S. by a system using a three-digit class and a three-digit subclass to describe every similar grouping of patent art. A single invention may be described by multiple classification codes. The first three-digit class is also called the Main Class.
4. Subclass: The second three digits in the classification.
5. Filed Date: The date the application was received by the patent office.
6. Publish Date: The subsequent date on which the patent office published the patent application.
7. Grant Date: The date the patent was granted by the patent office.3

Since 1977, over four million patents and applications have been published in electronic format. An electronic copy of this data can be purchased from the U.S. Patent Office. For companies that are serious about Innovation Mining, this is a worthwhile investment.

Research articles, such as those found in Medline,4 are structured in a similar fashion to patents, containing information such as publication date, title, abstract, authors, and references. Less formal data, such as that found more generally on the Web, may also be of use in gaining insight into public receptiveness or market readiness for a potential new product or service.

What Can Innovation Mining Teach Us?

Mining the Talk for Innovation is all about discovering hidden business relationships. For large organizations, these hidden relationships might easily exist internally, but every organization may have hidden external relationships whose potential is just waiting to be revealed. Mining the Talk for Innovation is also about how your company's innovation fits into the overall business ecosystem. What are the roadblocks, such as patents or other research, that might predate yours and be considered equal or superior? How receptive is the market likely to be to your innovation?

Getting a clear picture of how your company and its products fit into the business landscape is the primary driver of Innovation Mining. One way to do this is by taking the documents that describe the company's products or inventions and comparing them to similar documents from related companies or organizations. Using Mining the Talk techniques, we can leverage text models generated for our company and the other business entities to locate interesting areas of overlap or similarity. These in turn may lead to knowledge or insight about the business that results in the pursuit of some new product, research direction, or joint venture.
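The process described next combines SQL queries over the structured fields with text search and mining over the unstructured ones. As a minimal sketch of how the patent fields listed above might be laid out in a relational warehouse (the table and column names here are hypothetical, not taken from any actual product):

CREATE TABLE patents
  ( patent_id    VARCHAR(12) NOT NULL PRIMARY KEY
  , title        VARCHAR(500)
  , abstract     CLOB(32K)
  , claims       CLOB(1M)
  , body         CLOB(10M)
  , assignee     VARCHAR(200)
  , main_class   CHAR(3)
  , subclass     CHAR(3)
  , filed_date   DATE
  , publish_date DATE
  , grant_date   DATE
  )

The unstructured columns (title, abstract, claims, body) feed the text mining, while the structured columns drive queries, trend analysis, and co-occurrence analysis.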

The Innovation Mining Process

The Innovation Mining process expands upon the framework we've already described to leverage unstructured information in the organization. The taxonomy creation and editing steps remain much the same, but more work is done both before the taxonomy is initially created and after editing of the taxonomy is complete. This is because the data sets we are looking at are more extensive and generally more complex, and also because more detailed kinds of analysis are needed to discover all the potential hidden relationships that might exist between our company and other business entities.

We therefore require a further refinement of our Mining the Talk methodology that we refer to as Explore, Understand, and Analyze. This is a process that utilizes the power of both SQL queries to a database and indexed search via a search engine in order to locate potential data of interest. That's the Explore. It then pulls the unstructured text content for the matching records into a taxonomy in order to allow the user to make sense out of what was found. The user looks at the categories and adjusts them using the methods we have outlined previously for visualizing and editing taxonomies. That's the Understand. Finally, the taxonomy is compared to other structured fields in the database using co-occurrence table analysis and trend analysis. That's the Analyze. Put it together, and you have a powerful method for discovering useful insights from patents, or any other data that is relevant to innovation (see Figure 5-1).

Figure 5-1 The Explore, Understand, Analyze process (Explore: search, query, join; Understand: create and edit taxonomies; Analyze: apply taxonomies, trend, correlate)

Understanding of Business Objectives and Hidden Business Concerns

Understanding objectives is always the key first step in any Mining the Talk endeavor, and never more so than in Innovation Mining. Many factors come into play when considering possible business partnering relationships. There are usually historical issues such as previous agreements (or disagreements) between the companies. Then there are

issues around the competitive landscape. For example, it may not make sense to partner with a company if this might raise anti-trust concerns. Companies that analysis says may be perfectly good partnership candidates may not actually be possible partners for reasons that are extraneous to the data. Furthermore, the available information is not always what it seems on the surface. Patents that were originally assigned to one company are later sold or licensed to others, and the publicly available information does not always reflect this. Some patents or technologies may be off limits for other internal strategic reasons. All of this information may be closely guarded and difficult to come by. Finding out as much as you can about the technology licensing environment, as early as you can, will save much wasted effort down the road.

Demonstration of Capabilities

One of the wonderful properties of the innovation information we discuss in this chapter is that it is both public and very strategic. This means that for almost any company with a significant history and technical track record, we can use Mining the Talk principles to quickly generate a taxonomy of their patent portfolio and compare this taxonomy across the landscape of their competition. One example of such a landscape would be a taxonomy of patents drawn from a selected industry, with a structured co-occurrence table built from the assignee information for each patent. In almost every case, I have found the results of this analysis to communicate something important to the primary stakeholders about their business, often something they either did not know or something they thought that only they could know. Although this survey-type analysis may not lead immediately to business impact, it does serve to make the stakeholders aware of the power of the technology when applied to this data, and gets them thinking about other applications in this space. Other analyses that can be done from such a collection include recent trends (what technologies are emerging in the industry) and white space analysis (what technical areas are not being concentrated on by any of the major players). Such a demonstration helps generate specific ideas around how Innovation Mining can be used most effectively to suit a given organization's business objectives.

Survey of Available Data

Aside from patent data, there are other public sources of information that are useful to mine for innovation. In health care and life sciences, there is the Medline database of research abstracts. This can provide a wealth of information on the latest studies and results in medicine and human health. Other databases of research publications are available from various research and technical associations, such as the ACM or IEEE. In addition, many companies have internal databases related to invention disclosures, trade secrets, and research publications.

Data Cleansing and the ETL Process

Because of the amount of data involved and the complexity of the content, the ETL process for building a data warehouse for Innovation Mining is a serious endeavor. Documenting all the details of the best practices for achieving clean data in this area could be the subject of another book, but suffice it to say that the data cleansing effort required for the U.S. patent corpus alone can be quite extensive. As an example, the assignee field for this data contains many inconsistencies and changes over time. In the long run, such problems need to be addressed. In the short term, the information can be put into a data warehouse in rough form and then refined gradually as time goes on, while still using it to generate value prior to achieving the desired state of cleanliness.

Explore: Selecting the Collection

Explore is the first stage of the Innovation Mining process. Its purpose is to locate the specific documents that are relevant to a given Mining the Talk exercise. In previous chapters of this book, the Explore phase was essentially ignored because it was fairly obvious. For example, in the case of mining customer interactions, the Explore phase would have encompassed the extraction of all customer interactions for a given product over a given time period. For Voice of the Customer Mining, the Explore phase would have corresponded to finding all customer comments around a specific brand or company name. In both of these cases, the selection of what information to mine is readily apparent.

The Innovation Mining Explore process may be much more involved. It makes use of multiple database queries and/or text index searches in combination. The problem is to identify documents that are both on topic and in the correct context. By on topic, we mean that they are relevant to the subject area under consideration (for example, we don't want railroad patents getting mixed in with our training methodology analysis). By correct context, we mean, as an example, assigned to our company and not expired.

The basic operation of the Explore process is a search or query. A search is a keyword string that matches some word or phrase in the text of the document collection. All matching articles are returned and put in a result collection. A query is a database operation that returns all articles that match the value of a structured field. These are also returned and put in collections. These collections can then be further refined using additional queries or searches. Multiple collections can be merged by intersection or join operations to create additional collections. The final result is a single collection of documents whose unstructured information can be extracted and analyzed during the Understand phase.
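In SQL terms, an Explore step might look like the following minimal sketch against the hypothetical patents table from earlier (the class code and keyword are made-up placeholders):

-- Query: a structured predicate on a field
SELECT patent_id, title, abstract
FROM patents
WHERE main_class = '600';                          -- hypothetical class code

-- Search: a keyword match against the unstructured text
SELECT patent_id, title, abstract
FROM patents
WHERE LOWER(abstract) LIKE '%medical equipment%';  -- stand-in for an indexed search

-- Merge: intersect two previously built result collections
SELECT patent_id FROM collection_a
INTERSECT
SELECT patent_id FROM collection_b;

In practice, the keyword search would run against a text index via a search engine rather than a LIKE scan, but the shape of the operation is the same.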

Understand

The purpose of creating and refining the taxonomy in Innovation Mining is the same as in the other Mining the Talk applications: to create categories that mirror the underlying business purpose of the analysis. However, since innovation is very much a discovery-oriented process, taxonomy refinement may not require much category editing at all. It is only important that each category in the taxonomy be an actual meaningful concept that can be understood and will be useful during analysis.

The knowledge gained in this phase may lead to the conclusion that some data is extraneous, or that other data is missing from the collection created during the Explore phase. Any categories discovered during the Understand phase that are irrelevant to the subject of interest may be eliminated by removing the data corresponding to those categories from the taxonomy object. Additional data may need to be added if, while looking at categories, a possible new keyword query is discovered that might add significant relevant material. This return to the Explore phase may be an iteration that is repeated several times.

Analyze

The Analyze phase is a search for relevant trends and correlations between the taxonomy and structured information (or other taxonomies). In addition to looking for correlations between categories in a taxonomy and structured field values such as assignee, we would also look for correlations between structured field values and specific words or phrases in the unstructured text. This finer-grained analysis technique can detect specific concepts that are not frequent enough to be categories on their own. Correlating dictionary terms over time can lead to the discovery of emerging technologies in a given industry. This in turn may lead to new categories being created in the taxonomy (returning to the Understand phase). As in other types of Mining the Talk, any correlations or trends that are discovered must be illustrated with typical examples from the document collection, in order to discern and verify the underlying cause of the co-occurrence.

Presenting Results to Stakeholders

Results for Innovation Mining consist of reports showing the company's position in the technology landscape. This includes high-level summaries showing counts of patents or research papers in a given field in comparison to other companies, with these counts broken down by key technologies. Beyond such high-level statistics, summaries of the most interesting patents or research documents in specific areas of interest should be shown, with the most relevant language from the document highlighted. In the case of research articles, an aggregate summary of research conclusions in a given area may be

sufficient. In the case of patents, specific patents and even specific claims within patents may need to be specified. The key in every presentation is to go as deep as necessary in bringing out the relevant data to validate your conclusions.

Making Innovation Mining an Ongoing Concern

The ongoing analysis of innovation information is the key to realizing the full innovation potential of the organization. We have developed the following process flow to illustrate roughly how Innovation Mining should be done across many different data sources and combined to produce a focused analysis to take to potential innovation partners (see Figure 5-2).

Figure 5-2 The ongoing Innovation Mining process (data sources: news and Web, patent warehouse, internal documents; stages: technology identification and decomposition, technology landscape analysis, technology cluster, technology valuation, partner identification, package development)

In this ongoing process, data is analyzed as it emerges from news articles and newly released research on the Web. When significant new approaches or events happen that are flagged to be of interest (such as a hot new technology, or a dispute over technology ownership issues in a field related to the business), a technology decomposition and identification process takes place that mines the patent literature to find documents related to the technology in question. This might include both patents and pre-granted patent applications. Internal research documents and invention disclosures within the company are then mined in combination with patents from other companies to fill out the technology landscape picture. All of these inputs are then combined into a cluster of patents and/or technologies whose individual value is then assessed, based on

factors such as patent citations, internal rankings, estimates of discoverability, and typical licensing value for technologies in this area. Finally, a formal package is put together for presentation to the potential technology partner to make the case for a joint venture or licensing agreement of some kind.

Doing such Innovation Mining on an ongoing basis requires, first of all, continuous updating of public data sources, such as U.S. patents or Medline research articles. The fees, infrastructure, and overhead required for such ETL processes are not negligible, so the organization will need to consider whether the potential benefits provide enough justification to incur this ongoing cost. Vendors who supply this information for a reasonable fee may be the answer. Beyond the data issues, becoming an ongoing part of the organization's innovation process is important. Innovation Mining needs to be inserted as a regular part of the invention disclosure and evaluation process (i.e., the process in each company for writing and submitting patents). In addition, patent licensing or technology partnership groups in the organization should leverage Innovation Mining analytics whenever such arrangements are under consideration.

The Innovation Mining Process in Action: Finding the Most Relevant Patents in a Large Portfolio

The best way to make the Innovation Mining process clear is with a specific example using patents. In this fictional scenario,5 we show how to compare two patent portfolios. Imagine that IBM wants to see what intellectual property it owns that might be potentially useful to companies in unrelated industries. One example of such an industry would be medical equipment suppliers. So we begin by finding a significant set of patents related to medical equipment. This can be done with search queries, but the danger of this approach is that we might miss significant areas of the space by not knowing exactly what terms to query with. A better way is to pick one or more major companies in the industry and retrieve all of their patents. This should provide us with a good cross-section of patents in the industry.

We use structured queries and unstructured searches against the text abstract field and the major class code field in the patent data warehouse to create a collection that is generally relevant to the medical equipment subject area. We then look at the assignee field counts for the patents in this collection to find a single company that is broadly representative of this area. Finally, we use a structured query against this assignee name (and its variations) to find all the patents owned by this company. The resulting collection is 1,942 patents in size. This gives us a starting place to begin mapping out the medical equipment technology space. In some cases, we might wish to refine this collection further by looking at the patent grant date and removing older patents from consideration, but for now let's simply assume we are finished exploring. Next, we move on to the Understand phase.
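The queries just described reduce to a couple of SQL statements against the hypothetical warehouse (the assignee names below are placeholders, not the actual company):

-- Find a broadly representative assignee within the medical equipment collection
SELECT assignee, COUNT(*) AS patent_count
FROM medical_collection
GROUP BY assignee
ORDER BY patent_count DESC;

-- Retrieve all patents owned by the chosen company, including name variations
SELECT *
FROM patents
WHERE assignee IN ('ACME MEDICAL', 'ACME MEDICAL INC');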

Creating a Taxonomy

We use text clustering to create a taxonomy of patents. In this case, we use k-means clustering. Intuitive clustering also works, but it has fewer advantages over k-means in the patent space. One reason is that patents are longer documents that contain more words, with each word generally occurring in more examples, and multiple times in each example. This makes the cohesion metric somewhat less powerful.6

The text we use as input for clustering is usually a concatenation of title, abstract, and claims. These fields give us the primary gist of the patent, without adding the potentially extraneous detail of prior art or background information. It is important to include the claims, because they are the legal definition of the patent, whereas the title and abstract may not necessarily contain all the important key elements.

The taxonomy we create for the medical equipment supplier is to help us model their IP space, to find patents in IBM's portfolio that might be relevant to the medical equipment supplier's business (see Figure 5-3).

Figure 5-3 Initial medical equipment supplier taxonomy

Editing the Taxonomy

The purpose of editing is to better understand what technologies make up the medical equipment supplier's portfolio and to make sure that text clustering has done a reasonable job in creating meaningful categories that will help us to define the medical equipment supplier's business. It is not necessary to have deep expertise in the domain

in order to rename categories or make other adjustments that help to make the categories more meaningful.

In the editing process, we view each of the categories in order of decreasing cohesion, renaming them as we go to names that make more sense. For instance, the category in Figure 5-4, originally named extending,port, we rename to sealing and closures, based upon looking at the most typical patents in the category and also at summary term statistics for the category.

Figure 5-4 Renaming a category

When we discover that two categories are similar, we merge them into one. Similar categories can be found by sorting on the Distinctness metric. In the example shown in Figure 5-5, we can see that the chamber and fluid & flow categories are the least distinct. They have the same distinctness score because distinctness is based on the distance to the nearest category, and these two must be nearest to each other.

Some classes may be deleted if they have no clear theme. The contained class in this taxonomy is one example. Before deleting, we check to be sure the documents in the category will be sent to the correct location. The chart in Figure 5-6 indicates the percentage of documents that will be sent to each of the other remaining classes. Each slice of the pie chart is selectable, and we can look at a display of typical patents that represent that slice to determine its content. This helps the analyst to verify that the results of the class deletion will be reasonable.

Figure 5-5 Merging classes based on distinctness

Figure 5-6 Secondary classes for contained

The resulting taxonomy after editing looks like the one in Figure 5-7.

Figure 5-7 Fully edited taxonomy

This completes the Understand phase.

Applying the Medical Equipment Supplier Taxonomy to IBM's Portfolio

Now that we understand what it means to be a medical equipment patent, we can use that understanding to find similar patents in IBM's portfolio. This is a really powerful idea. The concept is to use the category models we have generated for the medical equipment supplier portfolio and apply them to a different set of data: the IBM portfolio. We use a centroid classification model7 for the taxonomy to classify each IBM patent into the nearest medical equipment supplier category. The cosine similarity metric is used to measure the distance between every IBM patent and every medical equipment supplier centroid (having first converted each IBM patent to the feature space of the medical equipment supplier centroids). Then each IBM patent is classified into the category of the closest medical equipment supplier centroid.

Figure 5-8 shows the result of this classification process. Notice that there are over 40,000 patents and applications in the IBM portfolio. This is far more than we could comb through individually to find patents of interest to the medical equipment supplier (even using search, how would we know which terms to search for?). Of course, this results in far more patents in each category than actually should exist. The classification model is only as good as the training set used to create it, and since the domain of IBM's patents is, for the most part, far different from the field of the medical equipment supplier's patents, it is no surprise that many of IBM's patents fall into categories they should not. However, it is also true that if we look in each category and view those patents that are nearest to each of the medical equipment supplier centroids, we should find those patents that are most similar to the medical equipment supplier's, and thus the ones that are most relevant.
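For reference, the cosine similarity used here is just the standard definition for two term-weight vectors, a patent d and a centroid c (nothing specific to this tool is assumed):

  cos(d, c) = ( sum_i d_i * c_i ) / ( sqrt(sum_i d_i^2) * sqrt(sum_i c_i^2) )

Each IBM patent d is assigned to the category whose centroid c maximizes cos(d, c), and within a category, patents can be ranked by that same score to surface the most typical matches.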

Figure 5-8 Medical equipment supplier taxonomy applied to IBM's portfolio

Examining each category, we find the individual IBM patents that are most related to the medical equipment supplier's business by focusing on the patents nearest to each category centroid (see Figure 5-9).

Figure 5-9 IBM patents in the pump category

The patent shown in the figure is the Most Typical IBM patent in the medical equipment supplier pump category. This is a very surprising patent to find in the IBM portfolio. Listed next are some other surprises we found.

Most Typical from the Valve Category

Patent number:
Title: Torsion heart valve
Abstract: A rigid leaflet blood check valve uses a torsion wire suspension to suspend the leaflet. The leaflet is non-contacting with the housing. Complete washout of all valve parts in blood contact exists. This eliminates areas of blood stasis which exist in valves employing conventional pivot bearings. Relative motion, wear, and washout problems have been eliminated in the valve.

Most Typical from the Blood Category

Patent number:
Title: Blood pump actuator
Abstract: An implantable blood pump actuator uses an efficient direct drive voice coil linear motor to power an hydraulic piston. The motor is immersed in oil and has but one moving part, a magnet which is attached to the piston. The piston is supported using a linear double acting hydrodynamic oil bearing that eliminates all wear and will potentially give an infinitely long life. There are no flexing leads, springs to break or parts to fatigue. A total artificial heart embodiment is very light, small, and as efficient as present prosthesis. The motor uses two commutated coils for high efficiency and has inherently low side forces due to the use of coils on both the inside and outside diameters of the magnet. The motor is cooled by forced convection of oil from the piston motion and by direct immersion of the coils in the oil. The highly effective cooling results in a smaller lighter-weight motor.

Amazing what kinds of things turn up in IBM's portfolio! Imagine how difficult it would be to find these patents using search techniques alone, given limited knowledge of the medical equipment domain. Searching on the phrase medical equipment would not have revealed any of these. By creating powerful models to do the sifting for us, we have created a kind of surrogate text reader that knows what we want (because we taught it from examples) and goes out and finds it for us.

This is the power of the Mining the Talk approach. The taxonomy generation and editing phase lets the user interact with the text to derive a taxonomy that captures a combination of data plus knowledge. The text information provides the substance: the concrete examples of what actually exists out in the world. The knowledge provides the motivation: what do you want to learn

from the data and how do you want to apply it? Without knowledge, data is meaningless. Without data, knowledge is abstract and theoretical. The Mining the Talk approach leverages both to provide value that neither can supply alone.

The results of such analyses can be monetized in many ways. There is the possibility of licensing or outright sale of patents owned in a technology that is not currently being developed by the organization. Another approach is to partner with a company doing similar work in the space to leverage each other's strengths: one company does the product development, while the other takes the product to market, for example.

Research for a New Product: Mars Snackfood Division

In 2005, a research team in the Mars company came to us with an interesting problem. They had a new product they were about to introduce called CocoaVia. CocoaVia is a line of chocolate-based products that maintain healthy serum cholesterol levels and promote healthy circulation. It is based on a novel technology for making chocolate products from uniquely processed cocoa that allowed the product to retain:

natural plant extracts clinically proven to reduce bad (LDL) cholesterol levels up to 8%. All CocoaVia Brand snacks also have a guaranteed 100 mg of naturally occurring cocoa flavanols, like those found in red wine and green tea. Studies indicate that flavanols in cocoa and certain chocolates have a beneficial effect on cardiovascular health. (CocoaVia.com)

The questions that Mars had for us were: 1) How was this product likely to be perceived in the marketplace? 2) What claims could they reasonably make about its health benefits that would resonate with current consumer trends? 3) What were their competitors doing in this space? As it turns out, each of these questions could be answered by analyzing the talk from three different data sources. Here is how we accomplished it.

Marketplace Analysis (Web Data)

The question of marketplace perception can best be answered by looking at what people were saying about the health benefits of cocoa on the Web. To get at this, we start with a set of snippets (see Chapter 3, Mining the Voice of the Customer, for a description of how to create snippets) around the term chocolate, extracted from web pages. We then cluster these snippets into categories and edit the categories to create a taxonomy, and add a time-based analysis of the categories to create the display shown in Figure 5-10.

Figure 5-10 Trends in chocolate web page snippets

We select the Health category as being especially relevant to the new product and create a trend chart. We can use our Chi-Squared co-occurrence analysis to detect correlations between contiguous time intervals and words/phrases in the dictionary. These correlations are then shown as labels in the trend chart (see Figure 5-11). Each term label is followed by a number in parentheses indicating the support for that word or phrase during that time period. Words shown on the chart co-occur significantly with the Health category during the time period where they are displayed.

The interesting thing to note here is the movement away from mentions of sugar_free chocolate toward words like heart and phrases like health_fitness. This is a positive indication that the CocoaVia product (which is not sugar free) may be filling a recent consumer need. The next chart reveals this more clearly (see Figure 5-12). Web mentions of sugar free and chocolate are actually on the decline, in stark contrast to health chocolate mentions. Looking at the posts that underlie these trends shows snippets corresponding to consumers, newspapers, and other product advertisements. So, from the standpoint of public/marketplace acceptance of healthy chocolate, it appears that CocoaVia is on track to hit a receptive marketplace.
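For reference, the chi-squared test referred to here is the standard one for a contingency table of observed counts O versus expected counts E (the generic formula, not anything specific to this tool):

  chi-squared = sum over all cells of ( O - E )^2 / E

For a given term and time interval, the cells count snippets by term presence or absence and by interval membership; a large statistic flags a term whose frequency in that interval departs significantly from what the overall rates would predict.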

Figure 5-11 Health trends related to chocolate

Figure 5-12 Sugar free compared to health in chocolate

Health Benefits Analysis (Medline Abstracts)

The question of what health benefits can be claimed is best answered by looking at Medline research abstracts, because Medline consists of scientific, peer-reviewed journals. (Medline is an indexing service for research in medicine and related fields provided by the U.S. National Library of Medicine.) For this analysis, we did a query on polyphenols (the substance the CocoaVia product contains that contributes to health). We then did a snippet clustering around the word conclusion in order to focus on just the final result of each study. The resulting taxonomy revealed several relevant study results for Mars:

- Olive oil and red wine antioxidant polyphenols at nutritionally relevant concentrations transcriptionally inhibit endothelial adhesion molecule expression, thus partially explaining atheroprotection from Mediterranean diets. ( nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&dopt=abstract&list_uids= )
- Our experiments are the first demonstration that dietary polyphenols can modulate in vivo oxidative damage in the gastrointestinal tract of rodents. These data support the hypothesis that dietary polyphenols might have both a protective and a therapeutic potential in oxidative damage-related pathologies. ( entrez/query.fcgi?cmd=retrieve&db=pubmed&dopt=abstract&list_uids= )
- A slight reduction in saturated fat intake, along with the use of extra-virgin olive oil, markedly lowers daily antihypertensive dosage requirement, possibly through enhanced nitric oxide levels stimulated by polyphenols. ( gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&dopt=abstract&list_uids= )
- Cocoa powder and dark chocolate may favorably affect cardiovascular disease risk status by modestly reducing LDL oxidation susceptibility, increasing serum total antioxidant capacity and HDL-cholesterol concentrations, and not adversely affecting prostaglandins. ( pubmed&dopt=abstract&list_uids= )
- Both clinical and experimental evidence suggest that red wine does indeed offer a greater protection to health than other alcoholic beverages. This protection has been attributed to grape-derived antioxidant polyphenolic compounds found particularly in red wine. ( pubmed&dopt=abstract&list_uids= )

Clearly, substantial clinical evidence exists to support the idea that polyphenols have health benefits. It seems reasonable, then, for Mars to communicate these potential health benefits of the CocoaVia products in their marketing.

Competitive Analysis (Patents)

We look at patents to understand the competitive landscape, using the Explore-Understand-Analyze method. For this analysis, we first collected all patents that mention polyphenols in the text. Next we created and edited a taxonomy, making sure that we created a category around patents related to the process of interest, namely preserving polyphenols in cocoa during processing. We then created a co-occurrence table showing the relationship between assignees for the polyphenol set and subject areas. The result is shown in Figure 5-13. (In the original chart, each count is also shown as the percentage of that assignee's total patents, and cells are shaded by affinity: very high, moderate, low, or none.)

Subject area                          Size  ADM  Battelle  Meiji  Mars  N/A  Nestec  Pacific  P&G
preserve polyphenols in cocoa           45    0      0       1     41    3     0        0      0
antineoplastic                          34    0      0       0     34    0     0        0      0
extraction of polyphenols                7    0      0       0      5    2     0        0      0
extract_compounds from cocoa             4    0      0       0      4    0     0        0      0
beverage products                        8    0      0       0      3    1     1        1      2
postprandial                             2    0      0       0      2    0     0        0      0
extraction of sterols using solvent      3    0      0       0      1    2     0        0      0
chocolate storage                        4    0      2       0      0    2     0        0      0
preservation of flavanoids               4    0      0       0      0    4     0        0      0
cosmetic                                 3    2      0       0      0    1     0        0      0

Figure 5-13 Assignees vs. subject area for polyphenol patents (ADM = Archer Daniels Midland; Battelle = BATTELLE MEMORIAL...; Meiji = MEIJI SEIKAKA...; Nestec = NESTEC, SA; Pacific = PACIFIC FIM M...; P&G = PROCTER + GAMBLE)

Mars has 41 of the 45 patents in the space of preserving polyphenols in cocoa. Mars seems to have a significant competitive advantage in this key technology. A closer look at the four competing patents did not reveal any concerns. So from this standpoint as well, it would seem that the product launch should be a go. In fact, Mars did subsequently launch the CocoaVia product line, and it is still available in stores as of the writing of this book.

Summary

Innovation Mining is a powerful tool for discovering relationships in the business ecosystem. The examples we've shown illustrate how Mining the Talk enhances corporate innovation, enabling monetization of technology that might otherwise sit on the shelf. A new product or device that only sits in the lab is not truly innovative. It's only when that product or device finds the right development partner or consumer that true innovation takes place. Mining the Talk helps to find the right home for technology and products by exploring the space around them and finding the right development partner or market niche, a crucial component for successful innovation.

But good Innovation Mining is also not a trivial exercise. It requires direct access to large, dynamic data sources and expertise in using all of the Mining the Talk techniques we have outlined so far. Creating an ongoing mining process for innovation requires a significant investment, but the potential rewards in greater leverage of corporate technology into new products and services far outweigh the cost.

Endnotes

1. Gassmann, O., and Enkel, E. (2004). "Towards a Theory of Open Innovation: Three Core Process Archetypes." Paper presented at the R&D Management Conference 2004, Lisbon.
A service of the U.S. National Library of Medicine.
5. We use a fictional scenario here because actual patent licensing or assignment arrangements that we have used Mining the Talk technology to facilitate are considered confidential IBM information.
6. After all, in the extreme worst case, where every word occurs in every patent, every word has exactly the same cohesion score. Although this is hardly ever the case, it does show that cohesion as a metric eventually breaks down when the documents become long and the words used in each become less differentiated. This is more the case with patents than with shorter documents, such as problem tickets.
7. The centroid model is only one text classification model that could be applied here. It is usually the best one to use because it models the way that k-means created the categories in the first place.


Dean Meltz, Rick Long, Mark Harrington, Robert Hain, Geoff Nicholls

An Introduction to IMS: Your Complete Guide to IBM's Information Management System

Introduces IMS, one of the world's premier software products. Thoroughly covers key IMS functions, from security to Java support. For both new and experienced IMS administrators, programmers, architects, and managers. Prerequisite reading for the IBM IMS Mastery Certificate Program.

IMS serves more than 95 percent of Fortune 1000 companies, manages 15,000,000 gigabytes of production data, and supports more than two hundred million users per day. The brand-new IBM IMS Version 9 is not just the world's #1 platform for very large online transaction processing: it integrates with Web application server technology to enable tomorrow's most powerful Web-based applications. Now, for the first time in many years, there's a completely up-to-date guide to understanding IMS in your business environment.

An Introduction to IMS covers: installing and configuring IMS Version 9; understanding and implementing the IMS hierarchical database model; understanding and working with the IMS Transaction Manager; mastering core application programming concepts, including program structure and IMS control blocks; taking advantage of IMS 9 Java programming enhancements; working with the IMS Master Terminal; administering IMS: system definition, customization, logging, security, operations, and more; running IMS in a Parallel Sysplex environment.

Whether you've spent a career running IMS or you are encountering IMS for the first time, this book delivers the insights and skills you need to succeed as an application designer, developer, or administrator.

hardcover, 592 pages

Table of Contents
Part I Overview of IMS
1. IMS: Then and Now
2. Overview of the IMS Product
3. Accessing IMS
4. IMS and z/OS
5. Setting Up and Running IMS
Part II IMS Database Manager
6. Overview of the IMS Database Manager
7. Overview of the IMS Hierarchical Database Model
8. Implementing the IMS Hierarchical Database Model
9. Data Sharing
10. The Database Reorganization Process
11. The Database Recovery Process
Part III IMS Transaction Manager
12. Overview of the IMS Transaction Manager
13. How IMS TM Processes Input
Part IV IMS Application Development
14. Application Programming Overview
15. Application Programming for the IMS Database Manager
16. Application Programming for the IMS Transaction Manager
17. Editing and Formatting Messages
18. Application Programming in Java
Part V IMS System Administration
19. The IMS System Definition Process
20. Customizing IMS
21. IMS Security
22. IMS Logging
23. Database Recovery Control (DBRC) Facility
24. Operating IMS
25. IMS System Recovery
26. IBM IMS Tools
Part VI IMS in a Parallel Sysplex Environment
27. Introduction to Parallel Sysplex
28. IMSplexes
Part VII Appendixes

CHAPTER 18

Application Programming in Java

The IMS Java function implements the JDBC API, which is the standard Java interface for database access. JDBC uses SQL (Structured Query Language) calls. The IMS implementation of JDBC supports a selected subset of the full facilities of the JDBC 2.1 API. The IMS-supported subset of SQL provides all of the functionality (select, insert, update, delete) of traditional IMS applications. The IMS Java function also extends the JDBC interface for storage and retrieval of XML documents in IMS. For more information, see "XML Storage in IMS Databases" on page 321.

In addition to JDBC, the IMS Java function has another interface to IMS databases called the IMS Java hierarchical database interface. This interface is similar to the standard IMS DL/I database call interface and provides lower-level access to IMS database functions than the JDBC interface. However, JDBC is the recommended access interface to IMS databases, and this chapter focuses on JDBC. For information about the IMS Java hierarchical database interface, see Appendix C, "IMS Java Hierarchical Database Interface," in IMS Version 9: IMS Java Guide and Reference.

In This Chapter:
Describing an IMS Database to the IMS Java Function on page 312
Supported SQL Keywords on page 313
Developing JMP Applications on page 314
Developing JBP Applications on page 315
Enterprise COBOL Interoperability with JMP and JBP Applications on page 316
Accessing DB2 UDB for z/OS Databases from JMP or JBP Applications on page 317
Developing Java Applications That Run Outside of IMS on page 317
XML Storage in IMS Databases on page 321

DESCRIBING AN IMS DATABASE TO THE IMS JAVA FUNCTION

In order for a Java application to access an IMS database, it needs information about the database. This information is contained in the PSB (program specification block) and in DBDs (database descriptions), which you must first convert into a form that you can use in the Java application: a subclass of the com.ibm.ims.db.DLIDatabaseView class called the IMS Java metadata class. The DLIModel utility generates this metadata from the IMS PSBs, DBDs, COBOL copybooks, and other input specified by utility control statements. In addition to creating metadata, the DLIModel utility also:

- Generates XML schemas of IMS databases. These schemas are used when retrieving XML data from or storing XML data in IMS databases.
- Incorporates additional field information from XMI input files that describe COBOL copybooks.
- Incorporates additional PCB, segment, and field information, or overrides existing information.
- Generates a DLIModel IMS Java Report, which is designed to assist Java application programmers. The DLIModel IMS Java Report is a text file that describes the Java view of the PSB and its databases.
- Generates an XMI description of the PSB and its databases.

The DLIModel utility can process most types of PSBs and databases. For example, the utility supports:

- All database organizations except MSDB, HSAM, SHSAM, and GSAM
- All types and implementations of logical relationships
- Secondary indexes, except for shared secondary indexes
- Secondary indexes that are processed as standalone databases
- PSBs that specify field-level sensitivity

The DLIModel utility is a Java application, so you can run it from the UNIX System Services prompt, or you can run it using the z/OS-provided BPXBATCH utility. Figure 18-1 on page 313 shows the inputs to and outputs from the DLIModel utility.

Related Reading: For more information about the DLIModel utility, see IMS Version 9: Utilities Reference: System.

Figure 18-1 DLIModel Utility Inputs and Outputs

SUPPORTED SQL KEYWORDS

Table 18-1 contains the portable SQL keywords that are currently supported by the IMS Java function. None of the keywords are case-sensitive.

Table 18-1 SQL Keywords Supported by the IMS Java Function
ALL, AND, AS, ASC, AVG, COUNT, DELETE, DESC, DISTINCT, FROM, GROUP BY, INSERT, INTO, MAX, MIN, OR, ORDER BY, SELECT, SUM, UPDATE, WHERE
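To make the subset concrete, here is a minimal sketch of a JDBC query that stays within the keywords of Table 18-1. The PCB name (Dealership), segment name (Customer), and field names are hypothetical, not from the book; the connection is assumed to be an already open IMS JDBC connection.

    import java.sql.*;

    // A minimal sketch, assuming a hypothetical PCB "Dealership" and segment
    // "Customer". IMS Java SQL addresses tables as PCB.Segment, and only the
    // keywords in Table 18-1 are available (keywords such as HAVING and LIKE
    // do not appear in the list).
    public class KeywordSketch {
        public static void query(Connection conn) throws SQLException {
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                "SELECT LastName, FirstName FROM Dealership.Customer " +
                "WHERE Customer.LastName = 'SMITH' ORDER BY FirstName ASC");
            while (rs.next()) {
                System.out.println(rs.getString("LastName") + ", "
                    + rs.getString("FirstName"));
            }
            rs.close();
            stmt.close();
        }
    }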

DEVELOPING JMP APPLICATIONS

JMP applications access the IMS message queue to receive messages to process and to send output messages. Therefore, you must define input and output message classes by subclassing the IMSFieldMessage class. The IMS Java class libraries provide the capability to process IMSFieldMessage objects. JMP applications commit or roll back the processing of each message by calling IMSTransaction.getTransaction().commit() or IMSTransaction.getTransaction().rollback().

JMP applications are started when IMS receives a message with a transaction code for the JMP application and schedules the message. JMP applications end when there are no more messages with that transaction code to process. A transaction begins when the application gets an input message and ends when the application commits the transaction. To get an input message, the application calls the getUniqueMessage method. The application must commit or roll back any database processing. The application must issue a commit call immediately before calling subsequent getUniqueMessage methods. Figure 18-2 shows the general flow of a JMP application program.

public static void main(String args[]) {
    conn = DriverManager.getConnection(...);        // Establish DB connection
    while (messageQueue.getUniqueMessage(...)) {    // Get input message, which
                                                    // starts transaction
        results = statement.executeQuery(...);      // Perform DB processing
        ...
        messageQueue.insertMessage(...);            // Send output messages
        ...
        IMSTransaction.getTransaction().commit();   // Commit and end transaction
    }
    conn.close();                                   // Close DB connection
    return;
}

Figure 18-2 JMP Application Example
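The input and output message classes mentioned above are built by subclassing IMSFieldMessage. Here is a minimal sketch of an input message class in that style; the field names, offsets, and lengths describe a hypothetical 20-byte message layout and are not from the book.

    import com.ibm.ims.base.DLITypeInfo;
    import com.ibm.ims.application.IMSFieldMessage;

    // A minimal sketch, assuming a hypothetical 20-byte input message.
    public class CustomerInput extends IMSFieldMessage {
        // Each DLITypeInfo entry maps a named field to a type, a 1-based
        // offset, and a length within the message.
        final static DLITypeInfo[] fieldInfo = {
            new DLITypeInfo("LastName",  DLITypeInfo.CHAR,  1, 10),
            new DLITypeInfo("FirstName", DLITypeInfo.CHAR, 11, 10)
        };

        public CustomerInput() {
            // Arguments: field layout, total message length, isSPA
            // (false: this is not a scratch pad area message)
            super(fieldInfo, 20, false);
        }
    }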

JMP Applications and Conversational Transactions

A conversational program runs in a JMP region and processes conversational transactions that are made up of several steps. It does not process the entire transaction at the same time. A conversational program divides processing into a connected series of terminal-to-program-to-terminal interactions. Use conversational processing when one transaction contains several parts.

A nonconversational program receives a message from a terminal, processes the request, and sends a message back to the terminal. A conversational program receives a message from a terminal and replies to the terminal, but it saves the data from the transaction in a scratch pad area (SPA). Then, when the person at the terminal enters more data, the program has the data it saved from the last message in the SPA, so it can continue processing the request without the person at the terminal having to enter the data again. The application package classes enable applications to be built using the IMS Java function.

Related Reading: For details about the classes you use to develop a JMP application, see the IMS Java API Specification, which is available on the IMS Java Web site.

DEVELOPING JBP APPLICATIONS

JBP applications do not access the IMS message queue, and therefore you do not need to subclass the IMSFieldMessage class. JBP applications are similar to JMP applications, except that JBP applications do not receive input messages from the IMS message queue. The program should periodically issue commit calls, except for applications that have the PSB PROCOPT=GO parameter. Unlike BMP applications, JBP applications must be non-message-driven applications.

Like BMP applications, JBP applications can use symbolic checkpoint and restart calls to restart the application after an abend. The primary methods for symbolic checkpoint and restart are:

IMSTransaction().checkpoint()
IMSTransaction().restart()

These methods perform functions analogous to the DL/I system service calls (symbolic) CHKP and XRST.

Figure 18-3 shows a sample JBP application that connects to a database, makes a restart call, performs database processing, takes periodic checkpoints, and disconnects from the database at the end of the program.

public static void main(String args[]) {
    conn = DriverManager.getConnection(...);          // Establish DB connection
    IMSTransaction.getTransaction().restart();        // Restart application
                                                      // after abend from last
                                                      // checkpoint
    repeat {
        repeat {
            results = statement.executeQuery(...);    // Perform DB processing
            ...
            messageQueue.insertMessage(...);          // Send output messages
            ...
        }
        IMSTransaction.getTransaction().checkpoint(); // Periodic checkpoints
                                                      // divide work
    }
    conn.close();                                     // Close DB connection
    return;
}

Figure 18-3 JBP Application Example

Related Reading: For details about the classes you use to develop a JBP application, see the IMS Java API Specification, which is available on the IMS Java Web site.

ENTERPRISE COBOL INTEROPERABILITY WITH JMP AND JBP APPLICATIONS

IBM Enterprise COBOL for z/OS and OS/390 Version 3 Release 2 supports interoperation between the COBOL and Java languages when running in a JMP or JBP region. With this support, you can:

- Call an object-oriented (OO) COBOL application from an IMS Java application by building the front-end application, which processes messages, in Java and the back end, which processes databases, in OO COBOL.
- Build an OO COBOL application containing a main routine that can invoke Java routines.

RESTRICTION: COBOL applications that run in an IMS Java dependent region must use the AIB interface, which requires that all PCBs in a PSB definition have a name.

You can access COBOL code in a JMP or JBP region because Enterprise COBOL provides object-oriented language syntax that enables you to:

- Define classes with methods and data implemented in COBOL
- Create instances of Java and COBOL classes
- Invoke methods on Java and COBOL objects
- Write classes that inherit from Java classes or other COBOL classes
- Define and invoke overloaded methods

Related Reading: For details on building applications that use Enterprise COBOL and that run in an IMS Java dependent region, see IMS Version 9: IMS Java Guide and Reference and Enterprise COBOL for z/OS and OS/390: Programming Guide.

ACCESSING DB2 UDB FOR z/OS DATABASES FROM JMP OR JBP APPLICATIONS

A JMP or JBP application can access DB2 UDB for z/OS databases by using the DB2 JDBC/SQLJ 2.0 driver or the DB2 JDBC/SQLJ 1.2 driver. The JMP or JBP region that the application runs in must also be defined with DB2 UDB for z/OS attached through the DB2 Recoverable Resource Manager Services attachment facility (RRSAF).

Related Reading: For information about attaching DB2 UDB for z/OS to IMS for JMP or JBP application access to DB2 UDB for z/OS databases, see IMS Version 9: IMS Java Guide and Reference.

DEVELOPING JAVA APPLICATIONS THAT RUN OUTSIDE OF IMS

The following sections briefly describe using Java application programs to access IMS databases from products other than IMS.

Related Reading: For the details of the requirements for running Java applications from these products, see IMS Version 9: IMS Java Guide and Reference.

WebSphere Application Server for z/OS Applications

You can write applications that run on WebSphere Application Server for z/OS and access IMS databases when WebSphere Application Server for z/OS and IMS are on the same LPAR (logical partition).

To deploy an application on WebSphere Application Server for z/OS, you must install the IMS JDBC resource adapter (the IMS Java class libraries) on WebSphere Application Server for z/OS, and configure both IMS open database access (ODBA) and the database resource adapter (DRA). Figure 18-4 shows an Enterprise JavaBean (EJB) accessing IMS data. JDBC or IMS Java hierarchical interface calls are passed to the IMS Java layer, which converts the calls to DL/I calls. The IMS Java layer passes these calls to ODBA, which uses the DRA to access the DL/I region in IMS.

Remote Data Access with WebSphere Application Server Applications

With the IMS Java function remote database services, you can develop and deploy applications that run on non-z/OS platforms and access IMS databases remotely. Unlike other Java solutions for IMS, you do not need to develop a z/OS application or access a legacy z/OS application to have access to IMS data. Therefore, the IMS Java function is an ideal solution for developing IMS applications in a WebSphere environment.

Figure 18-4 WebSphere Application Server for z/OS EJB Using the IMS Java Function

Figure 18-5 shows the components that are required for an enterprise application (in this case, an EJB) on a non-z/OS platform to access IMS DB.

DB2 UDB for z/OS Stored Procedures

You can write DB2 UDB for z/OS Java stored procedures that access IMS databases.

Figure 18-5 The IMS Java Function and WebSphere Application Server Components

To deploy a Java stored procedure on DB2 UDB for z/OS, you must configure IMS Java, ODBA, and the DRA. Figure 18-6 shows a DB2 UDB for z/OS stored procedure using IMS Java, ODBA, and the DRA to access IMS databases.

CICS Applications

Java applications that run on CICS Transaction Server for z/OS can access IMS databases by using IMS Java. Java applications use the IMS Java class libraries to access IMS. Other than the IMS Java layer, access to IMS from a Java application is the same as for a non-Java application. Figure 18-7 on page 321 shows a JCICS application accessing an IMS database using ODBA and IMS Java.

Figure 18-6 DB2 UDB for z/OS Stored Procedure Using IMS Java

Figure 18-7 CICS Application Using IMS Java

Related Reading: For information about configuring CICS for the IMS Java function and for information about developing a Java application that runs on CICS and accesses IMS databases, see IMS Version 9: IMS Java Guide and Reference.

XML STORAGE IN IMS DATABASES

To store XML in an IMS database or to retrieve XML from IMS, you must first generate an XML schema and the IMS Java metadata class using the DLIModel utility. The metadata and schema are used during the storage and retrieval of XML. Your application uses the IMS Java JDBC user-defined functions (UDFs) storeXML and retrieveXML to store XML in IMS databases, create XML from IMS data, or retrieve XML documents from IMS databases. Figure 18-8 on page 322 shows the overall process for storing and retrieving XML in IMS.

retrieveXML UDF

The retrieveXML UDF creates an XML document from an IMS database and returns an object that implements the java.sql.Clob interface. It does not matter to the application whether the data is decomposed into standard IMS segments or the data is held as intact XML documents in the IMS database.

Figure 18-8 Overview of XML Storage in IMS

The Clob JDBC type stores a character large object as a column value in a row of the result set. The getClob method retrieves the XML document from the result set. Figure 18-9 shows the relationship between the retrieveXML UDF and the getClob method.

Figure 18-9 Creating XML Using the retrieveXML UDF and the getClob Method
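To make the relationship in Figure 18-9 concrete, here is a minimal sketch of retrieving an XML document through the retrieveXML UDF and getClob. The PCB name (Dealership), segment name (Model), and field name (ModelKey) are hypothetical, and conn is assumed to be an open IMS JDBC connection.

    import java.sql.*;

    // A minimal sketch, assuming a hypothetical PCB "Dealership" and
    // segment "Model". retrieveXML(Model) asks IMS to compose an XML
    // document rooted at the Model segment; it arrives as a Clob column.
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT retrieveXML(Model) FROM Dealership.Model " +
        "WHERE Model.ModelKey = '2004'");
    while (rs.next()) {
        Clob clob = rs.getClob(1);                               // XML as a Clob
        String xml = clob.getSubString(1, (int) clob.length());  // Materialize
        System.out.println(xml);
    }
    rs.close();
    stmt.close();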

To create an XML document, use a retrieveXML UDF in the SELECT statement of your JDBC call. Pass in the name of the segment that will be the root element of the XML document (for example, retrieveXML(Model)). The dependent segments of the segment that you pass in will be in the generated XML document if they match the criteria listed in the WHERE clause. The segment that you specify to be the root element of the XML document does not have to be the root segment of the IMS record. The dependent segments are mapped to the XML document based on the generated XML schema.

storeXML UDF

The storeXML UDF inserts an XML document into an IMS database at the position in the database that the WHERE clause indicates. IMS, not the application, uses the XML schema and the Java metadata class to determine the physical storage of the data in the database. It does not matter to the application whether the XML is stored intact or decomposed into standard IMS segments.

An XML document must be valid before it can be stored in a database. The storeXML UDF validates the XML document against the XML schema before storing it. If you know that the XML document is valid and you do not want IMS to revalidate it, use the storeXML(false) UDF.

To store an XML document, use the storeXML UDF in the INSERT INTO clause of a JDBC prepared statement. Within a single application program, you can issue INSERT calls that contain storeXML UDFs against multiple PCBs in an application's PSB. The SQL query must have the following syntax:

INSERT INTO PCB.Segment (storeXML()) VALUES (?) WHERE Segment.Field = value
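A sketch of the corresponding insert, again with hypothetical names (Dealership PCB, Model segment, Dealer.DealerNumber qualifier). The document is passed as a String here; that choice is an assumption of this sketch, not a statement of what the driver requires.

    // A minimal sketch, assuming hypothetical PCB "Dealership" and
    // segment "Model"; xmlDocument holds the XML to store as a String.
    // The WHERE clause positions the document within the database, per
    // the syntax shown above.
    PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO Dealership.Model (storeXML()) VALUES (?) " +
        "WHERE Dealership.Dealer.DealerNumber = '1234'");
    ps.setString(1, xmlDocument);
    ps.executeUpdate();
    ps.close();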

Decomposed Storage Mode for XML

In decomposed storage mode, all elements and attributes are stored as regular fields in optionally repeating DL/I segments. During parsing, all tags and other XML syntactic information are checked for validity and then discarded. The parsed data is physically stored in the database as standard IMS data, meaning that each defined field in the segment is of an IMS standard type. Because all XML data is composed of string types (typically Unicode), with type information existing in the validating XML schema, each parsed data element and attribute can be converted to the corresponding IMS standard field value and stored in the target database.

Inversely, during XML retrieval, DL/I segments are retrieved, fields are converted to the destination XML encoding, tags and XML syntactic information (stored in the XML schema) are added, and the XML document is composed. Figure 18-10 shows how XML elements are decomposed and stored in IMS segments.

Decomposed storage mode is suitable for data-centric XML documents, where the elements and attributes from the document typically are either character or numeric items of known short or medium length that lend themselves to mapping to fields in segments. Lengths are typically, though not always, fixed. The XML document data can start at any segment in the hierarchy, which is the root element in the XML document. The segments in the subtree below this segment are also included in the XML document. Elements and attributes of the XML document are stored in the dependent segments of the root element segment. Any other segments in the hierarchy that are not dependent segments of that root element segment are not part of the XML document and, therefore, are not described in the corresponding XML schema.

Figure 18-10 How XML Is Decomposed and Stored in IMS Segments

When an XML document is stored in the database, the values of all segment fields are extracted directly from the XML document. Therefore, any unique key field in any of the XML segments must exist in the XML document as an attribute or simple element. The XML hierarchy is defined by a PCB hierarchy that is based on either a physical or a logical database. Logical relationships are supported for retrieval and composition of XML documents, but not for inserting documents.

For a legacy database, either the whole database hierarchy or any subtree of the hierarchy can be treated as a decomposed data-centric XML document. The segments and fields that comprise the decomposed XML data are determined only by the definition of a mapping (the XML schema) between those segments and fields and a document. One XML schema is generated for each database PCB. Therefore, multiple documents may be derived from a physical database hierarchy through different XML schemas. There are no restrictions on how these multiple documents overlap and share common segments or fields. A new database can also be designed specifically to store a particular type of data-centric XML document in decomposed form.

Intact Storage Mode for XML

In intact storage mode, all or part of an XML document is stored intact in a field. The XML tags are not removed, and IMS does not parse the document. XML documents can be large, so a document can span the primary intact field, which contains the XML root element, and fields in overflow segments. The segments that contain the intact XML documents are standard IMS segments and can be processed like any other IMS segments. The fields, because they contain unparsed XML data, cannot be processed like standard IMS fields. However, intact storage of documents has the following advantages over decomposed storage mode:

- IMS does not need to compose or decompose the XML during storage and retrieval. Therefore, you can process intact XML documents faster than decomposed XML documents.
- You do not need to match the XML document content with IMS field data types or lengths. Therefore, you can store XML documents with different structure, content, and length within the same IMS database.

Intact XML storage requires a new IMS database or an extension of an existing database, because the XML document must be stored in segments and fields that are specifically tailored for storing intact XML. To store all or part of an XML document intact in an IMS database, the database must define a base segment, which contains the root element of the intact XML subtree. The rest of the intact XML subtree is stored in overflow segments, which are child segments of the base segment.

The base segment contains the root element of the intact XML subtree and any decomposed or non-XML fields. The format of the base segment is defined in the DBD. The overflow segment contains only the overflow XML data field, whose format is also defined in the DBD.

Side Segments for Secondary Indexing

IMS cannot search intact XML documents for specific elements within the document. However, you can create a side segment that contains specific XML element data. IMS stores the XML document intact, and also decomposes a specific piece of XML data into a standard IMS segment. This segment can then be searched with a secondary index. Figure 18-11 shows a base segment, an overflow segment, and the side segment for secondary indexing.

Figure 18-11 Intact Storage of XML with a Secondary Index

More Titles of Interest

Implementing ITIL Configuration Management
Larry Klosterboer
paperback, 264 pages

A practical, start-to-finish guide to ITIL configuration management for every IT leader, manager, and practitioner. ITIL-certified architect and solutions provider Larry Klosterboer helps you establish a clear roadmap for success, customize standard processes to your unique needs, and avoid the pitfalls that stand in your way. You'll learn how to plan your implementation, deploy tools and processes, administer ongoing configuration management tasks, refine ITIL information, and leverage it for competitive advantage. Throughout, Klosterboer demystifies ITIL's jargon, illuminates each technique with real-world advice and examples, and helps you focus on the specific techniques that offer maximum business value in your environment.

A Practical Guide to Trusted Computing
David Challener, Kent Yoder, Ryan Catherman, David Safford, Leendert Van Doorn
paperback, 384 pages

The Trusted Platform Module (TPM) makes secure hardware possible by providing a complete, open industry standard for implementing trusted computing hardware subsystems in PCs. Now, there's a start-to-finish guide for every software professional and security specialist who wants to utilize this breakthrough security technology. Authored by innovators who helped create TPM and implement its leading-edge products, this practical book covers all facets of TPM technology: what it can achieve, how it works, and how to write applications for it. The authors offer deep, real-world insights into both TPM and the Trusted Computing Group (TCG) Software Stack. Then, to demonstrate how TPM can solve many of today's most challenging security problems, they present four start-to-finish case studies, each with extensive C-based code examples.

More Titles of Interest

Mainframe Basics for Security Professionals
Ori Pomerantz, Barbara Vander Weele, Mark Nelson, Tim Hahn
hardcover, 192 pages

If you're coming to the IBM System z mainframe platform from UNIX, Linux, or Windows, you need practical guidance on leveraging its unique security capabilities. Now, IBM experts have written the first authoritative book on mainframe security specifically designed to build on your experience in other environments. Even if you've never logged onto a mainframe before, this book will teach you how to run today's z/OS operating system command line and ISPF toolset and use them to efficiently perform every significant security administration task. The authors thoroughly introduce IBM's powerful Resource Access Control Facility (RACF) security subsystem and demonstrate how mainframe security integrates into your enterprise-wide IT security infrastructure. If you're an experienced system administrator or security professional, there's no faster way to extend your expertise into big iron environments.

Developing Quality Technical Information
Gretchen Hargis, Michelle Carey, Ann Kilty Hernandez, Polly Hughes, Deirdre Longo, Shannon Rouiller, Elizabeth Wilde
hardcover, 432 pages

Direct from IBM's own documentation experts, this is the definitive guide to developing outstanding technical documentation, for the Web and for print. Using extensive before-and-after examples, illustrations, and checklists, the authors show exactly how to create documentation that's easy to find, understand, and use. This second edition includes extensive coverage of topic-based information, simplifying search and retrievability, internationalization, visual effectiveness, and much more.


Course Outline. [ORACLE PRESS] All-in-One Course for the OCA/OCP Oracle Database 12c Exams 1Z0-061, 1Z0-062, & 1Z Course Outline [ORACLE PRESS] All-in-One Course for the OCA/OCP Oracle Database 12c Exams 1Z0-061, 1Z0-062, & 1Z0-063 18 Jun 2018 Contents 1. Course Objective 2. Pre-Assessment 3. Exercises, Quizzes, Flashcards

More information

Course Outline. [ORACLE PRESS] All-in-One Course for the OCA/OCP Oracle Database 12c Exams 1Z0-061, 1Z0-062, & 1Z

Course Outline. [ORACLE PRESS] All-in-One Course for the OCA/OCP Oracle Database 12c Exams 1Z0-061, 1Z0-062, & 1Z Course Outline [ORACLE PRESS] All-in-One Course for the OCA/OCP Oracle Database 12c Exams 1Z0-061, 1Z0-062, & 28 Apr 2018 Contents 1. Course Objective 2. Pre-Assessment 3. Exercises, Quizzes, Flashcards

More information

Introduction to IBM Data Studio, Part 1: Get started with IBM Data Studio, Version and Eclipse

Introduction to IBM Data Studio, Part 1: Get started with IBM Data Studio, Version and Eclipse Introduction to IBM Data Studio, Part 1: Get started with IBM Data Studio, Version 1.1.0 and Eclipse Install, work with data perspectives, create connections, and create a project Skill Level: Intermediate

More information

IBM EXAM QUESTIONS & ANSWERS

IBM EXAM QUESTIONS & ANSWERS IBM 000-730 EXAM QUESTIONS & ANSWERS Number: 000-730 Passing Score: 800 Time Limit: 120 min File Version: 69.9 http://www.gratisexam.com/ IBM 000-730 EXAM QUESTIONS & ANSWERS Exam Name: DB2 9 Fundamentals

More information

Introduction to IBM Data Studio, Part 1: Get started with IBM Data Studio, Version and Eclipse

Introduction to IBM Data Studio, Part 1: Get started with IBM Data Studio, Version and Eclipse Introduction to IBM Data Studio, Part 1: Get started with IBM Data Studio, Version 1.1.0 and Eclipse Install, work with data perspectives, create connections, and create a project Skill Level: Intermediate

More information

EMC NetWorker Module for DB2 Version 4.0

EMC NetWorker Module for DB2 Version 4.0 EMC NetWorker Module for DB2 Version 4.0 Administration Guide P/N 300-005-965 REV A03 EMC Corporation Corporate Headquarters: Hopkinton, MA 01748-9103 1-508-435-1000 www.emc.com Copyright 1998-2009 EMC

More information

DB Fundamentals Exam.

DB Fundamentals Exam. IBM 000-610 DB2 10.1 Fundamentals Exam TYPE: DEMO http://www.examskey.com/000-610.html Examskey IBM 000-610 exam demo product is here for you to test the quality of the product. This IBM 000-610 demo also

More information

IBM Tivoli Federated Identity Manager Version Installation Guide GC

IBM Tivoli Federated Identity Manager Version Installation Guide GC IBM Tivoli Federated Identity Manager Version 6.2.2 Installation Guide GC27-2718-01 IBM Tivoli Federated Identity Manager Version 6.2.2 Installation Guide GC27-2718-01 Note Before using this information

More information

Oracle Enterprise Manager

Oracle Enterprise Manager Oracle Enterprise Manager IBM DB2 Database Plug-in User's Guide 13.2.1.0 E73501-01 June 2016 Oracle Enterprise Manager IBM DB2 Database Plug-in User's Guide, 13.2.1.0 E73501-01 Copyright 2015, 2016, Oracle

More information

IBM i Version 7.2. Database Database overview IBM

IBM i Version 7.2. Database Database overview IBM IBM i Version 7.2 Database Database overview IBM IBM i Version 7.2 Database Database overview IBM Note Before using this information and the product it supports, read the information in Notices on page

More information

MySQL Database Administrator Training NIIT, Gurgaon India 31 August-10 September 2015

MySQL Database Administrator Training NIIT, Gurgaon India 31 August-10 September 2015 MySQL Database Administrator Training Day 1: AGENDA Introduction to MySQL MySQL Overview MySQL Database Server Editions MySQL Products MySQL Services and Support MySQL Resources Example Databases MySQL

More information

Oracle EXAM - 1Z Oracle Database 11g: Performance Tuning. Buy Full Product.

Oracle EXAM - 1Z Oracle Database 11g: Performance Tuning. Buy Full Product. Oracle EXAM - 1Z0-054 Oracle Database 11g: Performance Tuning Buy Full Product http://www.examskey.com/1z0-054.html Examskey Oracle 1Z0-054 exam demo product is here for you to test the quality of the

More information

IBM DB2 9 Database Administrator for Linux UNIX and Windows Upgrade.

IBM DB2 9 Database Administrator for Linux UNIX and Windows Upgrade. IBM 000-736 DB2 9 Database Administrator for Linux UNIX and Windows Upgrade http://killexams.com/exam-detail/000-736 with three partitions in which part 1 will be placed in table space TBSP0, part 2 will

More information

EZY Intellect Pte. Ltd., #1 Changi North Street 1, Singapore

EZY Intellect Pte. Ltd., #1 Changi North Street 1, Singapore Oracle Database 12c: Performance Management and Tuning NEW Duration: 5 Days What you will learn In the Oracle Database 12c: Performance Management and Tuning course, learn about the performance analysis

More information

Rapid Recovery DocRetriever for SharePoint User Guide

Rapid Recovery DocRetriever for SharePoint User Guide Rapid Recovery 6.1.3 Table of Contents Introduction to DocRetriever for SharePoint... 6 Using this documentation... 6 About DocRetriever for SharePoint...7 DocRetriever, AppAssure, and Rapid Recovery compatibility...

More information

DB2 9 DBA exam 731 prep, Part 2: Data placement

DB2 9 DBA exam 731 prep, Part 2: Data placement DB2 9 DBA exam 731 prep, Part 2: Skill Level: Introductory Dwaine Snow (dsnow@us.ibm.com) Senior DB2 Product Manager IBM Toronto Lab 05 Jul 2006 Learn how to create DB2 databases, and about the mechanisms

More information

Sql Server 'create Schema' Must Be The First Statement In A Query Batch

Sql Server 'create Schema' Must Be The First Statement In A Query Batch Sql Server 'create Schema' Must Be The First Statement In A Query Batch ALTER VIEW must be the only statement in batch SigHierarchyView) WITH SCHEMABINDING AS ( SELECT (Sig). I'm using SQL Server 2012.

More information

Working with XML and DB2

Working with XML and DB2 Working with XML and DB2 What is XML? XML stands for EXtensible Markup Language XML is a markup language much like HTML XML was designed to carry data, not to display data XML tags are not predefined.

More information

Business Intelligence Tutorial

Business Intelligence Tutorial IBM DB2 Universal Database Business Intelligence Tutorial Version 7 IBM DB2 Universal Database Business Intelligence Tutorial Version 7 Before using this information and the product it supports, be sure

More information

PASS4TEST. IT Certification Guaranteed, The Easy Way! We offer free update service for one year

PASS4TEST. IT Certification Guaranteed, The Easy Way!   We offer free update service for one year PASS4TEST IT Certification Guaranteed, The Easy Way! \ http://www.pass4test.com We offer free update service for one year Exam : 000-611 Title : DB2 10.1 DBA for Linux, UNIX, and Windows Vendors : IBM

More information

IBM DB2 courses, Universidad Cenfotec

IBM DB2 courses, Universidad Cenfotec IBM DB2 courses, Universidad Cenfotec Contents Summary... 2 Database Management (Information Management) course plan... 3 DB2 SQL Workshop... 4 DB2 SQL Workshop for Experienced Users... 5 DB2 9 Database

More information

Manual Trigger Sql Server 2008 Insert Update Delete Selection

Manual Trigger Sql Server 2008 Insert Update Delete Selection Manual Trigger Sql Server 2008 Insert Update Delete Selection Since logon triggers are server-scoped objects, we will create any necessary additional objects in master. WHERE dbs IN (SELECT authenticating_database_id

More information

UNIT I. Introduction

UNIT I. Introduction UNIT I Introduction Objective To know the need for database system. To study about various data models. To understand the architecture of database system. To introduce Relational database system. Introduction

More information