Technology Note. Design for Redshift with ER/Studio Data Architect


Ron Huizenga, Senior Product Manager, Enterprise Architecture & Modeling
April 9, 2018

Table of Contents

Overview ... 2
General Modeling Considerations ... 3
ER/Studio General Platform Support and Customization ... 4
Redshift Limitations ... 7
  DDL Constructs ... 7
    CREATE TABLE ... 7
    ALTER TABLE ... 7
  SQL Statements ... 7
  Unsupported PostgreSQL features ... 7
  Unsupported PostgreSQL data types ... 7
Redshift Specific Constructs that are not in PostgreSQL ... 8
  Distribution Style and Distribution Keys ... 8
  Sort Keys ... 8
  DDL Considerations ... 8
Physical Model ... 9
  Simple Method: PostSQL clause ... 9
  DDL Script Generation
Advanced Method: Attachments and Macros
  Defining the Attachments
Starting with an Existing Implementation
  Reverse Engineer Redshift as Generic DBMS using ODBC
  MetaWizard Import from Redshift into PostgreSQL 8.0 Model
Final Steps
Conclusion
Appendix A: DDL Generated from My TICKIT Model ... 36
Appendix B: TICKIT Database DDL in Standard Redshift Tutorial

Overview

I have been data modeling for over 35 years. As such, I have seen many different data platforms come and go. I initially created data models manually, and subsequently benefitted from a wide variety of modeling tools as they emerged throughout my career. Through all of that change, there has always been one constant: the ability to take advantage of my modeling tool of choice, regardless of which tool it was at the time, to deliver a high-quality design and database implementation. Even if a modeling tool did not specifically list support for a new DBMS platform by name, there was generally a way to make the modeling tool support most of the requirements. In the early days, this was relatively straightforward, since the basis for most DBMS platforms was the ANSI SQL standard. However, as time has progressed, there has been a proliferation of data platforms with different features and widely varied rates of adoption in the marketplace. There is also a constant stream of new versions of each of those platforms, with new capabilities. Many modeling tools simply do not have the ability to keep up. However, ER/Studio is the most capable enterprise data-modeling suite in the market, with capabilities that allow it to excel in conquering those challenges. Maintaining pace with the frequency of change in data platforms is an ongoing challenge. For a modeling vendor such as ourselves, prioritization is key, since it is virtually impossible to incorporate specific support for every single feature of every single platform. There are platforms that we choose not to support as distinct named platforms, simply due to low overall market adoption. With others, there can be a delay to ensure that the platform (or feature) is viable in the marketplace. However, that does not preclude the use of ER/Studio Data Architect in those situations.
In fact, ER/Studio can usually be adapted easily to work with new platforms, including design, forward engineering (DDL generation) and reverse engineering functionality. The high adaptability of ER/Studio Data Architect is due to the flexible and extensible architecture designed into the product from the ground up. This allows modelers to define and create additional metadata for any model construct, as well as to extend generated DDL with additional syntax (pre- and post-SQL). In terms of data platform connectivity, we have native connectors for many platforms, metadata import bridges, and generic capabilities including ODBC connectivity. Capabilities can be extended further with custom datatypes and datatype mappings for the platforms you wish to work with. Full macro programming capability using WinWrap Basic, a language with which many users are already familiar, combined with an extensive automation engine, allows users to customize the capabilities, limited only by their imagination. To illustrate these points, I will now discuss how to apply the capabilities of ER/Studio to the design and implementation of a data warehouse deployed to Amazon Redshift. Redshift is gaining popularity in the marketplace. As such, we will be implementing Redshift-specific platform support as part of our product roadmap. However, you can take advantage of the platform today by utilizing ER/Studio. I will also highlight some of my thought process as a modeler when working with a new platform. I will not discuss how to set up an AWS cluster, Redshift, or the necessary security policies that are needed to connect to Redshift from your computer. This document assumes that the necessary Redshift and AWS configurations are already in place. It also assumes that the necessary ODBC and JDBC drivers have been downloaded from the Amazon Redshift website and installed.

General Modeling Considerations

I would be remiss by not pointing out that far too many teams rush toward a particular physical deployment as a development activity without adequately analyzing and modeling the solution. Those teams incorrectly assume that modeling, and data modeling in particular, is an unnecessary documentation step that simply slows them down. Nothing could be further from the truth, and those who bypass these necessary activities are doing a disservice to themselves and their organizations. It is paramount that we first understand the data and rules from a business standpoint, which we accomplish through logical modeling. Logical data modeling is an important analysis step, intentionally technology agnostic, in order to understand and define the data elements and their relationships to one another from a business perspective. We then derive the physical models from the logical models, adding platform-specific constructs. For larger solutions, we often iterate back and forth between these activities, as different subject areas in the models evolve at a differing pace. With the above in mind, we will begin with an example for a small data warehouse intended for analytics of ticket sales. This particular example is based on the Amazon Redshift tutorial called TICKIT, so that readers who may already be familiar with it can focus purely on the modeling aspects in this discussion. The TICKIT database is a tool to analyze sales activity for the fictional TICKIT web site, where users buy and sell tickets online for different types of events. In particular, analysts can identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. It is small, comprised of 2 fact tables and 5 dimension tables.

The relationships (connectors) are extremely important, since they depict the business relationships between the different concepts.

ER/Studio General Platform Support and Customization

As mentioned previously, ER/Studio has many different customization capabilities. In particular, I make copies of the shipped macros and datatype mappings in my own work folders. That allows me to modify them at low risk, since I can easily overlay them with the original copies if the result is not what I expect. Once I have created the work folders, I can set the paths to use in the Data Architect application by selecting Tools > Options > Directories.

Here are my shipped defaults:

And the modified paths:

I am ready to commence with the physical model, in which I will incorporate platform-specific design constructs. For this example, I wish to design and deploy for Redshift. At this time, Redshift is not one of the specifically named physical platforms in ER/Studio Data Architect, so I choose one that I feel has the highest affinity with Redshift. Generally, to arrive at this decision, I consider similarities in the physical implementation such as DDL syntax, data types, and other characteristics that I consider important.

After some consideration, I decide that the most compatible platform is PostgreSQL, since Redshift is a derivative of PostgreSQL. However, both have evolved independently, with differences on both sides, so I choose PostgreSQL 8.0 as my starting point. There are some differences in datatypes, as well as constructs in Redshift that do not apply to PostgreSQL (covered in the next section). My second choice would likely be Generic DBMS, which I will discuss in the reverse engineering section. When I generate my physical model from the logical model, I would like the datatype mapping to proceed as smoothly as possible, so I can modify the existing datatype mappings if I wish. All of the logical-to-physical datatype mappings are file driven for all the supported platforms in ER/Studio Data Architect. This allows me to add new datatypes, alter existing mapping templates, or make a user-defined copy of a datatype mapping, which is applied to specific models. An example datatype mapping is below:

Redshift Limitations

Those looking at Redshift for the first time will find that it behaves quite differently than the databases they have used in the past. The following is a subset of the capabilities in Postgres (and several other platforms) that are unsupported in Amazon Redshift:

DDL Constructs

CREATE TABLE: Amazon Redshift does not support tablespaces, table partitioning, inheritance, and certain constraints. The Amazon Redshift implementation of CREATE TABLE enables you to define the sort and distribution algorithms for tables to optimize parallel processing.

ALTER TABLE: ALTER COLUMN actions are not supported. ADD COLUMN supports adding only one column in each ALTER TABLE statement.

COPY: the Amazon Redshift COPY command is highly specialized to enable the loading of data from Amazon S3 buckets and Amazon DynamoDB tables.

SQL Statements

INSERT, UPDATE, and DELETE: the WITH clause is not supported.

Unsupported PostgreSQL features

- Table partitioning (range and list partitioning)
- Tablespaces
- Constraints (unique, foreign key, primary key, check, and exclusion constraints)
  NOTE: Unique, primary key, and foreign key constraints are permitted, but are informational only. They are not enforced by the system, but there is still value in defining them, since they are used by the query planner.
- Inheritance
- Indexes
- Collations
- Stored procedures
- Triggers
- Sequences

Unsupported PostgreSQL data types

Arrays, BIT, BIT VARYING, BYTEA, composite types, the date/time types INTERVAL and TIME, enumerated types, geometric types, JSON, SERIAL, BIGSERIAL, SMALLSERIAL, MONEY, object identifier types
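To make the "informational only" point concrete, here is a hedged sketch (column names borrowed from the TICKIT venue dimension; the duplicate values are purely illustrative) of how a primary key can be declared in Redshift even though it is never enforced:

```sql
-- Primary and foreign key constraints are accepted by Redshift,
-- but they are NOT enforced; they serve only as hints to the query planner.
CREATE TABLE venue (
    venueid   SMALLINT NOT NULL,
    venuename VARCHAR(100),
    venuecity VARCHAR(30),
    PRIMARY KEY (venueid)   -- informational only
);

-- The second INSERT would violate the primary key in PostgreSQL,
-- but Redshift accepts it: uniqueness is the data loader's responsibility.
INSERT INTO venue VALUES (1, 'Venue A', 'Toronto');
INSERT INTO venue VALUES (1, 'Venue B', 'Calgary');
```

This is why declaring the keys in the model still matters: the query planner uses them, even though the engine does not police them.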

Redshift Specific Constructs that are not in PostgreSQL

Redshift is a data warehouse platform, optimized for very fast execution of complex analytic queries against very large data sets. Due to the massive amount of data in a data warehouse, specific design parameters facilitate performance optimization.

Distribution Style and Distribution Keys

Distribution style (DISTSTYLE) defines the data distribution style for the entire table. Redshift distributes the rows of a table to the compute nodes that make up the cluster. If the data is heavily skewed, meaning a large amount is placed on a single node, query performance suffers. Even distribution prevents these bottlenecks by ensuring that nodes share the processing load equally, according to the distribution style specified for the table as follows:

KEY - Distribution keys (DISTKEY) mean the data is distributed by the values in the DISTKEY column(s). When join columns of joining tables are set as distribution keys, the joining rows from both tables are collocated on the compute nodes. This allows the optimizer to perform joins more efficiently. When a DISTSTYLE of KEY is specified, one or more DISTKEY columns must be specified for the table.

EVEN means the data in the table spreads evenly across the nodes in a cluster in a round-robin distribution, determined by row IDs. The result is distribution of approximately the same number of rows to each node. EVEN is the default distribution style and is assumed unless a different DISTSTYLE is specified.

ALL means that a copy of the entire table is distributed to every node. This distribution style ensures that all the rows required for any join to this table are available on every node. The downside is that it multiplies storage requirements, increases load time, and increases maintenance times for the table. ALL distribution can improve execution time when used with certain dimension tables where KEY distribution is not appropriate.
It is generally suited to small tables used frequently in joins.

Sort Keys

Sort keys (SORTKEY) determine the order in which rows in a table are stored. If properly applied, sort keys allow large chunks of data to be bypassed during query processing. Reduced data scanning improves query performance significantly.

DDL Considerations

Distribution keys and sort keys can be specified as keywords for a specific column, when only one column is used for the respective key, or as the last clause in the CREATE TABLE DDL. Specifying them as the last clause in the CREATE TABLE statement is the most flexible, since it supports DISTKEY and SORTKEY definitions that have single or multiple columns. One of the critical disciplines in modeling is consistency, so I choose to define them as the last portion of the CREATE TABLE statement. This is also consistent with DISTSTYLE, which is defined only at the table level.

NOTE: If specifying a DISTKEY, DISTSTYLE is not required. The value of KEY will be assumed by default.

Distribution style, distribution keys and sort keys MUST be specified as part of a CREATE TABLE statement. Redshift does not allow them to be changed using an ALTER TABLE statement. If there is a need to change them, a new table must be created with the correct parameters, and the data must be copied from the old table to the new table. The old table is then deleted, and the new table is renamed to the old table's name. Depending on the amount of data involved, combined with the change in distribution style, this can be time consuming.
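As a sketch of the table-level form described above, the keys appear as the final clause of the CREATE TABLE statement, and changing them afterwards requires a deep copy. The columns are taken from the TICKIT sales fact; the specific key choices shown here are illustrative, not the tutorial's exact DDL:

```sql
-- Distribution and sort keys specified as the final clause of CREATE TABLE.
-- DISTSTYLE KEY is implied when a DISTKEY is given.
CREATE TABLE sales (
    salesid   INTEGER NOT NULL,
    listid    INTEGER NOT NULL,
    sellerid  INTEGER NOT NULL,
    pricepaid DECIMAL(8,2)
)
DISTKEY (listid)
SORTKEY (listid, sellerid);

-- Redshift cannot ALTER these keys in place; the change requires a deep copy:
-- create a new table with the desired keys, copy the data, drop, and rename.
CREATE TABLE sales_new (
    salesid   INTEGER NOT NULL,
    listid    INTEGER NOT NULL,
    sellerid  INTEGER NOT NULL,
    pricepaid DECIMAL(8,2)
)
DISTKEY (sellerid)
SORTKEY (sellerid);

INSERT INTO sales_new SELECT * FROM sales;
DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;
```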

Physical Model

Once I have generated the physical model from the logical model as PostgreSQL, using my customized datatype mapping template, the next step is to update the physical model with specifications for distribution style and sort keys, since these are critical to Redshift performance. There are 2 ways to do this. The first is very straightforward. The second is more advanced, but provides additional model documentation and communication benefits.

Simple Method: PostSQL clause

As stated, this is very straightforward. The table editor in the physical model has tabs to specify PreSQL and PostSQL to be used in DDL generation for the table. The following screen shows how I have specified it for the sales (fact) table:

NOTE: The DDL may look incorrect if you click on the full DDL preview tab. Note that there is a semicolon after the initial CREATE TABLE statement, immediately before the PostSQL. We will eliminate this when we generate the DDL script from the physical model.
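The PostSQL text itself is simply the trailing portion of the intended CREATE TABLE statement. For the sales fact table it might look like the following sketch (the key choices are illustrative). Once the statement delimiter is cleared during generation, this text is appended directly to the generated CREATE TABLE to form a single statement:

```sql
-- Entered on the PostSQL tab of the table editor (no leading semicolon).
-- DISTSTYLE is omitted because KEY is assumed when a DISTKEY is provided.
DISTKEY (listid)
SORTKEY (listid, sellerid)
```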

DDL Script Generation

I will now review the DDL generation options that I use for Redshift. The DDL generation preferences can be saved to a file, so that they can be defined once and re-used as required.

On the second screen, I have generation of constraint names turned off. I am also generating primary keys, since they can be used by the query optimizer. In the general options, I am generating foreign key constraints. Just like primary keys, they are informational only, but can be utilized by the query optimizer. I have also cleared the SQL statement delimiter from the field at the bottom of the screen. That will eliminate the separator between our CREATE TABLE statement and the PostSQL clause, combining them into a consolidated CREATE TABLE statement.

The last screen allows me to double-check my specified options and save the DDL preferences to a file. I can then preview the script. Clicking the Finish button will allow me to save the DDL script. Lastly, I will connect to Redshift with my editor of choice and execute the script. The full generated script is shown in Appendix A at the end of the document. Execution of scripts is generally the most common use case, as opposed to directly connecting to a database and generating (or updating) it. In most initiatives, database creation and updates must coincide with other development deliverables. In a data warehouse environment, this will typically include staging area updates, including extract, transform & load (ETL) from source systems to the staging area, as well as from the staging area to the data warehouse itself. All source data stores, staging areas, and the data warehouse itself can (and should) be modelled using ER/Studio Data Architect. Visual Data Lineage modeling in ER/Studio Data Architect can model ETL, including data transformations. When using ER/Studio Enterprise Team Edition, visual data lineage bridges can reverse engineer from many popular ETL tools and platforms.

Advanced Method: Attachments and Macros

The advanced method builds upon the simple approach just discussed. In fact, it uses the PostSQL in exactly the same way to generate the full DDL script. The difference is that the advanced approach provides the ability to specify the constructs separately by using attachments in ER/Studio. I then use a macro to assemble the attachments and create the PostSQL clause.

Defining the Attachments

First, I create attachment types and attachments in the data dictionary. Attachment types are represented by folders, with the specific attachments belonging to that type in the folder. When specifying the attachment type, I also indicate the kinds of model objects that the attachment type applies to. The example I'm discussing today involves only tables, but we may have other attachments that apply to more than one type of object. This is very powerful, since I only need to set up a specific attachment once, and it can be bound to any of the applicable object types. As you can see in this example, I have created a type for Redshift Physical Properties, with specific attachments for DISTKEY, DISTSTYLE and SORTKEY. I will now quickly show the setup of each:

DISTSTYLE has been set up with a description to describe how it is used. The specified content is a list of values, with choices of EVEN or ALL. I have set EVEN as the default value, since it is also the default behaviour in Redshift. I purposely excluded a value for KEY, since it is assumed if we provide a DISTKEY. It also minimizes the amount of information I need to provide for each table.

DISTKEY has been set up with a description to describe how it is used. The specified content is text, to enable entry of the DDL clause containing 1 or more columns that are part of the DISTKEY.

SORTKEY is similar to DISTKEY, as it has a detailed description and the value itself is text, so the full SORTKEY SQL clause can be specified.

To specify the information for a specific table, I simply click on the Attachment Bindings tab within the table editor. I have shown the specified information for the sales (fact) table below:

Once I have defined the attachment values for all my tables, I can execute a macro that I created, which will read the bound attachments for each table and create the PostSQL clause. I have chosen to build the macro so that it uses the tables I have selected on the screen. That allows me the flexibility to update individual tables, groups of tables, or all of them very easily. I can execute the macro from the macro tab, or even add it to the macro shortcuts menu of ER/Studio Data Architect. The macro editing language in ER/Studio is WinWrap Basic, which is very flexible and powerful. A portion of the code is shown below:

Because I have used attachments, I now have the ability to show the important DISTKEY, DISTSTYLE and SORTKEY information directly on the model diagram itself. The model also classifies the tables as facts, dimensions and subcategories within each. Other types can also be specified, such as snowflake, bridge, hierarchy navigation, and undefined. ER/Studio can interrogate the model and automatically identify the table types as well, if desired.

This enables improved communication and understanding, as well as providing high-quality documentation. The major benefit is that the specifications are contained in the model AND generated from the model. Model-driven design is extremely powerful and productive. It yields high-quality, consistent results.

Starting with an Existing Implementation

We don't always have the luxury of starting with a clean-sheet design. As a consultant, many of my engagements first required assessment of an organization's data environment. To do so, I relied heavily on ER/Studio's reverse engineering capabilities, allowing me to construct a blueprint of the data landscape as a basis for analysis, enhancements, and redesign. The exercise of doing so typically uncovers many existing deficiencies and inconsistencies that require remediation, especially if the databases were implemented directly, without data modeling. In my experience, those implemented quickly as part of a development project often have data structures skewed toward the easiest programming solution, rather than reflecting the business needs and rules of the organization. There may also be missing foreign keys and inconsistent use of datatypes. Therefore, reverse engineering and analysis presents an opportunity to introduce significant quality and performance improvements. For a platform that does not have specific named platform support in ER/Studio, we may still have multiple approaches for reverse engineering:

1) Use reverse engineering specified for an earlier version of the same platform. This usually works fine, unless the DBMS vendor has made significant connectivity changes, or dropped significant features in the later release. Specific enhancements aligning with new platform features might not be in Data Architect yet, but they can usually be overcome with approaches I outlined earlier in this document.

2) Generic DBMS support using ODBC. Generally, when an ODBC driver is available for the platform, we can usually connect for reverse engineering purposes. Again, we can augment using approaches already described.

3) MetaWizard import bridges. ER/Studio Enterprise Team Edition includes a wide variety of metadata bridges that can import metadata into ER/Studio models.
For other editions, MetaWizard bridges can be purchased as add-on licenses. For this example using Redshift, I can use option 2 or 3 above. I usually try both, then proceed with the approach that yields the results that I feel are most practical. The choice can vary based on the quality of the implemented database that I am reverse engineering. In this instance, using ODBC to reverse engineer to a Generic DBMS could offer some distinct advantages. If the implemented database has specified primary keys (even though they are for documentation), the ODBC driver will recognize them and create the model accordingly. It will also recognize foreign key constraints if specified (again, only for documentation). Even if they were not, ER/Studio has the capability to infer foreign key relationships through column name matching. The MetaWizard for Redshift, by contrast, imports the metadata, creating a PostgreSQL 8.0 physical model and a logical model. That corresponds to the platform choice I made earlier, when designing from scratch. However, the MetaWizard only pulls the physical specifications, so we will not get primary keys or foreign key relationships. Thus, I will need to expend a bit of effort to analyze and update the model accordingly. I will now show each approach. The database that I reverse engineer was created from the DDL in Appendix B, which is the same DDL used by the Amazon Redshift TICKIT tutorial.

Reverse Engineer Redshift as Generic DBMS using ODBC

In ER/Studio Data Architect, I begin by creating a new model, selecting the option to reverse engineer from an existing database. This will step me through the reverse engineering wizard. On the first screen, I specify ODBC and will select (or create) my data source.

Clicking the ODBC setup button will show the data sources:

The configuration of the data source used in this example is as follows:

When setting up an ODBC driver for Redshift, it is important to review and alter the default data type configuration, which is:

I have obtained the best results by unchecking the data type options. Bypassing this step will result in incorrect data type mapping from Redshift. In particular, string lengths are impacted. Ensure that the Unicode option is off (unchecked), or string data types will come into the model with declared lengths that are twice as long as they are in the database itself. After selecting or specifying the data source, proceeding will result in the following pop-up message:

This is normal behaviour, since Redshift 8.x is not a specific named DBMS platform in ER/Studio. Clicking on the Yes option will proceed normally.

On screen 2 of the wizard, I include user tables only. On screen 3 of the wizard, I select the tables from my TICKIT example. I have excluded the automatic health check table that Redshift creates as part of the implementation.

I am hopeful that the database was created with primary keys defined. Therefore, I select the option to infer foreign keys from names, which will match column names across tables to infer relationships. I am also able to save the reverse engineering parameters into a quick launch file.

Clicking the Finish button will execute the process, providing a progress log.

Screen 4 of the wizard also provides layout options for the resulting diagram. The following model view shows my reverse engineered Redshift database, after applying some very basic layout changes. I would now proceed with my analysis and modeling, adding constructs such as DISTSTYLE, DISTKEY and SORTKEY as discussed previously.

MetaWizard Import from Redshift into PostgreSQL 8.0 Model

I can import from Redshift using the MetaWizard by selecting: File -> Import File -> From External Metadata

NOTE: To use the MetaWizard for Redshift, the Redshift 4.1 JDBC driver is required. The current MetaWizard version is not compatible with the newer Redshift JDBC driver, which Amazon released subsequent to the current MetaWizard build. Please see the help text from the MetaWizard driver, shown below, stating that the 4.1 driver is required.

Clicking the dropdown field on the first screen allows me to select Amazon Redshift from an extensive list of platforms and other data sources.

For my TICKIT example, I specified the parameters as follows:

On the second screen, there are additional parameters to specify, including the name of the model file to create. I have the option to reverse engineer to a relational or dimensional model, just as I did using ODBC.

A detailed import progress screen is then presented.

Unlike the Generic DBMS approach using ODBC, primary keys, foreign keys and relationships are not created by the MetaWizard import, even if they exist in the database as documentation. Therefore, additional effort is required to specify the keys and relationships, as well as DISTSTYLE, DISTKEY, SORTKEY and other parameters. If the Redshift database does not contain primary and foreign keys, I would personally use the MetaWizard, since it creates a PostgreSQL 8.0 physical model, which I would prefer to use going forward.

NOTE: ER/Studio is very flexible. I can change a Generic DBMS physical model to another physical platform (including PostgreSQL), or vice versa. It will convert the datatypes to those supported by the target platform. I can also use the logical model to generate multiple physical models for different platforms. This is a benefit of the advanced ER/Studio architecture, which supports loosely coupled models.

Final Steps

Whenever I reverse engineer a data model from a database, I always perform additional validation checks to ensure that I have specified parameters and options correctly. In cases like this example, where I know certain constructs are not included in reverse engineering, I will analyze the internal database catalog tables, building some queries to extract additional metadata. I can then use that metadata as a guide to make additional model changes manually. Here are a couple of helpful Redshift catalog queries.

To extract DISTSTYLE specifications:

select relname, reldiststyle from pg_class where relnamespace = 2200;

relname        reldiststyle
category_pkey  0
date_pkey      0
venue_pkey     0
event_pkey     0
category       8
venue          8
users_pkey     0
listing_pkey   0
sales_pkey     0
date           1
users          0
event          1
sales          1
listing        1

The values above can be decoded as follows:

RELDISTSTYLE   Distribution style
0              EVEN
1              KEY
8              ALL
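If preferred, the decoding can be done directly in the catalog query itself. Here is a hedged sketch using a standard CASE expression against the same pg_class columns:

```sql
-- Translate the reldiststyle codes into readable names in the query itself.
SELECT relname,
       CASE reldiststyle
           WHEN 0 THEN 'EVEN'
           WHEN 1 THEN 'KEY'
           WHEN 8 THEN 'ALL'
           ELSE 'UNKNOWN'
       END AS diststyle
FROM pg_class
WHERE relnamespace = 2200;   -- the default 'public' schema
```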

To extract existing DISTKEY and SORTKEY specifications:

select * from pg_table_def where schemaname = 'public';

schemaname  tablename      column          type                         encoding  distkey  sortkey  notnull
public      category       catid           smallint                     none      FALSE    1        TRUE
public      category       catgroup        character varying(10)        lzo       FALSE    0        FALSE
public      category       catname         character varying(10)        lzo       FALSE    0        FALSE
public      category       catdesc         character varying(50)        lzo       FALSE    0        FALSE
public      category_pkey  catid           smallint                     none      FALSE    1        FALSE
public      date           dateid          smallint                     none      TRUE     1        TRUE
public      date           caldate         date                         lzo       FALSE    0        TRUE
public      date           day             character(3)                 lzo       FALSE    0        TRUE
public      date           week            smallint                     lzo       FALSE    0        TRUE
public      date           month           character(5)                 lzo       FALSE    0        TRUE
public      date           qtr             character(5)                 lzo       FALSE    0        TRUE
public      date           year            smallint                     lzo       FALSE    0        TRUE
public      date           holiday         boolean                      none      FALSE    0        FALSE
public      date_pkey      dateid          smallint                     none      TRUE     1        FALSE
public      event          eventid         integer                      none      TRUE     1        TRUE
public      event          venueid         smallint                     lzo       FALSE    0        FALSE
public      event          catid           smallint                     lzo       FALSE    0        FALSE
public      event          dateid          smallint                     lzo       FALSE    0        FALSE
public      event          eventname       character varying(200)       lzo       FALSE    0        FALSE
public      event          starttime       timestamp without time zone  lzo       FALSE    0        FALSE
public      event_pkey     eventid         integer                      none      TRUE     1        FALSE
public      listing        listid          integer                      none      TRUE     1        TRUE
public      listing        sellerid        integer                      lzo       FALSE    0        FALSE
public      listing        eventid         integer                      lzo       FALSE    0        FALSE
public      listing        dateid          smallint                     lzo       FALSE    0        FALSE
public      listing        numtickets      smallint                     lzo       FALSE    0        TRUE
public      listing        priceperticket  numeric(8,2)                 lzo       FALSE    0        FALSE
public      listing        totalprice      numeric(8,2)                 lzo       FALSE    0        FALSE
public      listing        listtime        timestamp without time zone  lzo       FALSE    0        FALSE
public      listing_pkey   listid          integer                      none      TRUE     1        FALSE
public      sales          salesid         integer                      lzo       TRUE     0        TRUE
public      sales          listid          integer                      none      FALSE    1        FALSE
public      sales          sellerid        integer                      none      FALSE    2        FALSE
public      sales          buyerid         integer                      lzo       FALSE    0        FALSE
public      sales          eventid         integer                      lzo       FALSE    0        FALSE
public      sales          dateid          smallint                     lzo       FALSE    0        FALSE
public      sales          qtysold         smallint                     lzo       FALSE    0        TRUE
public      sales          pricepaid       numeric(8,2)                 lzo       FALSE    0        FALSE
public      sales          commission      numeric(8,2)                 lzo       FALSE    0        FALSE
public      sales          saletime        timestamp without time zone  lzo       FALSE    0        FALSE
public      sales_pkey     salesid         integer                      lzo       TRUE     0        FALSE
public      users          userid          integer                      none      FALSE    1        TRUE
public      users          username        character(8)                 lzo       FALSE    0        FALSE
public      users          firstname       character varying(30)        lzo       FALSE    0        FALSE
public      users          lastname        character varying(30)        lzo       FALSE    0        FALSE
public      users          city            character varying(30)        lzo       FALSE    0        FALSE
public      users          state           character(2)                 lzo       FALSE    0        FALSE
public      users                          character varying(100)       lzo       FALSE    0        FALSE
public      users          phone           character(14)                lzo       FALSE    0        FALSE
public      users          likesports      boolean                      none      FALSE    0        FALSE
public      users          liketheatre     boolean                      none      FALSE    0        FALSE
public      users          likeconcerts    boolean                      none      FALSE    0        FALSE
public      users          likejazz        boolean                      none      FALSE    0        FALSE
public      users          likeclassical   boolean                      none      FALSE    0        FALSE
public      users          likeopera       boolean                      none      FALSE    0        FALSE
public      users          likerock        boolean                      none      FALSE    0        FALSE
public      users          likevegas       boolean                      none      FALSE    0        FALSE
public      users          likebroadway    boolean                      none      FALSE    0        FALSE
public      users          likemusicals    boolean                      none      FALSE    0        FALSE
public      users_pkey     userid          integer                      none      FALSE    1        FALSE
public      venue          venueid         smallint                     none      FALSE    1        TRUE
public      venue          venuename       character varying(100)       lzo       FALSE    0        FALSE
public      venue          venuecity       character varying(30)        lzo       FALSE    0        FALSE
public      venue          venuestate      character(2)                 lzo       FALSE    0        FALSE
public      venue          venueseats      integer                      lzo       FALSE    0        FALSE
public      venue_pkey     venueid         smallint                     none      FALSE    1        FALSE
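To reduce the output to just the rows needed for model updates, the same catalog view can be filtered to columns that participate in a distribution or sort key. This is a sketch; note that "column" must be quoted, since it is a reserved word:

```sql
-- Only the columns that are a DISTKEY or part of a SORTKEY.
SELECT tablename, "column", distkey, sortkey
FROM pg_table_def
WHERE schemaname = 'public'
  AND (distkey = true OR sortkey <> 0)
ORDER BY tablename, sortkey;
```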

To validate column datatypes and sequencing in tables, it can also be helpful to run the following query (partial results shown):

select * from svv_columns where table_schema = 'public' order by table_name, ordinal_position;

table_catalog  table_schema  table_name  column_name     ordinal_position  column_default  is_nullable  data_type                    character_maximum_length
dev            public        category    catid           1                                 NO           smallint
dev            public        category    catgroup        2                                 YES          character varying            10
dev            public        category    catname         3                                 YES          character varying            10
dev            public        category    catdesc         4                                 YES          character varying            50
dev            public        date        dateid          1                                 NO           smallint
dev            public        date        caldate         2                                 NO           date
dev            public        date        day             3                                 NO           character                    3
dev            public        date        week            4                                 NO           smallint
dev            public        date        month           5                                 NO           character                    5
dev            public        date        qtr             6                                 NO           character                    5
dev            public        date        year            7                                 NO           smallint
dev            public        date        holiday         8                 FALSE           YES          boolean
dev            public        event       eventid         1                                 NO           integer
dev            public        event       venueid         2                                 YES          smallint
dev            public        event       catid           3                                 YES          smallint
dev            public        event       dateid          4                                 YES          smallint
dev            public        event       eventname       5                                 YES          character varying            200
dev            public        event       starttime       6                                 YES          timestamp without time zone
dev            public        listing     listid          1                                 NO           integer
dev            public        listing     sellerid        2                                 YES          integer
dev            public        listing     eventid         3                                 YES          integer
dev            public        listing     dateid          4                                 YES          smallint
dev            public        listing     numtickets      5                                 NO           smallint
dev            public        listing     priceperticket  6                                 YES          numeric
dev            public        listing     totalprice      7                                 YES          numeric
dev            public        listing     listtime        8                                 YES          timestamp without time zone
dev            public        sales       salesid         1                                 NO           integer
dev            public        sales       listid          2                                 YES          integer
dev            public        sales       sellerid        3                                 YES          integer
dev            public        sales       buyerid         4                                 YES          integer
dev            public        sales       eventid         5                                 YES          integer
dev            public        sales       dateid          6                                 YES          smallint
dev            public        sales       qtysold         7                                 NO           smallint
dev            public        sales       pricepaid       8                                 YES          numeric
dev            public        sales       commission      9                                 YES          numeric
dev            public        sales       saletime        10                                YES          timestamp without time zone
dev            public        users       userid          1                                 NO           integer
dev            public        users       username        2                                 YES          character                    8
dev            public        users       firstname       3                                 YES          character varying            30
dev            public        users       lastname        4                                 YES          character varying            30
dev            public        users       city            5                                 YES          character varying            30
dev            public        users       state           6                                 YES          character                    2
dev            public        users       email           7                                 YES          character varying            100
dev            public        users       phone           8                                 YES          character                    14
dev            public        users       likesports      9                                 YES          boolean
dev            public        users       liketheatre     10                                YES          boolean
dev            public        users       likeconcerts    11                                YES          boolean
dev            public        users       likejazz        12                                YES          boolean
dev            public        users       likeclassical   13                                YES          boolean
dev            public        users       likeopera       14                                YES          boolean
dev            public        users       likerock        15                                YES          boolean
dev            public        users       likevegas       16                                YES          boolean
dev            public        users       likebroadway    17                                YES          boolean
dev            public        users       likemusicals    18                                YES          boolean
dev            public        venue       venueid         1                                 NO           smallint
dev            public        venue       venuename       2                                 YES          character varying            100
dev            public        venue       venuecity       3                                 YES          character varying            30
dev            public        venue       venuestate      4                                 YES          character                    2
dev            public        venue       venueseats      5                                 YES          integer
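One practical use of this output is to compare the catalog's ordinal positions against the column order defined in the model. A small sketch, assuming the catalog rows have already been fetched as (table_name, column_name, ordinal_position) tuples:

```python
# Verify that catalog column order (by ordinal_position) matches the model's
# ordered column list for a table.
def column_order_matches(catalog_rows, model_columns):
    ordered = sorted(catalog_rows, key=lambda row: row[2])
    return [name for _table, name, _pos in ordered] == model_columns

# svv_columns-style rows for the category table, deliberately unordered
catalog = [
    ("category", "catdesc",  4),
    ("category", "catid",    1),
    ("category", "catname",  3),
    ("category", "catgroup", 2),
]
print(column_order_matches(catalog, ["catid", "catgroup", "catname", "catdesc"]))
# True
```

Any False result flags a table whose physical column sequence has drifted from the model.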

Conclusion

Throughout this document I have highlighted a portion of ER/Studio's advanced architecture and modeling capabilities. While this is the end of this particular topic, it marks the beginning of many capabilities that you can discover and apply in your own environment. For example, rather than manually comparing the results of the Redshift catalog queries to my model, I could take an even more advanced approach: first download the query results into a file, such as an Excel workbook, then build reusable macros that read from the file and create the additional metadata in the model for me. The approach is very similar to how I built the macro to populate the PostSQL clause from a table's bound attachments.

I hope you have found this document to be a helpful guide as you embark upon modeling and generating Redshift data warehouses with ER/Studio. These principles apply to other platforms as well. ER/Studio's advanced architecture allows you to define and create additional metadata for any model construct, as well as to extend generated DDL with additional syntax (pre and post SQL). Capabilities are further extended with custom datatypes and datatype mappings for the platforms you wish to work with. Full macro programming capability using WinWrap Basic, combined with an extensive automation engine, lets you customize these capabilities, limited only by your imagination. These capabilities will provide you with huge productivity benefits. Have fun impressing your boss with your new modeling superpowers!
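In outline, the file-driven approach described above might look like the following sketch. It is illustrative only: the export layout and the attachment name DistributionKey are hypothetical, not ER/Studio API names, and the macro itself would be written in WinWrap Basic rather than Python.

```python
import csv, io

# Read exported catalog rows (tablename, column, distkey, sortkey) from a CSV
# export and collect the metadata values to attach to each table in the model.
def load_table_metadata(csv_text):
    metadata = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        t = metadata.setdefault(row["tablename"], {})
        if row["distkey"] == "TRUE":
            t["DistributionKey"] = row["column"]      # hypothetical attachment name
        if row["sortkey"] not in ("", "0"):
            t.setdefault("SortKey", []).append(row["column"])
    return metadata

# A two-row excerpt of a hypothetical pg_table_def export
export = """tablename,column,distkey,sortkey
date,dateid,TRUE,1
date,caldate,FALSE,0
"""
print(load_table_metadata(export))
# {'date': {'DistributionKey': 'dateid', 'SortKey': ['dateid']}}
```

A WinWrap Basic macro following this pattern would loop over the returned dictionary and bind the values as attachments on the matching model tables.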

Appendix A: DDL Generated from My TICKIT Model

-- TABLE: category
CREATE TABLE category(
    catid           int2           NOT NULL,
    catgroup        varchar(10),
    catname         varchar(10),
    catdesc         varchar(50),
    PRIMARY KEY (catid)
)
DISTSTYLE ALL
SORTKEY(catid)
;

-- TABLE: date
CREATE TABLE date(
    dateid          int2           NOT NULL,
    caldate         date           NOT NULL,
    day             char(3)        NOT NULL,
    week            int2           NOT NULL,
    month           char(5)        NOT NULL,
    qtr             char(5)        NOT NULL,
    year            int2           NOT NULL,
    holiday         boolean        DEFAULT false,
    PRIMARY KEY (dateid)
)
DISTKEY(dateid)
SORTKEY(dateid)
;

-- TABLE: venue
CREATE TABLE venue(
    venueid         int2           NOT NULL,
    venuename       varchar(100),
    venuecity       varchar(30),
    venuestate      char(2),
    venueseats      int4,
    PRIMARY KEY (venueid)
)
DISTSTYLE ALL
SORTKEY(venueid)
;

-- TABLE: event
CREATE TABLE event(
    eventid         int4           NOT NULL,
    venueid         int2,
    catid           int2,
    dateid          int2,
    eventname       varchar(200),
    starttime       timestamp,
    PRIMARY KEY (eventid),
    FOREIGN KEY (catid) REFERENCES category(catid),
    FOREIGN KEY (dateid) REFERENCES date(dateid),
    FOREIGN KEY (venueid) REFERENCES venue(venueid)
)
DISTKEY(eventid)
SORTKEY(eventid)
;

-- TABLE: users
CREATE TABLE users(
    userid          int4           NOT NULL,
    username        char(8),
    firstname       varchar(30),
    lastname        varchar(30),
    city            varchar(30),
    state           char(2),
    email           varchar(100),
    phone           char(14),
    likesports      boolean,
    liketheatre     boolean,
    likeconcerts    boolean,
    likejazz        boolean,
    likeclassical   boolean,
    likeopera       boolean,
    likerock        boolean,
    likevegas       boolean,
    likebroadway    boolean,
    likemusicals    boolean,
    PRIMARY KEY (userid)
)
DISTSTYLE EVEN
SORTKEY(userid)
;

-- TABLE: listing
CREATE TABLE listing(
    listid          int4           NOT NULL,
    sellerid        int4,
    eventid         int4,
    dateid          int2,
    numtickets      int2           NOT NULL,
    priceperticket  numeric(8, 2),
    totalprice      numeric(8, 2),
    listtime        timestamp,
    PRIMARY KEY (listid),
    FOREIGN KEY (eventid) REFERENCES event(eventid),
    FOREIGN KEY (sellerid) REFERENCES users(userid),
    FOREIGN KEY (dateid) REFERENCES date(dateid)
)
DISTKEY(listid)
SORTKEY(listid)
;

-- TABLE: sales
CREATE TABLE sales(
    salesid         int4           NOT NULL,
    listid          int4,
    sellerid        int4,
    buyerid         int4,
    eventid         int4,
    dateid          int2,
    qtysold         int2           NOT NULL,
    pricepaid       numeric(8, 2),
    commission      numeric(8, 2),
    saletime        timestamp,
    PRIMARY KEY (salesid),
    FOREIGN KEY (dateid) REFERENCES date(dateid),
    FOREIGN KEY (sellerid) REFERENCES users(userid),
    FOREIGN KEY (buyerid) REFERENCES users(userid),
    FOREIGN KEY (eventid) REFERENCES event(eventid),
    FOREIGN KEY (listid) REFERENCES listing(listid)
)
DISTKEY(salesid)
SORTKEY(listid, sellerid)
;

Appendix B: TICKIT Database DDL in Standard Redshift Tutorial

create table users(
    userid integer not null distkey sortkey,
    username char(8),
    firstname varchar(30),
    lastname varchar(30),
    city varchar(30),
    state char(2),
    email varchar(100),
    phone char(14),
    likesports boolean,
    liketheatre boolean,
    likeconcerts boolean,
    likejazz boolean,
    likeclassical boolean,
    likeopera boolean,
    likerock boolean,
    likevegas boolean,
    likebroadway boolean,
    likemusicals boolean);

create table venue(
    venueid smallint not null distkey sortkey,
    venuename varchar(100),
    venuecity varchar(30),
    venuestate char(2),
    venueseats integer);

create table category(
    catid smallint not null distkey sortkey,
    catgroup varchar(10),
    catname varchar(10),
    catdesc varchar(50));

create table date(
    dateid smallint not null distkey sortkey,
    caldate date not null,
    day character(3) not null,
    week smallint not null,
    month character(5) not null,
    qtr character(5) not null,
    year smallint not null,
    holiday boolean default('N'));

create table event(
    eventid integer not null distkey,
    venueid smallint not null,
    catid smallint not null,
    dateid smallint not null sortkey,
    eventname varchar(200),
    starttime timestamp);

create table listing(
    listid integer not null distkey,
    sellerid integer not null,
    eventid integer not null,
    dateid smallint not null sortkey,
    numtickets smallint not null,
    priceperticket decimal(8,2),
    totalprice decimal(8,2),
    listtime timestamp);

create table sales(
    salesid integer not null,
    listid integer not null distkey,
    sellerid integer not null,
    buyerid integer not null,
    eventid integer not null,
    dateid smallint not null sortkey,
    qtysold smallint not null,
    pricepaid decimal(8,2),
    commission decimal(8,2),
    saletime timestamp);
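Note that the two appendices declare the same kinds of keys in different but equivalent syntaxes: the generated DDL uses table-level DISTKEY(...)/SORTKEY(...) clauses, while the tutorial DDL uses column-level distkey/sortkey modifiers. A rough sketch (not an ER/Studio feature, just an illustration) of normalizing both forms so the key choices can be compared side by side:

```python
import re

# Collect distribution and sort key columns from a Redshift CREATE TABLE
# statement, handling both the table-level syntax (DISTKEY(col), SORTKEY(a, b))
# and the column-level syntax (colname type not null distkey sortkey).
def extract_keys(ddl):
    keys = {"distkey": [], "sortkey": []}
    for raw in ddl.splitlines():
        line = raw.strip().lower()
        # Table-level clause on its own line
        m = re.match(r"(distkey|sortkey)\s*\(([^)]*)\)", line)
        if m:
            keys[m.group(1)] = [c.strip() for c in m.group(2).split(",")]
            continue
        # Column definition line carrying distkey/sortkey modifiers
        cm = re.match(r"(\w+)\s+\w+", line)
        if cm:
            for kw in ("distkey", "sortkey"):
                if re.search(r"\b%s\b" % kw, line):
                    keys[kw].append(cm.group(1))
    return keys

generated = """CREATE TABLE sales(
    salesid int4 NOT NULL,
    PRIMARY KEY (salesid)
)
DISTKEY(salesid)
SORTKEY(listid, sellerid)
;"""
tutorial = """create table users(
    userid integer not null distkey sortkey,
    username char(8));"""

print(extract_keys(generated))  # {'distkey': ['salesid'], 'sortkey': ['listid', 'sellerid']}
print(extract_keys(tutorial))   # {'distkey': ['userid'], 'sortkey': ['userid']}
```

Running this over both appendices makes it easy to spot where the model's distribution choices intentionally diverge from the tutorial's.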


More information

Oracle 1Z0-640 Exam Questions & Answers

Oracle 1Z0-640 Exam Questions & Answers Oracle 1Z0-640 Exam Questions & Answers Number: 1z0-640 Passing Score: 800 Time Limit: 120 min File Version: 28.8 http://www.gratisexam.com/ Oracle 1Z0-640 Exam Questions & Answers Exam Name: Siebel7.7

More information

Chris Claterbos, Vlamis Software Solutions, Inc.

Chris Claterbos, Vlamis Software Solutions, Inc. ORACLE WAREHOUSE BUILDER 10G AND OLAP WHAT S NEW Chris Claterbos, Vlamis Software Solutions, Inc. INTRODUCTION With the use of the new features found in recently updated Oracle s Warehouse Builder (OWB)

More information

EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH

EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH White Paper EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH A Detailed Review EMC SOLUTIONS GROUP Abstract This white paper discusses the features, benefits, and use of Aginity Workbench for EMC

More information

EDB Postgres Migration Portal Guide Version 1.0

EDB Postgres Migration Portal Guide Version 1.0 EDB Postgres Migration Portal Guide Version 1.0 October 23, 2018 EDB Postgres Migration Portal Guide by EnterpriseDB Corporation Copyright 2018 EnterpriseDB Corporation. All rights reserved. EnterpriseDB

More information

OSR Administration 3.7 User Guide. Updated:

OSR Administration 3.7 User Guide. Updated: OSR Administration 3.7 User Guide Updated: 2013-01-31 Copyright OneStop Reporting AS www.onestopreporting.com Table of Contents Introduction... 1 Who should read this manual... 1 What s included in this

More information

Microsoft Access Database How to Import/Link Data

Microsoft Access Database How to Import/Link Data Microsoft Access Database How to Import/Link Data Firstly, I would like to thank you for your interest in this Access database ebook guide; a useful reference guide on how to import/link data into an Access

More information

SAS/Warehouse Administrator Usage and Enhancements Terry Lewis, SAS Institute Inc., Cary, NC

SAS/Warehouse Administrator Usage and Enhancements Terry Lewis, SAS Institute Inc., Cary, NC SAS/Warehouse Administrator Usage and Enhancements Terry Lewis, SAS Institute Inc., Cary, NC ABSTRACT SAS/Warehouse Administrator software makes it easier to build, maintain, and access data warehouses

More information

Manual Trigger Sql Server Example Update Column Value

Manual Trigger Sql Server Example Update Column Value Manual Trigger Sql Server Example Update Column Value Is it possible to check a column value, then before insert or update change can you direct me to a simple tutorial for trigger, user3400838 Jun 30

More information

Data about data is database Select correct option: True False Partially True None of the Above

Data about data is database Select correct option: True False Partially True None of the Above Within a table, each primary key value. is a minimal super key is always the first field in each table must be numeric must be unique Foreign Key is A field in a table that matches a key field in another

More information

PowerPlanner manual. Contents. Powe r Planner All rights reserved

PowerPlanner manual. Contents. Powe r Planner All rights reserved PowerPlanner manual Copyright Powe r Planner All rights reserved Contents Installation... 3 Setup and prerequisites... 3 Licensing and activation... 3 Restoring examples manually... 4 Building PowerPivot

More information

Dynamically build connection objects for Microsoft Access databases in SQL Server Integration Services SSIS

Dynamically build connection objects for Microsoft Access databases in SQL Server Integration Services SSIS Dynamically build connection objects for Microsoft Access databases in SQL Server Integration Services SSIS Problem As a portion of our daily data upload process, we receive data in the form of Microsoft

More information

OneStop Reporting 4.5 OSR Administration User Guide

OneStop Reporting 4.5 OSR Administration User Guide OneStop Reporting 4.5 OSR Administration User Guide Doc. Version 1.2 Updated: 10-Dec-14 Copyright OneStop Reporting AS Contents Introduction... 1 Who should read this manual... 1 What s included in this

More information

Importing source database objects from a database

Importing source database objects from a database Importing source database objects from a database We are now at the point where we can finally import our source database objects, source database objects. We ll walk through the process of importing from

More information

Indexing & Views. Monday, March 6, 2017

Indexing & Views. Monday, March 6, 2017 Indexing & Views Monday, March 6, 2017 Agenda Announcements Reading Quiz Indexing Views Midterm details Announcements Next class: Midterm Midterm location: PHR 2.108 Review session: Wed 12-1pm @ GDC 2.210

More information

Best Practice for Creation and Maintenance of a SAS Infrastructure

Best Practice for Creation and Maintenance of a SAS Infrastructure Paper 2501-2015 Best Practice for Creation and Maintenance of a SAS Infrastructure Paul Thomas, ASUP Ltd. ABSTRACT The advantage of using metadata to control and maintain data and access to data on databases,

More information

Management Reports Centre. User Guide. Emmanuel Amekuedi

Management Reports Centre. User Guide. Emmanuel Amekuedi Management Reports Centre User Guide Emmanuel Amekuedi Table of Contents Introduction... 3 Overview... 3 Key features... 4 Authentication methods... 4 System requirements... 5 Deployment options... 5 Getting

More information

CHAPTER4 CONSTRAINTS

CHAPTER4 CONSTRAINTS CHAPTER4 CONSTRAINTS LEARNING OBJECTIVES After completing this chapter, you should be able to do the following: Explain the purpose of constraints in a table Distinguish among PRIMARY KEY, FOREIGN KEY,

More information

Reference Architecture Guide: Deploying Informatica PowerCenter on Amazon Web Services

Reference Architecture Guide: Deploying Informatica PowerCenter on Amazon Web Services Reference Architecture Guide: Deploying Informatica PowerCenter on Amazon Web Services Copyright Informatica LLC 2016 Informatica, the Informatica logo [and any other trademarks appearing in the document]

More information

Access Intermediate

Access Intermediate Access 2010 - Intermediate 103-134 Advanced Queries Quick Links Overview Pages AC116 AC117 Selecting Fields Pages AC118 AC119 AC122 Sorting Results Pages AC125 AC126 Specifying Criteria Pages AC132 AC134

More information

Get Table Schema In Sql Server 2005 Modify. Column Datatype >>>CLICK HERE<<<

Get Table Schema In Sql Server 2005 Modify. Column Datatype >>>CLICK HERE<<< Get Table Schema In Sql Server 2005 Modify Column Datatype Applies To: SQL Server 2014, SQL Server 2016 Preview Specifies the properties of a column that are added to a table by using ALTER TABLE. Is the

More information

ER/Studio Enterprise Portal User Guide

ER/Studio Enterprise Portal User Guide ER/Studio Enterprise Portal 1.1.1 User Guide Copyright 1994-2009 Embarcadero Technologies, Inc. Embarcadero Technologies, Inc. 100 California Street, 12th Floor San Francisco, CA 94111 U.S.A. All rights

More information

Atlassian Confluence 5 Essentials

Atlassian Confluence 5 Essentials Atlassian Confluence 5 Essentials Stefan Kohler Chapter No. 5 "Collaborating in Confluence" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter

More information

DISCOVERY HUB RELEASE DOCUMENTATION

DISCOVERY HUB RELEASE DOCUMENTATION DISCOVERY HUB 18.10 RELEASE DOCUMENTATION Contents Introduction... 3 New Features... 4 Operational Data Exchange (ODX) with support for Azure Data Lake... 4 Azure SQL Database Managed Instance... 4 Shared

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Database Assessment for PDMS

Database Assessment for PDMS Database Assessment for PDMS Abhishek Gaurav, Nayden Markatchev, Philip Rizk and Rob Simmonds Grid Research Centre, University of Calgary. http://grid.ucalgary.ca 1 Introduction This document describes

More information

Accessing the Vault. Parent article: Altium Vault Technology. Mod. ifi. Adm. Sep 13,

Accessing the Vault. Parent article: Altium Vault Technology. Mod. ifi. Adm. Sep 13, Frozen Content Mod ifi ed by Adm in on Sep 13, 201 7 Parent article: Altium Vault Technology This page contains information regarding browser-based access to the legacy Altium Vault Server. For browser-based

More information

Getting started with Ms Access Getting Started. Primary Key Composite Key Foreign Key

Getting started with Ms Access Getting Started. Primary Key Composite Key Foreign Key Getting started with Ms Access 2007 Getting Started Customize Microsoft Office Toolbar The Ribbon Quick Access Toolbar Navigation Tabbed Document Window Viewing Primary Key Composite Key Foreign Key Table

More information

Course Outline and Objectives: Database Programming with SQL

Course Outline and Objectives: Database Programming with SQL Introduction to Computer Science and Business Course Outline and Objectives: Database Programming with SQL This is the second portion of the Database Design and Programming with SQL course. In this portion,

More information