Teradata. This was compiled in order to describe Teradata and provide a brief overview of common capabilities and queries.


Lisa Norris
5 years ago

What is it?

Teradata is a Big Data tool that lets users work quickly and efficiently with extremely large data sets. It is a relational database that can store billions of rows and petabytes of data (1 petabyte = 1000 terabytes). The system's architecture provides the flexibility to access and process that data quickly, and Teradata differs from conventional database systems in everything from its architecture to its processing speed. Large corporations such as global insurance companies use Teradata to store customer and client information because they have so much of it to process. Demand for the system is high because it scales easily and is fault tolerant.

What are the main components of a Teradata system?

Teradata has 3 main components:

1. PE (Parsing Engine): Acts as the gatekeeper to the Teradata system. It manages all sessions, checks SQL statements for errors, manages user access rights, has the optimizer produce the least expensive plan for the query, and sends the request to the AMPs via the BYNET.
2. Message Passing Layer (BYNET): Carries messages between the AMPs and PEs, provides point-to-point and broadcast communication, merges answer sets back to the PE, and makes Teradata parallelism possible.
3. AMP (Access Module Processor): The heart of Teradata, which performs most data storage and retrieval operations. It also finds the requested rows, manages locks on tables and rows, sorts rows, aggregates columns, processes joins, and so on.

What are Primary Index and Primary Key in Teradata?

Unlike other database systems, Teradata distributes the data based on the PI (Primary Index).
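As a minimal sketch of how a PI is declared, the statement below creates a table whose rows are distributed by one column. The table and column names are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table; names are illustrative only.
CREATE TABLE employee (
    empid   INTEGER NOT NULL,
    empname VARCHAR(50),
    deptid  INTEGER
)
PRIMARY INDEX (empid);   -- rows are distributed by the hash of empid
```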
The PI is defined at table creation, and the database automatically takes the first column as the PI if none is specified explicitly. Because data distribution is based on the PI, it is wise to choose a PI that distributes the data evenly among the AMPs. For example, suppose Table A has the two columns below and the system has 5 AMPs.

ID   Gender
1    Male
2    Male
3    Male
4    Male
5    Female

If we choose ID as the PI, the values are all distinct, so the 5 rows are distributed evenly across all 5 AMPs. If GENDER is chosen as the PI, there are only 2 distinct values, so the data is stored on only 2 AMPs, leaving the other 3 AMPs empty and idle.

Note: Rows with the same PI value are stored on the same AMP.

A Primary Key is a concept that uniquely identifies a particular row of a table.

What are the types of PI (Primary Index) in Teradata?

There are two types of Primary Index: Unique Primary Index (UPI) and Non-Unique Primary Index (NUPI). By default, a NUPI is created when the table is created; the UNIQUE keyword has to be given explicitly to create a UPI. A UPI can sometimes slow performance, because the uniqueness of the column value has to be checked for every row, which is additional overhead for the system, but the distribution of data will be even. Care should be taken when choosing a NUPI so that the distribution of data is almost even. The UPI/NUPI decision should be based on the data and its usage.

How to choose a Primary Index (PI) in Teradata?

Choosing a Primary Index is based on the data distribution and the join frequency of the column. If a column is used for joining most of the tables, it is a good PI candidate, provided it also distributes the data well. For example, suppose an Employee table with EMPID and DEPTID needs to be joined to the Department table on DEPTID. Choosing DEPTID as the PI of the Employee table is not a wise decision: the Employee table will have thousands of employees, whereas the number of departments in a company will be less than 100. Choosing EMPID will therefore give better performance in terms of distribution.

How is the data distributed among AMPs based on the PI in Teradata?

Assume a row is to be inserted into a Teradata table:

- The Primary Index value for the row is put into the hash algorithm.
- The output is a 32-bit Row Hash.
- The Row Hash points to a bucket in the Hash Map. The first 16 bits of the Row Hash are used to locate the bucket.
- The bucket points to a specific AMP.
- The row, along with the Row Hash, is delivered to that AMP.

When the AMP receives a row, it places the row into the proper table and checks whether it has any other rows in the table with the same row hash. If this is the first row with this particular row hash, the AMP assigns a 32-bit uniqueness value of 1.
If this is the second row with that particular row hash, the AMP assigns a uniqueness value of 2. The 32-bit row hash and the 32-bit uniqueness value make up the 64-bit Row ID. The Row ID is how tables are sorted on an AMP. The uniqueness value is useful with NUPIs, to distinguish rows that share the same NUPI value. Access by either UPI or NUPI is always a one-AMP operation, because the same values are stored on the same AMP.

How does Teradata retrieve a row?

For example, a user runs a query looking for information on Employee ID 100. The PE sees that the Primary Index value (EMP) is used in the SQL WHERE clause. Because this is a Primary Index access operation, the PE knows it is a one-AMP operation. The PE hashes 100, and the Row Hash points to a bucket in the Hash Map that represents AMP X. AMP X is sent a message to get the row with that Row Hash and make sure it's EMP 100.

What are Secondary Indexes (SI), the types of SI, and the disadvantages of Secondary Indexes in Teradata?

Secondary Indexes provide another path to access data. Teradata allows up to 32 secondary indexes per table. Keep in mind that defining a secondary index does not redistribute the rows of the table. Secondary indexes reside in a subtable and are stored on all AMPs, which is very different from how the primary index (part of the base table) is stored. Keep in mind that Secondary Indexes, when defined, take up additional space.
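A secondary index is added to (or removed from) an existing table with its own statement, independently of the primary index. The sketch below assumes a hypothetical employee table with badge_no and deptid columns; those names are not from the original text.

```sql
-- Unique secondary index on a column whose values never repeat (hypothetical column).
CREATE UNIQUE INDEX (badge_no) ON employee;

-- Non-unique secondary index on a column with repeating values.
CREATE INDEX (deptid) ON employee;

-- A secondary index can be dropped again without touching the base table rows.
DROP INDEX (deptid) ON employee;
```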

Secondary Indexes are frequently used in a WHERE clause. A Secondary Index can be changed or dropped at any time. However, because of the overhead of index maintenance, it is recommended that index values not be changed frequently.

There are two types of Secondary Index: Unique Secondary Index (USI) and Non-Unique Secondary Index (NUSI). Unique Secondary Indexes are extremely efficient. USI access is considered a two-AMP operation: one AMP is used to access the USI subtable row (in the Secondary Index subtable) that references the actual data row, which resides on the second AMP. NUSI access is an all-AMP operation and will usually require a spool file. Although it is an all-AMP operation, it is faster than a full table scan.

Secondary indexes can be useful for:

- Satisfying complex conditions
- Processing aggregates
- Value comparisons
- Matching character combinations
- Joining tables

How is the data distributed in Secondary Index subtables in Teradata?

When a user creates a Secondary Index, Teradata automatically creates a Secondary Index subtable. The subtable contains the:

- Secondary Index value
- Secondary Index Row ID
- Primary Index Row ID

When a user writes an SQL query that has an SI in the WHERE clause, the Parsing Engine hashes the Secondary Index value. The output is the Row Hash, which points to a bucket in the Hash Map. That bucket contains an AMP number, so the Parsing Engine knows which AMP holds the Secondary Index subtable row for the requested USI value. The PE directs that AMP to look up the Row Hash in the subtable. The AMP checks whether the Row Hash exists in the subtable and double-checks the subtable row against the actual secondary index value. The AMP then passes the Primary Index Row ID back up the BYNET. The request is directed to the AMP holding the base table row, which is then easily retrieved.

What are the types of JOINs available in Teradata?
The types of JOINs are: Inner Join, Outer Join (Left, Right, and Full), Self Join, Cross Join and Cartesian Join.

The key things to know about Teradata and joins:

- Each AMP holds a portion of a table.
- Teradata uses the Primary Index to distribute the rows among the AMPs.
- Each AMP keeps its tables separated from other tables, like someone might keep clothes in separate dresser drawers.
- Each AMP sorts its tables by Row ID.
- For a JOIN to take place, the two rows being joined must find a way to get to the same AMP.
- If the rows to be joined are not on the same AMP, Teradata will either redistribute the data or duplicate the data in spool to make that happen.

What are the types of Join Strategies available in Teradata?

Join strategies are used by the optimizer to choose the best plan to join tables based on the given join condition:

- Merge (Exclusion)
- Nested
- Row Hash
- Product (including Cartesian Product joins)

There are different types of merge join strategies. In general, when two tables are joined, the data is redistributed or duplicated across all AMPs to make sure the joining rows are on the same AMP. If the two tables are joined on their PIs, no redistribution or duplication happens, because the rows are already on the same AMP, and performance is better. If the PI of one table is used but not the PI of the other, one table is redistributed or duplicated, depending on the table sizes. In these cases Secondary Indexes can help.

Explain the types of redistribution of data that happen when joining columns from two tables in Teradata.

- Case 1: PI = PI joins
- Case 2: PI = non-index joins
- Case 3: non-index = non-index joins

Case 1: There is no redistribution of data across AMPs. The joins are AMP-local, because the data is already on the same AMP and need not be redistributed. These joins on a unique primary index are very fast.

Case 2: Data from the second table is redistributed across all AMPs, because the join is between a PI column and a non-index column. The ideal scenario is when the small table is redistributed to be joined with the large table's rows on the same AMP.

Case 3: Data from both tables is redistributed across all AMPs. These are among the longest-running queries; care should be taken to see that stats are collected on these columns.

What is a Partitioned Primary Index (PPI) in Teradata?

A partitioned primary index physically splits the table into a series of subtables, one for every partitioning value. When a single row is accessed, Teradata looks first at the partitioning value to determine the subtable, then at the primary index to calculate the row hash for the row(s).
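A minimal sketch of a PPI table partitioned by month is shown here. The table, columns and date range are hypothetical; RANGE_N is the partitioning function typically used for date ranges.

```sql
-- Hypothetical sales table partitioned into one partition per month of 2023.
CREATE TABLE sales_ppi (
    sale_id   INTEGER NOT NULL,
    sale_date DATE NOT NULL,
    amount    DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (
    sale_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
    EACH INTERVAL '1' MONTH
);

-- A range condition on the partitioning column lets the optimizer
-- read only the March partition instead of scanning the whole table.
SELECT sale_id, amount
FROM sales_ppi
WHERE sale_date BETWEEN DATE '2023-03-01' AND DATE '2023-03-31';
```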
For example, if we have a PPI on a MONTH column, the rows of a particular month are all stored within the same partition, so data for a particular month is retrieved faster. This helps to avoid full table scans.

What are the advantages and disadvantages of PPI in Teradata?

Advantages:

- Range queries don't have to use a Full Table Scan (FTS).
- Deletions of entire partitions are lightning fast.
- PPI provides an excellent alternative to Secondary Indexes.
- Tables that hold yearly information don't have to be split into 12 smaller tables to avoid Full Table Scans. This can make modeling and querying easier.
- FastLoad and MultiLoad work with PPI tables, but not with all Secondary Indexes.

Disadvantages:

- A two-byte partition number is added to the Row ID, which is then called a Row Key. The two bytes per row add to the Perm Space used by the table.
- Joins to non-partitioned tables can take longer and become more complicated for Teradata to perform.
- Basic SELECT queries using the Primary Index can take longer if the partition number is not also mentioned in the WHERE clause of the query.

- You can't have a Unique Primary Index (UPI) if the partitioning column is not at least part of the Primary Index. You must therefore create a Unique Secondary Index to maintain uniqueness.

Volatile and Global Temporary Tables in Teradata?

Volatile tables are temporary tables that are materialized in spool and are unknown to the Data Dictionary. A volatile table may be used multiple times and in more than one SQL statement throughout the life of a session. This allows additional queries to use the same rows in the temporary table without the rows having to be rebuilt. Volatile tables are local to the session, and they are dropped once the session is disconnected. The ON COMMIT PRESERVE ROWS option should be specified at table creation. It means that at the end of a transaction the rows in the volatile table are not deleted, so the information in the table remains for the entire session. Users can query the volatile table until they log off; then the table and its data go away.

Global Temporary Tables are similar to volatile tables in that they are local to a user's session. However, when the table is created, the definition is stored in the Data Dictionary. In addition, these tables are materialized in a permanent area known as Temporary Space. For these reasons, a global temporary table's definition survives a system restart and is not discarded at the end of the session. However, when the system restarts, the rows inside the Global Temporary Table are removed. Lastly, global temporary tables require no spool space; they use Temp Space.

Statistics can be collected on both kinds of table as of the TD13 version. Previously, collecting stats on volatile tables was not allowed.

Sub Query and Correlated Sub Query in Teradata?

Subqueries and correlated subqueries are two important concepts in Teradata and are used very often.
The basic concept behind a subquery is that it retrieves a list of values that are compared against one or more columns in the main query. The subquery is executed first, and the main query is then executed based on its result set. For example:

    SELECT empname, deptname
    FROM employee
    WHERE empid IN (SELECT empid FROM salarytable WHERE salary > 10000);

In the above query, the empid values are chosen first by the subquery based on salary, and the main query is executed against that result subset.

A correlated subquery is an excellent technique when there is a need to determine which rows to SELECT based on one or more values from another table. It combines subquery processing and join processing in a single request. It first reads a row from the main query and then goes into the subquery to find the rows that match the specified column value. Then it moves on to the next row from the main query. This process continues until all the qualifying rows from the main query have been processed. For example:

    SELECT empname, deptno, salary
    FROM employeetable AS emp
    WHERE salary = (SELECT MAX(salary)
                    FROM employeetable AS emt
                    WHERE emt.deptno = emp.deptno);

The above query returns the highest-paid employee from each department. This is also one of the scenario-based questions in Teradata.
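For comparison, the same "highest paid per department" result can usually be written as a join to an aggregated derived table. This is only a sketch of an alternative formulation, reusing the table and column names from the example above; the alias dept_max is an assumption introduced here.

```sql
-- Join each employee to the maximum salary of their own department.
SELECT emp.empname, emp.deptno, emp.salary
FROM employeetable AS emp
INNER JOIN (
    SELECT deptno, MAX(salary) AS max_salary
    FROM employeetable
    GROUP BY deptno
) AS dept_max
  ON  emp.deptno = dept_max.deptno
  AND emp.salary = dept_max.max_salary;
```

Comparing the EXPLAIN output of both forms is a reasonable way to see which one the optimizer handles better on a given system.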

How to calculate the table size, database size and free space left in a database in Teradata?

DBC.TABLESIZE and DBC.DISKSPACE are the system views used to find the space occupied. The query below gives the size in GB of a table in a database; dropping the TABLENAME condition lists every table, which is useful for finding the big tables when space has to be recovered. DATABASE_NAME and TABLE_NAME are placeholders.

    SELECT DATABASENAME, TABLENAME,
           SUM(CURRENTPERM/(1024*1024*1024)) AS "TABLE SIZE"
    FROM DBC.TABLESIZE
    WHERE DATABASENAME = 'DATABASE_NAME'
      AND TABLENAME = 'TABLE_NAME'
    GROUP BY 1,2;

The query below gives the total space and free space available in a database.

    SELECT DATABASENAME DATABASE_NAME,
           SUM(MAXPERM)/(1024*1024*1024) TOTAL_PERM_SPACE,
           SUM(CURRENTPERM)/(1024*1024*1024) CURRENT_PERM_SPACE,
           TOTAL_PERM_SPACE - CURRENT_PERM_SPACE AS FREE_SPACE
    FROM DBC.DISKSPACE
    WHERE DATABASENAME = 'DATABASE_NAME'
    GROUP BY 1;

What are the performance improvement techniques available in Teradata?

- First of all, use the EXPLAIN plan to see how the query will be executed. Phrases such as "product join" and "low confidence" are indicators of poor performance.
- Make sure STATS are collected on the columns used in the WHERE clause and on the JOIN columns. If STATS are collected, the EXPLAIN plan will show HIGH CONFIDENCE. Stats tell the optimizer the number of rows in the table, which helps it choose between redistribution and duplication of smaller tables.
- Check whether the joining columns and the WHERE clause use a PI, SI or PPI.
- Check whether proper alias names are used in the joining conditions.
- Split the query into smaller subsets in case of poor performance.

What do Pseudo Table Locks mean in the EXPLAIN plan in Teradata?

A pseudo table lock is a false lock applied on the table to prevent two users from getting conflicting locks with all-AMP requests. The PE designates a particular AMP to manage all-AMP lock requests for a given table and puts a pseudo lock on the table.

Can you compress a column which is already present in a table using ALTER in Teradata?
No, we cannot use the ALTER command to compress existing columns in a table. A new table structure has to be created that includes the compression values, and the data should then be inserted into the table with the compressed column.
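The workaround described above might look like the following sketch. The table names, columns and compression values are hypothetical, and the final swap of table names is one possible way to finish, not a step stated in the original text.

```sql
-- New table with multi-value compression on the gender column (hypothetical names).
CREATE TABLE employee_new (
    empid  INTEGER NOT NULL,
    gender CHAR(6) COMPRESS ('Male', 'Female')
)
PRIMARY INDEX (empid);

-- Copy the existing rows into the compressed structure.
INSERT INTO employee_new (empid, gender)
SELECT empid, gender
FROM employee;

-- Optionally swap the tables once the copy has been verified.
DROP TABLE employee;
RENAME TABLE employee_new TO employee;
```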

Please note: ALTER can be used only to add new columns with compression values to a table.

How to create a table with the existing structure of another table, with or without data, and also with stats defined, in Teradata?

    CREATE TABLE new_table AS old_table WITH DATA;
    CREATE TABLE new_table AS old_table WITH NO DATA;
    CREATE TABLE new_table AS old_table WITH DATA AND STATS;

How to find the duplicate rows in a table in Teradata?

Group by the fields in question and add a HAVING condition requiring a count greater than 1. For example:

    SELECT name, COUNT(*) FROM EMPLOYEE GROUP BY name HAVING COUNT(*) > 1;

DISTINCT is also useful: if the DISTINCT count and COUNT(*) return the same number, there are no duplicates.

Which is more efficient, GROUP BY or DISTINCT, to find duplicates in Teradata?

With many duplicates GROUP BY is more efficient, while with few duplicates DISTINCT is more efficient.

What is the difference between TIMESTAMP(0) and TIMESTAMP(6) in Teradata?

Both hold date and time values. The major difference is that TIMESTAMP(6) also holds microseconds.

What is spool space, and if a running job reaches the maximum spool space, how do you solve the problem in Teradata?

Spool space is the space a query requires for processing or for holding the rows in the answer set. The spool limit is typically reached when the query is not properly optimized. Use appropriate conditions in the WHERE clause and JOIN on the correct columns to optimize the query. Also make sure unnecessary volatile tables are dropped, as they occupy spool space.

Why does VARCHAR occupy 2 extra bytes?

The two bytes hold the binary length of the field, that is, the exact number of characters stored in the VARCHAR.

What is the difference between a User and a Database in Teradata?
- A user is a database with a password; a database cannot have a password.
- Both can contain tables, views and macros.
- Both users and databases may or may not hold privileges.
- Only users can log in, establish a session with the Teradata database, and submit requests.

What are the types of HASH functions used in Teradata?

The hash functions are HASHROW, HASHBUCKET, HASHAMP and HASHBAKAMP. Their SQL forms are:

- HASHROW (column(s))
- HASHBUCKET (hashrow)
- HASHAMP (hashbucket)
- HASHBAKAMP (hashbucket)

To find the data distribution of a table based on its PI, the query below is helpful. It gives the number of records on each AMP for that particular table.

    SELECT HASHAMP(HASHBUCKET(HASHROW(PI_COLUMN))), COUNT(*) FROM TABLENAME GROUP BY 1;
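The same functions can also be used on a single value or on a column that is only a candidate for the PI. The sketch below assumes a hypothetical employee table with a deptid column; the value 100 is arbitrary.

```sql
-- Where would a single PI value land? (100 is an arbitrary example value.)
SELECT HASHROW(100)                      AS row_hash,
       HASHBUCKET(HASHROW(100))          AS hash_bucket,
       HASHAMP(HASHBUCKET(HASHROW(100))) AS primary_amp;

-- HASHAMP() with no argument returns the highest AMP number,
-- so adding 1 gives the number of AMPs in the system.
SELECT HASHAMP() + 1 AS amp_count;

-- Check how a candidate PI column would spread the rows before committing to it.
SELECT HASHAMP(HASHBUCKET(HASHROW(deptid))) AS amp_no,
       COUNT(*)                             AS row_count
FROM employee
GROUP BY 1
ORDER BY 2 DESC;
```

A large gap between the highest and lowest row_count suggests the column would skew the data and is a poor PI choice.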
