Joining Tables with SQL: The most important econometrics lesson you may ever learn


Matt Bogard, Western Kentucky University
From the SelectedWorks of Matt Bogard
Summer June, 2015

Introduction

Students of econometrics often spend their days learning proofs and theorems, and if they are lucky they get their hands on some data and software to practice applied work, whether for a class project or as part of a thesis or dissertation. I have written before about the large gap between theoretical and applied econometrics, but there is another gap to speak of, and it has nothing to do with the theoretical properties of estimators or with interpreting output from STATA, SAS or R. It has to do with raw coding, hacking, and data manipulation skills: the ability to tease out relevant observations and measures from large structured transactional databases as well as from unstructured log files or tweetstreams.

This gap becomes more of an issue as econometricians move from academic environments to corporate environments, especially for those economists who take on roles as data scientists. In these environments, not only do problems fail to fit the standard textbook solutions (see the article Applied Econometrics), but the data does not exist in a form anything like the simple data sets used in textbooks. You cannot always expect your IT people to dump a flat file with all the variables and formats that will work for your research project. In fact, the best you might hope for in many environments is a SQL or Oracle database with hundreds or thousands of tables, with the tiny bits of information you need spread across a number of them. How do you bring all of this information together to do an analysis?

This can be complicated, but for the uninitiated I will present some toy examples to give a feel for executing basic database queries that bring together pieces of information housed in separate tables to produce a toy analytics-ready data set.
We'll use SQL (Structured Query Language) commands in SAS via PROC SQL. Something similar could be done using the sqldf package in R, or other tools.

The Business Scenario

Suppose you work for a seed company that has a data warehouse housing all of its customer data, as well as data about its products (hybrids) and technology*. The company has also implemented a big-data IoT (internet of things) project in which customers who plant its hybrids upload their yields via a web app. This is very important, because now, instead of limiting marketing and R&D work to expensive, planned field experiments, you can tap terabytes of observational data on product performance.

You are an econometrician just hired as a data scientist to analyze this data. But you find that it is spread across four different tables in the corporate enterprise data warehouse. (Here we ignore the many challenges related to schemas, cardinality, etc., and more advanced architectures like Hadoop that are commonly used in big data applications.) This is one of the most common ways that data is stored in business and analytic environments.

The CUSTOMER table has demographic information, history, customer IDs, etc. We will consider just the number of planted acres, but this table could in theory include a lot of other important information we would want to consider in an analysis.

    ID  ACRES
    1   1800
    2   1970
    3   980
    4   960
    5   970
    6   1500
    7   700
    8   2500
    9   2980

The PRODUCT table tells which products the customer bought, tying product to customer ID.

    ID  TYPE
    1   G8590
    2   G8590
    3   G8484
    4   P8787
    5   P8787
    6   G8590
    7   G8590
    8   P8787
    9   G8590

The TECH or technology table contains each product sold by the company and the technology or particular trait associated with that product. For each product type in this table there is a corresponding trait or technology (again a big oversimplification of actual corn hybrid data).

    TYPE   TRAIT
    G8590  BT
    G8484  RR
    P8787  RW

Suppose that you store all of the uploaded web-app yield data in a table called CUSTOMER_YIELD.

    ID  YIELD
    1   160
    2   165
    3   180
    4   200
    5   175
    6   149
    7   168
    8   300
    9   170
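Before writing any joins, it can help to see which columns the four tables above have in common, since those shared columns are the candidate join keys. The following is a minimal sketch, not part of the original SAS workflow: it assumes Python's standard-library sqlite3 module as a stand-in database, and the column types (INTEGER, TEXT) are assumptions made only for illustration. The table and column names are the ones described above.

```python
import sqlite3
from itertools import combinations

# In-memory SQLite database standing in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer       (id INTEGER, acres INTEGER);
CREATE TABLE product        (id INTEGER, type TEXT);
CREATE TABLE tech           (type TEXT, trait TEXT);
CREATE TABLE customer_yield (id INTEGER, yield INTEGER);
""")

tables = ["customer", "product", "tech", "customer_yield"]

# PRAGMA table_info returns one row per column; the column name is field 1.
columns = {
    t: {row[1] for row in conn.execute(f"PRAGMA table_info({t})")}
    for t in tables
}

# Record the shared columns for every pair of tables that has any.
shared = {}
for left, right in combinations(tables, 2):
    common = columns[left] & columns[right]
    if common:
        shared[(left, right)] = sorted(common)

for pair, keys in shared.items():
    print(pair, "->", keys)
```

In this toy schema, CUSTOMER, PRODUCT, and CUSTOMER_YIELD share `id`, while PRODUCT and TECH share `type`; CUSTOMER and TECH have no column in common, so they can only be linked by going through PRODUCT.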

Using SQL to Create an Analytics Ready Data Source

In SAS, the PROC SQL procedure allows you to bring these disparate sources of data together. If you look at the tables closely, you should see that you can link the CUSTOMER table to the PRODUCT table by matching on the variable ID, which appears in each table. If we simply want to keep all of the existing data (all of the customers) in the CUSTOMER table and add in the products they bought (from the PRODUCT table), then we execute what is referred to as a LEFT JOIN, using the following code in SAS to create a new combined table called TEMP1_ADD_PRODUCT.

    PROC SQL;
      CREATE TABLE TEMP1_ADD_PRODUCT AS
      SELECT A.*, B.TYPE
      FROM CUSTOMER A
      LEFT JOIN PRODUCT B
        ON A.ID = B.ID;
    QUIT;

R:

    library(sqldf)  # required sqldf library
    temp1_add_product <- sqldf('select a.*, b.type
                                from customer a
                                left join product b on a.id = b.id')

Python:

    import pandas as pd  # package for data manipulation
    temp1_add_product = pd.merge(customer, product[['type', 'id']],
                                 on='id', how='left')

The SELECT statement tells the procedure which variables from each table to select for the new data set. Each table is referenced by an alias: A is the designated alias for the CUSTOMER table, while B is the designated alias for the PRODUCT table. The reference A.* indicates that we want to select all of the variables in the data set associated with A. The reference B.TYPE indicates that we are only interested in adding the TYPE variable from the PRODUCT table. (In more complex real-world applications, tables can contain numerous variables and we often want only certain key variables from each table.) We tell the procedure to get the data FROM the CUSTOMER table (designated with alias A) and execute a LEFT JOIN with the PRODUCT table (designated with alias B). We tell the procedure to join the two tables based ON the ID variable in each respective table.
(Variables like this that link information between different tables are often called keys.) The syntax is different in Python using pandas, but the logic is analogous. The output data set is below:

    ID  ACRES  TYPE
    1   1800   G8590
    2   1970   G8590
    3   980    G8484
    4   960    P8787
    5   970    P8787
    6   1500   G8590
    7   700    G8590
    8   2500   P8787
    9   2980   G8590

Now we have customer demographics (ID and acres in this simplified example) combined with the product or variety of seed they planted. Next we want to determine what kind of technology or genetic trait is associated with each type of seed or variety they purchased. We can do this by executing another LEFT JOIN between TEMP1_ADD_PRODUCT and the data in the TECH table. Notice that this time the common key linking the product to its associated technology is the variable TYPE (which we just added when we created TEMP1_ADD_PRODUCT). The result of this next join is a table we will call TEMP2_ADD_TECH.

    PROC SQL;
      CREATE TABLE TEMP2_ADD_TECH AS
      SELECT A.*, B.TRAIT
      FROM TEMP1_ADD_PRODUCT A
      LEFT JOIN TECH B
        ON A.TYPE = B.TYPE;
    QUIT;

R:

    temp2_add_tech <- sqldf('select a.*, b.trait
                             from temp1_add_product a
                             left join tech b on a.type = b.type')

Python:

    temp2_add_tech = pd.merge(temp1_add_product, tech[['type', 'trait']],
                              on='type', how='left')

    ID  ACRES  TYPE   TRAIT
    3   980    G8484  RR
    1   1800   G8590  BT
    2   1970   G8590  BT
    6   1500   G8590  BT
    7   700    G8590  BT
    9   2980   G8590  BT
    4   960    P8787  RW
    5   970    P8787  RW
    8   2500   P8787  RW
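Both joins above are LEFT JOINs, which keep every row of the left-hand table even when the right-hand table has no match. In the toy data every customer has a product, so the difference from an inner join never shows up. The sketch below makes it visible by adding a hypothetical customer, ID 10, who has no row in PRODUCT; that customer is an assumption for illustration only and is not in the original data. It uses Python's standard-library sqlite3 module rather than SAS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER, acres INTEGER);
CREATE TABLE product  (id INTEGER, type TEXT);
-- Customers 1 and 3 come from the toy data; customer 10 is hypothetical.
INSERT INTO customer VALUES (1, 1800), (3, 980), (10, 1200);
INSERT INTO product  VALUES (1, 'G8590'), (3, 'G8484');
""")

# LEFT JOIN: every customer is kept; unmatched rows get NULL (None) for type.
left_rows = conn.execute("""
    SELECT a.id, a.acres, b.type
    FROM customer a
    LEFT JOIN product b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

# INNER JOIN: only customers with a matching product row survive.
inner_rows = conn.execute("""
    SELECT a.id, a.acres, b.type
    FROM customer a
    INNER JOIN product b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

print(left_rows)   # [(1, 1800, 'G8590'), (3, 980, 'G8484'), (10, 1200, None)]
print(inner_rows)  # [(1, 1800, 'G8590'), (3, 980, 'G8484')]
```

The choice matters for analysis: an inner join would silently drop customer 10, while the left join keeps the customer and flags the missing product as a null value that can be examined later.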

Finally, we want to get the yield data for each customer's selected variety (product) and the associated trait or technology. Since customers upload their yield data via a web or mobile app, each yield is associated with a customer ID, so we can join the yield data on the common key ID. (A big simplification in this example is that a customer has only one yield data point; in real-world applications customers could have multiple farms and fields, with multiple data points within each field.)

    PROC SQL;
      CREATE TABLE TEMP3_ADD_YIELD AS
      SELECT A.*, B.YIELD
      FROM TEMP2_ADD_TECH A
      LEFT JOIN CUSTOMER_YIELD B
        ON A.ID = B.ID;
    QUIT;

R:

    temp3_add_yield <- sqldf('select a.*, b.yield
                              from temp2_add_tech a
                              left join customer_yield b on a.id = b.id')

Python:

    temp3_add_yield = pd.merge(temp2_add_tech, customer_yield[['id', 'yield']],
                               on='id', how='left')

    ID  ACRES  TYPE   TRAIT  YIELD
    1   1800   G8590  BT     160
    2   1970   G8590  BT     165
    3   980    G8484  RR     180
    4   960    P8787  RW     200
    5   970    P8787  RW     175
    6   1500   G8590  BT     149
    7   700    G8590  BT     168
    8   2500   P8787  RW     300
    9   2980   G8590  BT     170

Now we have an analytics-ready data set that we could use to analyze differences in yield by product TYPE. In each case we added the specific information we required from specific tables, building out the final data set. We did this in a number of separate SQL statements executed in SAS. With each additional join, we created an intermediate data set. For illustrative purposes, or with small data sets, this approach is fine. In more realistic applications, where each intermediate table might consist of millions of rows of data, we would want to be more efficient. The following block of SAS code completes all of the joins and creates the final data set at once, rather than creating a number of temporary intermediate data sets.

    PROC SQL;
      CREATE TABLE TEMP1_ADD_ALL AS
      SELECT A.*, B.TYPE, C.TRAIT, D.YIELD
      FROM CUSTOMER A
      LEFT JOIN PRODUCT B
        ON A.ID = B.ID
      LEFT JOIN TECH C
        ON B.TYPE = C.TYPE
      LEFT JOIN CUSTOMER_YIELD D
        ON A.ID = D.ID;
    QUIT;

R:

    temp1_add_all <- sqldf('select a.*, b.type, c.trait, d.yield
                            from customer a
                            left join product b on a.id = b.id
                            left join tech c on b.type = c.type
                            left join customer_yield d on a.id = d.id')

Python:

    # chain the three left joins instead of nesting pd.merge calls
    temp1_add_all = (
        customer
        .merge(product[['type', 'id']], on='id', how='left')
        .merge(tech[['type', 'trait']], on='type', how='left')
        .merge(customer_yield[['id', 'yield']], on='id', how='left')
    )

Conclusion

My goal is not to teach anyone to be a SQL programmer, but simply to introduce the paradigm of transactional databases and what it takes to derive a data set suitable for analysis in a non-academic setting. For further reading about big data applications and econometrics, see below.

Further Reading:

Applied Econometrics
Economists as Data Scientists
Econometrics and Big Data
Is machine learning trending with economists?
Big data, John Deere, and the internet of things.

The Data Science Venn Diagram
Big Ag Meets Big Data

Notes:

1) Bt refers to a technology or genetic trait in plants that allows them to express Bt proteins, which are toxic to certain pests.
2) RR refers to a technology or genetic trait in plants that allows them to be resistant to the herbicide Roundup.

Appendix: SAS, R and Python Code for Building Demo Data Mart

SAS Code:

    * SET UP TOY CUSTOMER DATA BASE;
    DATA CUSTOMER;
    INPUT ID ACRES;
    CARDS;
    1 1800
    2 1970
    3 980
    4 960
    5 970
    6 1500
    7 700
    8 2500
    9 2980
    ;
    RUN;

    DATA PRODUCT;
    INPUT ID TYPE $;
    CARDS;
    1 G8590
    2 G8590
    3 G8484
    4 P8787
    5 P8787
    6 G8590
    7 G8590
    8 P8787
    9 G8590
    ;
    RUN;

    DATA TECH;
    INPUT TYPE $ TRAIT $;
    CARDS;
    G8590 BT
    G8484 RR
    P8787 RW
    ;
    RUN;

    DATA CUSTOMER_YIELD;
    INPUT ID YIELD;
    CARDS;
    1 160
    2 165
    3 180
    4 200
    5 175
    6 149
    7 168
    8 300
    9 170
    ;
    RUN;

R Code:

    # generate data fields
    id <- c(1,2,3,4,5,6,7,8,9)
    acres <- c(1800,1970,980,960,970,1500,700,2500,2980)
    type <- c('g8590','g8590','g8484','p8787','p8787','g8590','g8590','p8787','g8590')
    trait <- c('bt','rr','rw')
    yield <- c(160,165,180,200,175,149,168,300,170)

    # generate fact tables
    customer <- cbind.data.frame(id,acres)
    product <- cbind.data.frame(id,type)
    customer_yield <- cbind.data.frame(id,yield)

    # generate special tech lookup table
    type2 <- c('g8590','g8484','p8787')
    tech <- cbind.data.frame(type2,trait)
    tech$type <- tech$type2
    tech$type2 <- NULL

Python:

    import pandas as pd  # package for data manipulation

    # create customer table
    data = {'id': [1,2,3,4,5,6,7,8,9],
            'acres': [1800,1970,980,960,970,1500,700,2500,2980]}
    customer = pd.DataFrame(data, columns=['id','acres'])

    # create product table (type codes use the same case as the tech table
    # so that the join on 'type' matches)
    data = {'id': [1,2,3,4,5,6,7,8,9],
            'type': ['g8590','g8590','g8484','p8787','p8787',
                     'g8590','g8590','p8787','g8590']}
    product = pd.DataFrame(data, columns=['id','type'])

    # create tech table
    data = {'type': ['g8590','g8484','p8787'], 'trait': ['bt','rr','rw']}
    tech = pd.DataFrame(data, columns=['type','trait'])

    # create customer yield table
    data = {'id': [1,2,3,4,5,6,7,8,9],
            'yield': [160,165,180,200,175,149,168,300,170]}
    customer_yield = pd.DataFrame(data, columns=['id','yield'])
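As a sanity check on the demo data mart, the whole pipeline can also be run end to end without SAS, R, or pandas. The sketch below is not part of the original appendix. It assumes Python's standard-library sqlite3 module as a stand-in SQL engine, rebuilds the same four toy tables, runs the single multi-join query, and then summarizes yield by TYPE and TRAIT as one example of the kind of analysis the final data set supports. The column aliases and the summary query are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer       (id INTEGER, acres INTEGER);
CREATE TABLE product        (id INTEGER, type TEXT);
CREATE TABLE tech           (type TEXT, trait TEXT);
CREATE TABLE customer_yield (id INTEGER, yield INTEGER);
""")

# Toy values from the appendix data.
ids = list(range(1, 10))
acres = [1800, 1970, 980, 960, 970, 1500, 700, 2500, 2980]
types = ["G8590", "G8590", "G8484", "P8787", "P8787",
         "G8590", "G8590", "P8787", "G8590"]
yields = [160, 165, 180, 200, 175, 149, 168, 300, 170]

conn.executemany("INSERT INTO customer VALUES (?, ?)", zip(ids, acres))
conn.executemany("INSERT INTO product VALUES (?, ?)", zip(ids, types))
conn.executemany("INSERT INTO tech VALUES (?, ?)",
                 [("G8590", "BT"), ("G8484", "RR"), ("P8787", "RW")])
conn.executemany("INSERT INTO customer_yield VALUES (?, ?)", zip(ids, yields))

# All three left joins in one statement, mirroring TEMP1_ADD_ALL.
full_join_sql = """
    SELECT a.id AS id, a.acres AS acres, b.type AS type,
           c.trait AS trait, d.yield AS yield
    FROM customer a
    LEFT JOIN product b ON a.id = b.id
    LEFT JOIN tech c ON b.type = c.type
    LEFT JOIN customer_yield d ON a.id = d.id
"""
rows = conn.execute(full_join_sql + " ORDER BY a.id").fetchall()

# Example analysis: customer count and mean yield per product type.
summary = conn.execute(f"""
    SELECT type, trait, COUNT(*) AS n, AVG(yield) AS mean_yield
    FROM ({full_join_sql})
    GROUP BY type, trait
    ORDER BY type
""").fetchall()

for row in rows:
    print(row)
for row in summary:
    print(row)
```

The joined result has one row per customer (nine rows), matching the final table shown earlier, and the summary gives one row per product type with its trait, customer count, and mean yield.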
