MEIC 2015/2016, 1st semester
Data Analysis and Integration
Lab 5: Working with databases
IST/DEI

Installing MySQL

1. Download MySQL Community Server for your operating system.

   For Windows, use one of the following links:
   - Full installer (370.5 MB)
     http://dev.mysql.com/get/downloads/mysqlinstaller/mysql-installer-community-5.7.9.0.msi
   - Web installer (1.6 MB)
     http://dev.mysql.com/get/downloads/mysqlinstaller/mysql-installer-web-community-5.7.9.0.msi

   For Mac OS X, use one of the following links:
   - Mac OS X 10.10 (Yosemite)
     http://dev.mysql.com/get/downloads/mysql-5.7/mysql-5.7.9-osx10.10-x86_64.dmg
   - Mac OS X 10.9 (Mavericks)
     http://dev.mysql.com/get/downloads/mysql-5.7/mysql-5.7.9-osx10.9-x86_64.dmg

   For Linux, use one of the following:
   - Ubuntu
     sudo apt-get install mysql-server mysql-client mysql-workbench
   - Other distros
     http://dev.mysql.com/downloads/mysql/

   (If you are asked to log in, click "No thanks, just start my download.")

2. Install MySQL according to the specific instructions for your operating system.

   On Windows, the only thing that you need to install is MySQL Server. Optionally, you might want to install MySQL Workbench and MySQL Notifier as well.

   During the installation process, set the MySQL Root Password. For the moment, you do not need to add any other MySQL User Accounts.
   For convenience, add the MySQL executables to the system PATH.

   On Windows, the path to add is C:\Program Files\MySQL\MySQL Server 5.7\bin. Do not replace the entire PATH variable; append this path to the existing value, separated from the previous entry by a semicolon (;).

   On Mac OS X, to add the MySQL executables to your PATH you might need to run the following commands in a Terminal:

       echo 'export PATH="/usr/local/mysql/bin:$PATH"' >> ~/.bash_profile
       source ~/.bash_profile

   Furthermore, to set the MySQL Root Password, you might need to run:

       mysqladmin -u root password yourpasswordhere
Creating the SteelWheels database

3. Download the SteelWheels.sql file that has been published together with this lab. Save the file to some folder (e.g. Desktop).

4. Open a Command Prompt (or Terminal) and navigate to the folder where you have saved the SteelWheels.sql file.

5. Execute the following command:

       mysql -u root -p

   When prompted, give your MySQL Root Password.

6. You are now logged into MySQL. Execute the following commands to create the database and its user:

       CREATE DATABASE steelwheels;
       CREATE USER 'steelwheels'@'localhost' IDENTIFIED BY 'steelwheels';
       GRANT ALL PRIVILEGES ON steelwheels.* TO 'steelwheels'@'localhost';

7. Execute the following command to create the tables and load the data:

       SOURCE SteelWheels.sql

8. You have now created the SteelWheels database. Type quit to exit the MySQL client.
Add the MySQL database driver to PDI

9. Get the MySQL database driver from here:
   http://dev.mysql.com/get/downloads/connector-j/mysql-connector-java-5.1.37.zip

10. Open the ZIP file and extract the JAR file (mysql-connector-java-5.1.37-bin.jar) to some folder.

11. Copy/move the JAR file to the lib folder of your PDI installation (\data-integration\lib). You will find other JARs in that folder.

Creating a database connection in PDI

12. Open PDI (Spoon) and create a new transformation.

13. Change to the View tab, and expand Transformations > Transformation 1 > Database connections.

14. Right-click Database connections and select New.

15. In the Database Connection dialog:
    - In Connection Name write steelwheels
    - In Connection Type select MySQL
    - In Access select Native (JDBC)
    - In Host Name write localhost
    - In Database Name write steelwheels
    - In User Name write steelwheels
    - In Password write steelwheels
    - Click Test to test the connection.

16. Close the Database Connection Test dialog and close the Database Connection dialog with OK.
17. In the View tab, right-click the steelwheels database connection and select Share. This will make the database connection available to every transformation.

Exploring the database

18. Right-click the steelwheels database connection and select Explore.

19. In the Database Explorer dialog, expand steelwheels and Tables.

20. Right-click the customers table and select View SQL.

21. In the Simple SQL editor dialog, change the query to:

        SELECT CUSTOMERNUMBER, CUSTOMERNAME, CITY, COUNTRY FROM CUSTOMERS

22. Click Execute to preview the data.

23. Close the Examine preview data dialog, the Results of the SQL statements dialog, the Simple SQL editor dialog, and the Database Explorer dialog.

Querying the database in a transformation

24. Switch to the Design tab, and drag a Table input step to the transformation.

25. Configure the Table input step:
    - In Connection select steelwheels
    - Click on Get SQL select statement
    - Expand steelwheels and Tables
    - Select orders and click OK
    - When asked if you want to include the field names in the SQL, click Yes
    - At the end of the SQL statement, add: WHERE STATUS = 'Shipped'
    - Click on Preview and then OK

    A window with the preview data appears.
26. Close the window and press OK to close the step configuration.

27. Add the following sequence of steps to the transformation, connected in this order after the Table input step: Calculator, Number range, Sort rows, Select values.

28. Configure the Calculator step:
    - Add a new field called diff_days to calculate Date A - Date B (in days), where A is SHIPPEDDATE and B is REQUIREDDATE.

29. Configure the Number range step:
    - In Input field select diff_days
    - In Output field write delivery
    - Configure the Ranges as shown here:
      [Figure: Ranges table]

30. Configure the Sort rows step to sort by diff_days.
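As an illustration of what the Calculator and Number range steps compute, here is a minimal Python sketch. It is not how PDI implements these steps. The range boundaries (negative = early, zero = on time, positive = late) are an assumption, since the Ranges table of the lab is not reproduced in this text, and the sample orders are made up.

```python
from datetime import date


def diff_days(shipped: date, required: date) -> int:
    """Date A - Date B in days, with A = SHIPPEDDATE and B = REQUIREDDATE."""
    return (shipped - required).days


def delivery_label(days: int) -> str:
    # Hypothetical ranges; the lab's own Ranges table may differ.
    if days < 0:
        return "early"
    if days == 0:
        return "on time"
    return "late"


# Made-up sample orders: (ORDERNUMBER, REQUIREDDATE, SHIPPEDDATE)
orders = [
    (1, date(2004, 12, 10), date(2004, 12, 12)),
    (2, date(2004, 12, 10), date(2004, 12, 7)),
    (3, date(2004, 12, 10), date(2004, 12, 10)),
]

# Compute diff_days for each order, then sort by it (as the Sort rows step does).
rows = sorted(
    (diff_days(shipped, required), number) for number, required, shipped in orders
)
for days, number in rows:
    print(number, days, delivery_label(days))
```

A negative diff_days means that the order was shipped before the required date, which is why the sign of the difference is enough to classify the delivery.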
31. Configure the Select values step to select the fields delivery, ORDERNUMBER, REQUIREDDATE, and SHIPPEDDATE.

32. Click on the Select values step, and do a Preview.

33. How many orders were shipped earlier than the required date? How many orders were shipped on time? How many orders were shipped late?

34. Save your transformation.

Using query parameters

35. Add a Data grid step to the transformation and connect it to the Table input step.

36. Configure the Data grid step:
    - Change the Step name to Query params
    - In the Meta tab, create two new fields called date_from and date_to, both of type String.
    - Switch to the Data tab and, in the first row, write 2004-12-01 in the first column, and 2004-12-10 in the second column.
    - Make sure that you have only one row. Delete any additional (empty) rows that are not being used.

37. Double-click the Table input step to change its configuration:
    - At the end of the SQL statement, add the following line:
      AND ORDERDATE BETWEEN ? AND ?
    - Check the option Replace variables in script
    - In Insert data from step select Query params

38. Click on the Select values step, and do a Preview. You should now have only 10 rows. Check that the dates are within the date_from and date_to values.
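The ? placeholders work like positional parameters in a prepared statement: the first ? receives date_from and the second receives date_to, in the order of the fields in the Query params row. The sketch below shows the same idea in Python. It uses the standard sqlite3 module and an in-memory database as a stand-in for MySQL, and the orders rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL steelwheels database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (ORDERNUMBER INTEGER, ORDERDATE TEXT, STATUS TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2004-11-29", "Shipped"),
        (2, "2004-12-01", "Shipped"),
        (3, "2004-12-05", "Cancelled"),
        (4, "2004-12-10", "Shipped"),
        (5, "2004-12-11", "Shipped"),
    ],
)

# One row of parameters, in the same order as the ? placeholders:
# (date_from, date_to)
query_params = ("2004-12-01", "2004-12-10")

matching = conn.execute(
    "SELECT ORDERNUMBER FROM orders "
    "WHERE STATUS = 'Shipped' AND ORDERDATE BETWEEN ? AND ? "
    "ORDER BY ORDERNUMBER",
    query_params,
).fetchall()
print(matching)
```

BETWEEN is inclusive on both ends, so orders dated exactly on date_from or date_to are part of the result.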
Storing the results in the database

39. Add a Table output step to the transformation and connect Select values to the Table output. When connecting the two steps, select Main output of step.

40. Configure the Table output step:
    - In Connection select steelwheels
    - In Target table write results
    - Check the option Truncate table
    - Press the SQL button. The Simple SQL editor will appear with a CREATE statement.
      o Replace REQUIREDDATE UNKNOWN with REQUIREDDATE TIMESTAMP
      o Replace SHIPPEDDATE UNKNOWN with SHIPPEDDATE TIMESTAMP
      o Add a line with the primary key: PRIMARY KEY(ORDERNUMBER)
      o Press the Execute button to create the table.

41. Once the results table has been successfully created, you can close the Table output configuration with OK.
42. Click on the Table output step, and do a Preview. You should have the same results as before. The difference is that these results have now been written to the database.

43. Save your transformation.

44. Open a Command Prompt (or Terminal) and execute the following command:

        mysql -u steelwheels -p

    When prompted, give the password: steelwheels

45. You are now logged into MySQL. Execute the following command to select the steelwheels database:

        USE steelwheels;

46. Execute the following command to show the database tables:

        SHOW TABLES;

    Check that there is a results table.

47. Query the results table with:

        SELECT * FROM results;

48. Type quit to exit the MySQL client.
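The Truncate table option matters because of the primary key on ORDERNUMBER: without it, running the transformation a second time would try to insert the same order numbers again. The following Python sketch illustrates this under some assumptions. It uses the standard sqlite3 module as a stand-in for MySQL, DELETE FROM in place of TRUNCATE TABLE (which SQLite does not have), guessed column types for delivery and ORDERNUMBER, and made-up rows. The helper write_results is hypothetical and is not part of PDI.

```python
import sqlite3


def write_results(conn, rows):
    """Empty the results table, then insert the rows (truncate-and-load)."""
    with conn:  # commit on success, roll back on error
        # SQLite has no TRUNCATE TABLE; DELETE FROM plays the same role here.
        conn.execute("DELETE FROM results")
        conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    " delivery TEXT,"          # assumed type
    " ORDERNUMBER INTEGER,"    # assumed type
    " REQUIREDDATE TIMESTAMP,"
    " SHIPPEDDATE TIMESTAMP,"
    " PRIMARY KEY (ORDERNUMBER))"
)

# Made-up rows in the field order produced by the Select values step.
rows = [
    ("early", 1, "2004-12-10 00:00:00", "2004-12-07 00:00:00"),
    ("late", 2, "2004-12-10 00:00:00", "2004-12-12 00:00:00"),
]

write_results(conn, rows)
write_results(conn, rows)  # a second run does not fail or duplicate rows

count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(count)
```

If the DELETE FROM line is removed, the second call raises sqlite3.IntegrityError because the order numbers already exist in the table.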
Pencil & paper exercise

Consider a data migration system where the goal is to populate the following table:

    Person(firstname, lastname)

The input data comes from two different sources:

- The first data source is the table Customer(name, surname):

      +-----------+-----------+
      | name      | surname   |
      +-----------+-----------+
      | Alexander | Scott     |
      | Paul      | Wilson    |
      | Scott     | Craig     |
      | Simon     | Alexander |
      | Craig     | Wilson    |
      +-----------+-----------+
      5 rows in set

  The schema matches between the Customer table and the Person table are:

      Customer.name    -> Person.firstname
      Customer.surname -> Person.lastname

- The second data source is a CSV file with the following contents:

      [Figure: CSV file contents]

  The schema matches between the CSV file and the Person table are unknown.

Find these schema matches with a Naive Bayes learner, using the examples in table Customer as training data and the examples in the CSV file as test data.
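To check a hand calculation, the learner can be sketched in a few lines of Python. This is one possible formulation, not necessarily the one expected in the exercise: it treats each value as a single token, uses equal priors (both attributes have 5 training examples), and applies Laplace smoothing, which is an assumption. The CSV contents are not reproduced in this text, so the two test columns below are hypothetical.

```python
import math
from collections import Counter

# Training data: the Customer table, with name -> firstname, surname -> lastname.
customers = [
    ("Alexander", "Scott"),
    ("Paul", "Wilson"),
    ("Scott", "Craig"),
    ("Simon", "Alexander"),
    ("Craig", "Wilson"),
]

counts = {"firstname": Counter(), "lastname": Counter()}
for name, surname in customers:
    counts["firstname"][name] += 1
    counts["lastname"][surname] += 1

vocabulary = set(counts["firstname"]) | set(counts["lastname"])


def log_score(values, attribute):
    """Log of P(attribute) * product of P(value | attribute) over the column."""
    total = sum(counts[attribute].values())
    score = math.log(0.5)  # equal priors
    for value in values:
        # Laplace smoothing (assumption): add 1 to every count.
        probability = (counts[attribute][value] + 1) / (total + len(vocabulary))
        score += math.log(probability)
    return score


def best_match(values):
    """Return the Person attribute with the highest score for a column."""
    return max(counts, key=lambda attribute: log_score(values, attribute))


# Hypothetical test columns; replace them with the columns of the CSV file.
print(best_match(["Paul", "Simon"]))
print(best_match(["Wilson", "Wilson"]))
```

With this training data the vocabulary has 6 distinct values, so for example P(Wilson | lastname) = (2 + 1) / (5 + 6) = 3/11, while P(Wilson | firstname) = (0 + 1) / (5 + 6) = 1/11. Values that appear in neither training column score the same for both attributes and do not help to decide the match.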