Cleveland State University
CIS 612/CIS 712 Big Data & Parallel Database Processing Systems (3-0-3)

Prerequisites: CIS 530. CIS 611 Preferred.

Instructor: Dr. Sunnie S. Chung
Office Location: FH 222
Phone: 216 687 4661
Email: s.chung@csuohio.edu, sschung.cis@gmail.com
Webpage: http://eecs.csuohio.edu/~sschung
Office Time: Tues, Thur 2:00-4:00 PM (email me for an appointment)

Catalog Description:
Detailed study of modern database processing and parallel database systems for big data processing. The course first presents the concept of a transaction with ACID properties and concurrency control strategies in active database systems. The course continues with semi-structured/unstructured data processing strategies with JSON/HTML/XML data, XPath, and XQuery. The course then advances to big data processing strategies on the Hadoop file system with MapReduce and focuses on Massively Parallel Processing (MPP) systems for big data processing: NoSQL and NewSQL systems, and Cloud Computing platforms and infrastructures. The course covers data models, indexes, querying techniques, data processing methods, and ACID issues on such systems with Google's Bigtable, Hive, HBase, PigLatin, MongoDB, and VoltDB. Through projects that process real-time big data streams from popular social network sites, students will get hands-on experience with such big data processing systems. Finally, the course will explore the latest advances in industry research for big data processing and data analytics.

Key Concepts:
Transaction, Concurrency Control, Modern Database Programming, Semi-structured database processing, JSON, XML data processing, XPath, XQuery, Web Data Processing, Unstructured data processing, Massively Parallel Processing (MPP) systems, MapReduce, Hadoop, Cloud Computing platforms and infrastructures, Parallel Data Warehouse (PDW), OLAP, Big Data Processing strategies, Google's Bigtable, Hive, HBase, PigLatin, MongoDB, VoltDB, ACID in NoSQL, NewSQL, and Cloud Computing.

Expected Outcomes:
Upon successful completion of the course, the student will be able to:
- Understand a well-defined Transaction concept and concurrency control strategies in database processing systems;
- Create modern database applications that process non-traditional data: semi-structured data such as JSON or XML data, or unstructured data such as web logging data;
- Understand big data processing techniques and gain comprehensive knowledge of Massively Parallel Processing (MPP) systems, NoSQL/NewSQL systems, and Cloud Computing;
- Obtain hands-on experience with parallel data processing systems and tools, and with cloud computing platforms and infrastructures for big data processing;
- Build an infrastructure for big data processing systems;
- Gain exposure to the latest advances in database industry research in big data processing.

List of Required Materials:
Any RDBMS:
- Oracle Database 11g or higher. This is available at http://www.oracle.com/us/downloads/index.html
- Microsoft SQL Server 2014, Microsoft Visual Studio 2014 or higher
- Microsoft SQL Server Data Analytic Tool 2014
The Microsoft tools are available through the Microsoft Academic Alliance program: http://e5.onthehub.com/webstore/productsbymajorversionlist.aspx?ws=31b9929b-c09b-e011-969d-0030487d8897

Open Source Systems: Installation and setup details for these systems will be given in class.
- Hadoop/MapReduce and VM
- Hive
- HBase
- PigLatin
- MongoDB
- VoltDB

Text Book:
1. "Fundamentals of Database Systems". Elmasri / Navathe. 7th Edition. Addison-Wesley Pub Co. ISBN-13: 978-0133970777, ISBN-10: 0133970779
2. will be available on the class webpage
3. List of Selected Database Research Papers on Big Data Processing Systems and Data Analytics will be given in class

Supplemental Materials (will be given in class):
- Tutorials for Hadoop/MapReduce and VM
- Tutorials for NoSQL Systems: Hive, HBase, PigLatin, MongoDB
- Tutorials for NewSQL System: VoltDB

Official Calendar:
Please consult the university web page at: http://www.csuohio.edu/enrollmentservices/registrar/calendar/index.html
Final exam: Mon May 8 4:00-6:00 PM

Grading:
The course grade is based on a student's overall performance through the entire semester. The final grade is distributed among the following components:
1. Exams: 35% (15% Midterm, 20% Final)
2. Computer Labs: 30% (about 4 Lab Assignments)
3. Project and Presentation on Big Data Processing (2-person group project): 25%
4. Research Paper Presentation: 10%
I reserve the right to change the weighting and the number of assignments.

Additional Requirements for CIS 712 Doctoral Students:
Doctoral students who take CIS 712 must select an in-depth project to work on. (Examples of tentative project topics are given in the course schedule section below.)
Doctoral students who take CIS 712 must work on the project individually (instead of in a 2-person group).
The list of projects and research papers for doctoral students may be given separately in class. It may vary every year. A tentative selection of projects and papers is given at the end of the course schedule.
In each exam, one additional problem might be designed to be completed by doctoral students only.

A     93% +          Outstanding (student's performance is genuinely excellent)
A-    90% - 92.9%
B+    87% - 89.9%
B     80% - 86.9%    Student's performance satisfies every course requirement and is acceptable but not necessarily distinguished
B-    78% - 79.9%
C     70% - 77.9%    Student's performance does not satisfy every course requirement and is not acceptable to pass
F     < 70%          Failure

Examination Policy:
Students are allowed to bring to the tests a summary page (standard letter size) with their own notes. During the exams: (1) the use of books, cell phones, calculators, or any electronic devices is prohibited, and (2) students must not share any materials.

Make-Up Exam Policy:
No makeup exams will be given unless the instructor is notified and agrees in advance. Requests will be considered only in cases of exceptional demonstrated need.

Homework Policy:
Students are expected to attend all classes. Students are responsible for collecting the notes, handouts, and any other course material distributed during the class period. All assignments must be individually and independently completed and must represent the effort of the student turning in the assignment. Should two or more students turn in substantially the same solution or output, in the judgment of the instructor, the solution will be considered a group effort. All students involved in a group-effort homework will receive a zero grade for that assignment. A student turning in a group-effort assignment more than once will automatically receive an F grade for the course.

Late Assignment:
All lab assignments are due at the beginning of class on the date specified. Laboratory assignments handed in after the class has begun will be accepted with a 25% grade penalty for up to a week and then not accepted at all. All laboratory assignments must be completed. Failure to do so will lower your course grade by one additional letter grade.

Student Conduct:
Students are expected to do their own work. Academic misconduct, student misconduct, cheating, and plagiarism will not be tolerated. Violations will be subject to disciplinary action as specified in the CSU Student Conduct Code.
A copy can be obtained on the web page at http://www.csuohio.edu/studentlife/studentcodeofconduct.pdf or by contacting Valerie Hinton Hannah, Judicial Affairs Officer in the Department of Student Life (MC 106, email v.hintonhannah@csuohio.edu). For more information, consult the CSU Judicial Affairs web page at http://www.csuohio.edu/studentlife/jaffairs/faq.html

Course Schedule:
The schedule of topics and their order of coverage is given below. The schedule and topics to be covered may vary depending upon the progress made.

Weeks 1-2
Topic:
DBMS Architecture, Complex Queries, Advanced Topics in Views
Introduction to Big Data
Transaction, ACID
Concurrency Control

Weeks 3-4
Topic:
Modern Database Programming: Database Triggers, Stored Procedure, Embedded SQL, Dynamic SQL, JDBC/ODBC, PHP, User Defined Function (UDF), User Defined Type (UDT), User Defined Aggregate (UDA), Table Function, CLR, LINQ
Reading:
.NET Database Programmability and Extensibility in Microsoft SQL Server, by José A. Blakeley, et al. (Microsoft Corporation), in the proceedings of SIGMOD 2008
Elmasri, Chap. 21, 22
Weeks 5-6
Topic:
Modern Databases: Enhanced Data Models for Advanced Applications
Semi-Structured and Unstructured Databases:
XML Data Processing:
- XML Schema, Syntax/Semantics, Protocol
- XPath
- XQuery
Data Transformation from Semi-Structured to Relational
JSON Data Processing
Reading:
Elmasri, Chap. 12, 19, 20
Listed papers

Weeks 7-8
Topic:
Introduction to Information Retrieval and Web Data Processing
Data Models for Unstructured Data Processing
Reading:
Selected Papers
Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al. (Google, Inc.), in the proceedings of OSDI 2006

Weeks 9-10
Topic:
Big Data Processing and Parallel Computing:
Introduction of Big Data
Google's MapReduce Paradigm
Apache Hadoop File System for Parallel Processing
Reading:
MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean (Google) and Sanjay Ghemawat (Google), in the proceedings of OSDI 2004
Apache Hadoop in White Papers by Apache, Yahoo

Weeks 11-14
Topic:
Big Data Processing and Massively Parallel Processing Systems
NoSQL/NewSQL Systems
Pig Latin on Apache Hadoop by Yahoo and Apache
Data Warehouse HIVE with Hadoop by Facebook
HBase
Key Value Stores
MapReduce Join Algorithms
Parallel Data Warehouse with OLAP Query Processing: Microsoft Extended PDW with MapReduce and Hadoop; Oracle, Teradata
MongoDB
VoltDB
NoSQL vs NewSQL
ACID
Reading:
Tutorials
Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, et al. (Yahoo! Research), in the proceedings of SIGMOD 2008
Data Warehousing and Analytics Infrastructure at Facebook, by Ashish Thusoo, et al. (Facebook), in the proceedings of SIGMOD 2010
Petabyte Scale Databases and Storage Systems Deployed at Facebook, Dhruba Borthakur, et al., in the proceedings of SIGMOD 2014
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture, Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc.), SIGMOD 2014
The Big Data Ecosystem at LinkedIn, Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn), SIGMOD 2015
Avatara: OLAP for Webscale Analytics Products, Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, Sam Shah (LinkedIn), SIGMOD 2014
More papers will be given in class.

Topic:
Cloud Computing
Reading:
Microsoft Azure as a Self-Managing Database Service: Lessons Learned and Challenges Ahead, by Kunal Mukerjee, et al. (Microsoft), in the proceedings of IEEE Computer Society Technical Committee on Data Engineering 2014

Weeks 15-16
Topic:
Presentation on Significant Database Research in Big Data Processing, Cloud Computing, and more. List of Selected Papers will be given in class.
Reading:
Selected Papers

Tentative Technical Presentation Topics (may vary every year):
1. Semistructured/Unstructured Data Processing
2. Hadoop-based Data Warehousing and Analytics Infrastructure at Facebook
3. Parallel Computing for Big Data Processing: Google Cloud, Amazon Cloud, Hadoop-Based NoSQL Systems, NewSQL Systems
4. MapReduce: Simplified Data Processing on Large Clusters, by Google
5. Lämmel, Ralf. Google's MapReduce Programming Model Revisited.
6. Stream Processing: Spark
7. NoSQL Systems: Pig Latin, HBase, Hive, MongoDB
8. MapReduce Join Algorithms
9. Data Partition Techniques
10. Performance Survey: SQL vs NoSQL
11. Processing MR/Hadoop with PDW: Oracle, Teradata
12. Information Retrieval: Google Search Engine
13. Big Data Integration Systems
14. Cloud Computing: Microsoft Azure, Amazon Cloud, Google Cloud
15. More on these

Tentative List of Research Papers and Projects for CIS 712 Doctoral Students:
CIS 712 doctoral students should choose one of the research topics, give a 30-minute presentation on the papers in that topic, and complete a project related to the subject. The paper list and project specification for each research topic will be given in class.

ADA Adherence:
If you need course adaptations or accommodations because of a disability, if you have emergency medical information to share with me, or if you need special arrangements in case the building must be evacuated, please make an appointment with me as soon as possible.
My office location and hours are listed at the top of this syllabus. If you need further information, please contact the Office of Disability Services (Main Classroom 147), phone number 216.687.2015, or on the web at http://www.csuohio.edu/offices/disability/.