Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Allan Stewart Cunningham
5 years ago

Big Data Analyst
- INTRODUCTION TO BIG DATA ANALYTICS
- ANALYTICS PROCESSING TECHNIQUES
- DATA TRANSFORMATION & BATCH PROCESSING
- REAL TIME (STREAM) DATA PROCESSING

Big Data Engineer
- BIG DATA FOUNDATION CONCEPTS (USE CASE, DATA STRUCTURE DESIGN & DISTRIBUTED ALGORITHMS)
- BIG DATA STORAGE (DATA STORE ON CLOUD, HADOOP, HDFS, HBASE)
- BIG DATA PROCESSING (VIRTUAL MACHINES, MAP REDUCE, APACHE SPARK WITH SCALA)

Big Data Architect
- INTRODUCTION TO HADOOP ECOSYSTEM
- BIG DATA ETL (DATA WAREHOUSING, ETL, APACHE SQOOP, FLUME, HIVE)
- STREAMING BIG DATA (APACHE STORM, SPARK, KAFKA, CASSANDRA)

TRAINING PROGRAM FOR BIG DATA ANALYST

This course is suitable for Software Professionals.
Prerequisites: Java Programming, SQL [RDBMS], Basic Statistics
Total Program Duration: 75 days
Hours per week: 6
Total Hours: 63 hours

[Week 1-6] [33 hours]

Introduction to Big Data Analytics [8 hours]
- Analytics problems & Applications
- Regression
- Classification
- Clustering
- Overview of MLlib
- Overview of QlikView

Analytics Tasks & Processing Techniques [21 hours]
- Defining Regression
- Learning to use MLlib for Linear Regression
- QlikView for visualization of Result
- Defining Classification
- Forms of Classifier Models
- Learning to use MLlib for Classification
- Defining Clustering
- MLlib for Clustering

Case Study & Assignment [4 hours]
a. Credit Card Fraud detection
b. Customer segmentation for targeted marketing

[Week 7-8] [13 hours]

Data Transformation & Batch Processing [7 hours]
- Introduction to Hive - Interfaces, MetaStore
- Hive vs. Relational Database Systems
- Hive: File Formats
- Querying in Hive (HiveQL)
- Complex Queries

Hive vs. HBase [1.5 hours]
- Comparing Hive and HBase
- Use Cases - Hive/HBase
- When to/not-to use Hive / HBase

Query Optimization [1 hour]
- Need for query optimization
- Some optimization techniques (2 or 3 - ORC file, CBO, Vectorization or Bucketing)

Batch processing with Hive [3.5 hours]

[Week 9-11] [17 hours]

Introduction to Streaming Data [4 hours]
- Characteristics of streaming data
- Components of a real time stream processing system
- Features of a real time stream processing architecture
- Twitter Sentiment Analysis

Apache Storm [4 hours]
- Trident - Data Streaming Library Overview
- Case Study: Real Time Processing of ecommerce data

Streaming on Spark [9 hours]
- Understanding Spark Streaming API
- Understanding DStream Abstraction
- Processing a Data Stream
- Case Study - Real Time Processing of ecommerce data (extending the previous case study)

TRAINING PROGRAM FOR BIG DATA ENGINEER

This course is suitable for Software Professionals.
Prerequisites: Java Programming; Graduation Level Data Structures & Algorithms; Object Oriented Programming; Basics of SQL [RDBMS]
Total Program Duration: 140 days

BIG DATA FOUNDATION CONCEPTS [Week 1-6] [35 hours]

Big Data & Real-Life Applications [1 hour]
- Introduction - Big Data & Its Various Aspects
- Major Sources of Big Data
- Types of Data
- 4 V's of Data
- Data Models - Structured, Semi-Structured and Unstructured data

Big Data Industry Use Cases [1 hour]
a. Case Study 1: Churn Prediction (Mobile companies or Credit Card companies like AmEx etc.)
b. Case Study 2: Product Recommendations on a Retail Website (eBay/Snapdeal etc.)
c. Case Study 3: Getting relevant Search results using Google search

Conventional Data Processing System & Big Data

BIG DATA STORAGE [Week 7-9] [16 hours]

Data Store on the Cloud [6 hours]
- SQL & NoSQL Databases on Cloud
- Amazon S3
- DynamoDB
- Setup & Operations

Introduction to HADOOP & HDFS [3 hours]
- What is Hadoop?
- Hadoop Clusters & Features
- Nature of Data
  a. Structured / Unstructured Data
  b. Examples, Use Cases
- Data Stores and Processing

BIG DATA PROCESSING [Week 10-20] [63 hours]

VIRTUALIZATION TECHNOLOGY & INFRASTRUCTURE [6 hours]
- Virtual Machines
- What is Virtualization
- Setting up Virtual machine using Cloudera/Hortonworks
- Amazon EC2 - Amazon Web Services (AWS) and AWS EC2
- Operations on Amazon EC2 virtual machine

Algorithm Design using Map-Reduce [12 hours]
- SPMD Programming - Map: Use Case and Examples
- Performance Analysis and Issues in using Map
- SPMD Programming - Tree Parallelism - Reduce: Use case and Examples
- Performance Analysis and Issues in using Reduce

[Week 1-6]

Data Abstraction [7 hours]
- What is Abstraction?
- Data Abstraction - User vs. Provider Perspectives, Data Types and Abstract Data Types (ADT)
- Interface vs. Internals
- Realizing Abstraction in Procedural Languages

Data Structures (Linear) [12 hours]
- Dictionary Data Structure: Lookup Time - Array, Sorted Array; Pre-processing Time & Amortized Cost
- Pre-processing Time - Sorting Time, Sorting with a Known Range
- Hashtable - Load Factor, Sizing, and Rehashing
- Polymorphism
- Key-value pairs and Hashmaps
- Bloom Filters - Use cases, Design, Implementation

[Week 7-9]

Hadoop Distributed File System (HDFS) [4 hours]
a. HDFS Components and Architecture - Blocks and Nodes
b. HDFS Commands and Command Line Interface
c. HDFS Java API & Usage
d. HDFS - Basic File Input/Output
e. Storage & Load Balancing

HBASE [3 hours]
a. Need and Use of HBASE - Schemas and Queries
b. HBASE - Java API
c. Comparison with traditional Relational Database Systems

[Week 10-20]

Algorithm Design using Map-Reduce (continued)
- Map-Reduce Programming: Composing Map and Reduce Examples
- Map-Reduce Programming: Iterative Map-Reduce

Programming using MR on Hadoop [6 hours]
a. Setting up key-value pairs, identifying map tasks and reduce tasks, connecting map tasks to reduce tasks
b. Writing a program using multiple mappers and reducers - Examples
c. Performance and scalability; deciding number of mappers and reducers - Scheduling and Tuning
d. Sorting and Joins in the Map-Reduce Model

Apache Spark with Java [14 hours]
- Architecture & Programming
- In-memory Processing; Java programming on Spark
- Programming on Spark: RDDs
- Programming on Spark: DataFrames & DataSets

[Week 1-6]

Data Structures (Non-Linear) [7.5 hours]
- Tree - Structure and Definitions
- Search Tree - Motivation; Design of Binary Search Trees; Time Taken for Queries
- Range Queries and Efficiency Issues
- kd Trees - Use Cases, Design, Implementation

Distributed Algorithms [1.5 hours]
- Algorithm Design - Review of Basics
- Top-Down Design - Review: Characteristics & Pragmatics
- Divide-and-Conquer Design - Review: Characteristics and Pragmatics

Distributed Algorithms - Design & Performance [4 hours]
- Abstract Machine Model and Design Approach
- Divide-and-Conquer Design for Distributed Execution; Example
- Performance Model for Distributed Algorithms - Speedup; Communication Cost

[Week 10-20]

Apache Spark with Scala [25 hours]

Introduction
- Introduction & Create a Histogram of Real Movie Ratings with Spark
- What is Scala
- Flow Control in Scala
- Functions & Data Structures in Scala

A. Spark Basics & Primitive Examples
- Introduction to Spark
- The Resilient Distributed Dataset
- Ratings Histogram Walkthrough
- Spark Internals
- Key/Value RDDs and the Average Friends by Age example
- Filtering RDDs and the Minimum Temperature by Location Example
- Using flatmap with wordcount example

B. Advanced Examples of Spark Programs
- Superhero Degrees of Separation: Introducing Breadth-First Search
- Superhero Degrees of Separation: Accumulators and Implementing BFS in Spark

[Week 10-20]

Apache Spark with Scala (continued)
- Superhero Degrees of Separation: Review the Code and Run It!
- Item-Based Collaborative Filtering in Spark, cache(), and persist()

C. Running Spark on a Cluster
- Introducing Amazon Elastic MapReduce
- Creating Similar Movies from One Million Ratings on EMR
- Partitioning
- Best Practices for Running on a Cluster
- Troubleshooting and Managing Dependencies

D. SparkSQL, DataFrames & Datasets

TRAINING PROGRAM FOR BIG DATA ARCHITECT

This course is suitable for Software Professionals.
Prerequisites: Primary Overview of Hadoop Ecosystem; Object Oriented Programming Fundamentals; Knowledge of Scala & Apache Spark with Scala
Total Program Duration: 120 days

[Week 1-5] [28 hours]

Introduction to HADOOP & HDFS [3 hours]
- What is Hadoop?
- Hadoop Clusters & Features
- Nature of Data
  a. Structured / Unstructured Data
  b. Examples, Use Cases
- Data Stores and Processing

Hadoop Distributed File System (HDFS) [4 hours]
a. HDFS Components and Architecture - Blocks and Nodes
b. HDFS Commands and Command Line Interface
c. HDFS Java API & Usage
d. HDFS - Basic File Input/Output
e. Storage & Load Balancing

[Week 6-11] [33 hours]

Data Warehousing & ETL [5.25 hours]
- Extraction, Transformation & Load
- ETL vs ELT
- Data Warehousing Fundamentals
- Facts and Dimension tables
- Relational vs. Multi-Dimensional Data Representation
- Reports, Dashboard and Score Card
- ETL - Relevance in the Big Data scenario - Data Lakes

Sqoop in Hadoop [6.80 hours]
- What is data ingestion
- Sources of structured/unstructured data/real time/streaming data

[Week 12-17] [34 hours]

Introduction to Streaming Data [4 hours]
- Streaming Data
- Characteristics of streaming data
- Components of a real time stream processing system
- Features of a real time stream processing architecture
- Social Media Data

Apache Storm [9 hours]
- Elements of a stream processing system
- Components of a Storm Cluster
- Configuration of Storm Cluster
- Trident - Data Streaming Library

[Week 1-5]

HBASE [3 hours]
a. Need and Use of HBASE - Schemas and Queries
b. HBASE - Java API
c. Comparison with traditional Relational Database Systems

Algorithm Design using Map-Reduce [12 hours]
- SPMD Programming - Map: Use Case and Examples
- Performance Analysis and Issues in using Map
- SPMD Programming - Tree Parallelism - Reduce: Use case and Examples
- Performance Analysis and Issues in using Reduce
- Map-Reduce Programming: Composing Map and Reduce Examples
- Map-Reduce Programming: Iterative Map-Reduce

[Week 6-11]

Sqoop in Hadoop (continued)
- Motivation and Usage of Sqoop - Import data to Hadoop; Different data / file formats
- Sqoop and MapReduce - The Import Process
- Sqoop - Performance: Importing Large Objects
- Data export operation using Sqoop

Apache Flume [8 hours]
- Events and Flows: Motivation for Flume
- Using Flume - Ingestion of events / log data; Ingestion of streaming data
- Flume - Flows (Multi-hop, Consolidation, Replication, Multiplexing) and Configuration (multi-agent, fan-out)
- Flume - (select) Sources and Configuration
- Log Processing using Flume

[Week 12-17]

Apache Spark [9 hours]
- Introduction
- Understanding Spark Streaming API
- Understanding DStream
- Processing a Data Stream
- Case Study - Real Time Processing of ecommerce data

Apache Kafka [3 hours]

A. Introduction & Architecture
- Overview
- Building Applications in the Publish-Subscribe Architecture
- Topics and Partitions
- How Kafka Cluster is Built? Brokers
- Setup of Kafka - Using Docker

B. Producers & Consumers
- Sending Events to Kafka - Producers API
- Asynchronous Send
- Partitioning of Topics - Implementing Custom Partitioner
- Reading Events from Kafka - Consumer API
- Consumer Poll Loop - Offset Management
- Rebalancing of Consumers

[Week 1-5]

Programming using MR on Hadoop [6 hours]
a. Setting up key-value pairs, identifying map tasks and reduce tasks, connecting map tasks to reduce tasks
b. Writing a program using multiple mappers and reducers - Examples
c. Performance and scalability; deciding number of mappers and reducers - Scheduling and Tuning
d. Sorting and Joins in the Map-Reduce Model

[Week 6-11]

Apache Hive [13.25 hours]
- Hive - Interfaces, MetaStore
- Hive vs. Relational Database Systems
- Hive: File Formats
- Querying in Hive (HiveQL)
- Complex Queries

A. Hive vs. HBase
- Comparing Hive and HBase
- Use Cases for Hive/HBase

B. Query Optimization
- Optimization Techniques (ORC file, CBO, Vectorization or Bucketing)

C. Batch processing with Hive

[Week 12-17]

Apache Kafka (continued)

C. Understanding Internals
- Electing Partition Leaders - Kafka Controller Component
- Data Replication in Kafka
- Append-Only Distributed Log - Storing Events in Kafka
- Compaction Process

Apache Cassandra [9 hours]
- Introduction to Cassandra
- Cassandra Distributed Architecture
- Diagnostics
- Data Modelling Principles
- Data Modelling in Cassandra
- Optimization of Data
- Connecting Spark with Cassandra
- Integrating Cassandra with Spark Streaming
