Big Data with Hadoop. VIII Session - SQL Bahia. Impala, Hive, and Spark. Diógenes Pires, 03/03/2018


1 Big Data with Hadoop: Impala, Hive, and Spark. VIII Session - SQL Bahia, 03/03/2018. Diógenes Pires

2 Connect with PASS Sign up for a free membership today at: pass.org #sqlpass

5 Internet Live

6 Introduction

7 Big Data is not Bitcoin

8 Sources for Big Data: Data Warehouse; RDBMS; Web server log files; Social media content; Business reports; Texts from consumers to the company; Macroeconomic indicators; Satisfaction surveys; IoT; CRM

9 Definitions
Business intelligence (BI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Business analytics is comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states. Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user.
Source: Gartner.

10 Other Concepts: Cognitive Computing; Data Discovery; Data Lake; Data Science; Machine Learning; Self BI; Fast Data

11 Landscape

12 Google File System (GFS or GoogleFS)
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. A new version of Google File System, code-named Colossus, was later released. Source: Wikipedia.
Timeline: 2003 GFS; 2004 MapReduce; 2006 Bigtable

13 Apache Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Source: Apache Hadoop.

14 Apache Hadoop
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

15 Other Hadoop Projects

16 Hadoop Architecture

17 Processing: Types of Processing
Batch processing: information is collected or received, stored, and then processed later as a group.
Online processing: information is processed at the same time as it is registered, so results stay up to date.
Real-time processing: information is processed the moment it is registered, and the result immediately triggers a subsequent processing step. Examples: autopilot, GPS.
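
The difference between the batch and online styles can be illustrated with a minimal Python sketch. It is not taken from the slides: the function names `batch_total` and `online_totals` and the sample sales figures are hypothetical, chosen only to show that batch code works over stored data in one pass while online code updates its result as each record arrives.

```python
from typing import Iterable, Iterator, List


def batch_total(stored_records: List[float]) -> float:
    """Batch style: the records are already stored; process them in one pass."""
    return sum(stored_records)


def online_totals(incoming: Iterable[float]) -> Iterator[float]:
    """Online style: update and emit the result as each record is registered."""
    running = 0.0
    for value in incoming:
        running += value
        yield running


if __name__ == "__main__":
    sales = [10.0, 5.0, 7.5]          # hypothetical sample data
    print(batch_total(sales))          # one result after everything is stored
    print(list(online_totals(sales)))  # one updated result per record
```

Both functions reach the same final total; what differs is when intermediate results become available.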

18 Batch Processing

19 Example

20 MapReduce
A programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. Source: IBM.
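
As a rough illustration of the paradigm, the sketch below runs a word count through map, shuffle, and reduce steps inside a single Python process. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on in-memory lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be spread across servers, which is where the scalability comes from.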

21 MapReduce

22 Apache Hive
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive. Source: Hive.org.
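
The idea that structure can be projected onto data already in storage (often called schema-on-read) can be sketched in plain Python. This is only an analogy, not Hive code: the `project_schema` function, the `Schema` alias, and the sample columns are hypothetical, and the raw text stands in for a file that already exists in distributed storage.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List, Tuple

# A schema is an ordered list of (column name, conversion function) pairs.
Schema = List[Tuple[str, Callable[[str], object]]]


def project_schema(raw_text: str, schema: Schema,
                   delimiter: str = ",") -> Iterator[Dict[str, object]]:
    """Apply a schema to raw delimited text at read time, leaving the data as is."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter)
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


if __name__ == "__main__":
    raw = "1,alice,3.5\n2,bob,4.0\n"  # hypothetical stored data
    schema = [("id", int), ("name", str), ("score", float)]
    for row in project_schema(raw, schema):
        print(row)
```

The stored text is never rewritten; a different schema could be applied to the same bytes later, which is the property the slide describes.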

23 Apache Architecture

24 Online Processing

25 Example

26 Cloudera Impala
Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Cloudera Impala query UI in Hue) as Apache Hive. Source: Cloudera.

27 Impala Architecture

28 Real Time Processing

29 Example

30 Apache Spark
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. Source: spark.apache.org.

31 Apache Spark
