Big Data with Hadoop Ecosystem

Diógenes Pires
Big Data with Hadoop Ecosystem: Hands-on (HBase, MySQL and Hive + Power BI)

Internet Live Stats: http://www.internetlivestats.com/

Introduction

Business Intelligence

Business Intelligence Process

Some tools

Sources for Big Data
Data warehouse; RDBMS; web server log files; social media content; business reports; texts of consumer emails to the company; macroeconomic indicators; satisfaction surveys; IoT; CRM

Examples

Example

Main Concepts

Definitions
Business intelligence (BI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
Business analytics is comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states. Business analytics includes data mining, predictive analytics, applied analytics and statistics, and is delivered as an application suitable for a business user.
Source: Gartner.

Other Concepts
Cognitive computing; data discovery; data lake; data science; machine learning; self-service BI; fast data

The competitive advantages
Identification of patterns; competitor analysis; product development; data-driven marketing; measuring customer dissatisfaction

Big Data

Landscape

Big Data is not Bitcoin

Google File System (GFS or GoogleFS)
Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. A new version of Google File System, code-named Colossus, was released in 2010.
Source: Wikipedia.
Timeline: 2003 GFS; 2004 MapReduce; 2006 Bigtable

Apache Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Source: Apache Hadoop.

Apache Hadoop
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop Projects

Distributions

Architecture

Hadoop Architecture

MapReduce
A programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Source: IBM.
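The paradigm can be illustrated with a minimal, single-process sketch of a word count. This is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel on many servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input, standing in for lines read from files in the cluster.
    sample = ["big data with hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

Because each call to map_phase sees only one line and each call to reduce_phase sees only one key, both steps can be spread across machines without coordination, which is where the scalability comes from.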

MapReduce

Frameworks

Apache Hive
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.
Source: hive.apache.org.
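The idea that structure can be projected onto data already in storage can be sketched in plain Python, without Hive itself. Everything here is hypothetical: RAW_LOG stands in for a delimited file already in distributed storage, SCHEMA plays the role of a table definition, and total_by_customer roughly mirrors a SQL GROUP BY query.

```python
import csv
import io

# Hypothetical raw file contents, standing in for data already in storage.
RAW_LOG = "1,alice,34.50\n2,bob,12.00\n3,alice,7.25\n"

# The "table definition": column names and types, applied only at read time.
SCHEMA = [("order_id", int), ("customer", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Project the schema onto raw delimited text, yielding one typed dict per row."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_by_customer(rows):
    """Rough equivalent of: SELECT customer, SUM(amount) FROM orders GROUP BY customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(total_by_customer(read_with_schema(RAW_LOG, SCHEMA)))
```

The raw text is never rewritten; only the schema decides how it is interpreted, so a different schema could be applied to the same stored data later.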

Apache Hive Architecture

Cloudera Impala
Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Cloudera Impala query UI in Hue) as Apache Hive.
Source: Cloudera.

Impala Architecture

NoSQL
NoSQL is a term used to describe high-performance non-relational databases. NoSQL databases use a variety of data models, including documents, graphs, key-values, and columnar data.
Source: Amazon.
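As a rough illustration of the four data models named above, the same hypothetical customer can be written out in each of them using plain Python structures. The record, its field names, and the two helper functions are invented for this sketch and do not correspond to any particular database product.

```python
# One hypothetical customer expressed in four NoSQL data models.

# Key-value: an opaque value looked up by a single key.
key_value = {"customer:42": '{"name": "Ana", "city": "Recife"}'}

# Document: a nested, self-describing record with its orders embedded.
document = {
    "_id": 42,
    "name": "Ana",
    "city": "Recife",
    "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Columnar (wide-column): row key -> column family -> column -> value.
wide_column = {
    "42": {
        "profile": {"name": "Ana", "city": "Recife"},
        "orders": {"A1": 2, "B7": 1},
    }
}

# Graph: nodes plus labeled edges between them.
graph = {
    "nodes": {
        "c42": {"type": "customer", "name": "Ana"},
        "pA1": {"type": "product", "sku": "A1"},
    },
    "edges": [("c42", "BOUGHT", "pA1")],
}


def total_quantity(doc):
    """Sum order quantities directly from the embedded list; no join is needed."""
    return sum(order["qty"] for order in doc["orders"])


def products_bought(g, customer_node):
    """Follow BOUGHT edges from a customer node and return the product SKUs."""
    return [
        g["nodes"][target]["sku"]
        for source, label, target in g["edges"]
        if source == customer_node and label == "BOUGHT"
    ]


if __name__ == "__main__":
    print(total_quantity(document))
    print(products_bought(graph, "c42"))
```

In the document model the orders travel with the customer, so a question that would need a join between two relational tables becomes a single read.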

NoSQL
No pain, no gain; NoSQL, no join.

HBase
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and is written in Java.
Source: Hive.apache.org.

HBase Example
Relational view vs. column family view
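The two views can be compared with a small sketch that converts relational rows into an HBase-style layout, where each cell is addressed by row key and "family:qualifier" and carries a timestamp. This does not use the HBase API; the function name, the sample rows, and the family names ("info", "contact") are assumptions chosen for illustration.

```python
def to_column_family_view(rows, key_column, families, timestamp):
    """Turn relational rows into cells keyed by row key and 'family:qualifier'."""
    table = {}
    for row in rows:
        row_key = str(row[key_column])
        cells = {}
        for family, columns in families.items():
            for column in columns:
                value = row.get(column)
                if value is not None:  # missing values are simply not stored
                    cells[f"{family}:{column}"] = (timestamp, value)
        table[row_key] = cells
    return table


if __name__ == "__main__":
    # Relational view: every row has every column, with NULLs where data is missing.
    relational_rows = [
        {"id": 1, "name": "Ana", "city": "Recife", "email": "ana@example.com"},
        {"id": 2, "name": "Bruno", "city": None, "email": None},
    ]
    # Hypothetical grouping of columns into families.
    column_families = {"info": ["name", "city"], "contact": ["email"]}

    view = to_column_family_view(relational_rows, "id", column_families, timestamp=1)
    for row_key, cells in view.items():
        print(row_key, cells)
```

The second row ends up with a single cell, which shows how the column family layout stays sparse where the relational row would hold NULLs.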

Hands-on

Communication channels
Hands-on at www.bilivre.com.br
facebook.com/bilivre (BI Livre)