Introduction into Big Data analytics Lecture 2 Big data platforms. Janusz Szwabiński

1 Introduction into Big Data analytics Lecture 2 Big data platforms Janusz Szwabiński

2 Outlook of today's talk: Available Big Data Sets; Project suggestions; Big data platforms

3 Available Big Data Sets. Pointers to data sets: How To Get Experience Working With Large Datasets (Quora); KDNuggets Datasets for Data Mining and Data Science; Research Pipeline; Google Public Data Directory

4 Available Big Data Sets. Generic repositories: AWS Public Datasets; Comprehensive Knowledge Archive Network; Stanford Large Network Dataset Collection; Open Flights; ASA Flight data; Wikipedia

5 Available Big Data Sets. Geo data: OpenStreetMap; Natural Earth Data; GeoNames; Libre Map Project

6 Available Big Data Sets. Web data: Google n-gram; Public Terabyte Dataset Project; Freebase Data; StackOverflow; UCI KDD Data

7 Available Big Data Sets. Government data: European Parliament proceedings; US government data; UK government data; US Patent and Trademark Office; World Bank data; Public health data sets; Aid information; UN data; Polish Statistical Office

8 Suggestions for projects: Large-scale data pollution with Apache Spark (a method to pollute a clean, homogeneous and large data set from an arbitrary domain with duplicates, errors and inhomogeneities); Trend prediction in fashion; Quote search engine; Real-time analysis of Twitter's public stream with Storm; Correlating price/volume of low-volume stocks with social media (search information related to future price and volume movements, find indicators to predict abnormal price or volume changes)

9 Suggestions for projects: Analysis of cancer genome (big) data (TCGA data set; goal: patient-based treatment recommendation); Stock signal generation using real-time Twitter analysis (develop a scoring mechanism that summarizes Twitter news, generate a real-time signal that could be used to make trading decisions); Music recommendation system with geospatial information (MMTD, the Million Musical Tweets Dataset)

10 Suggestions for projects: How to name your new-born baby? (prediction of trends in baby names around the world); Error correction in OCR datasets; Movie exploration/recommendation system; Best transport choice; Fake reviews detection; Food identification in photos. See e.g. Oscar/Golden Globe award analysis

11 Suggestions for projects Interesting ideas for trendy writers Image-based geolocalization Animal identification in photos Plant identification in photos Currency trend analyzer data source: Exploring the meetup.com social world

12 Big data platforms: a one-stop solution for Big Data needs; an integrated IT solution for developing, deploying and managing Big Data; combines several software systems, tools and hardware to provide an easy-to-use system to enterprises. Important features: able to accommodate new tools based on business requirements; supports linear scale-out; capable of rapid deployment; supports a variety of data formats; provides data analysis and reporting tools; provides real-time data analysis software; has tools for searching through large data sets

13 Hadoop: an open-source software framework for storing data and running applications on clusters of commodity hardware. Why is it so important? Ability to store and process huge amounts of any kind of data, quickly. Computing power: Hadoop's distributed computing model processes big data fast; the more computing nodes you use, the more processing power you have. Fault tolerance: data and application processing are protected against hardware failure; if a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail, and multiple copies of all data are stored automatically. Flexibility: unlike traditional relational databases, you don't have to preprocess data before storing it. Low cost: the open-source framework is free and uses commodity hardware to store large quantities of data. Scalability: you can easily grow your system to handle more data simply by adding nodes, with little administration effort.
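The fault-tolerance point above can be illustrated with a toy model. The sketch below is not Hadoop code and does not use any Hadoop API; the node names, the round-robin placement rule and the helper functions are hypothetical, and the replication factor of 3 is an assumption chosen to match the usual HDFS default. It only shows why keeping several copies of each block lets the data survive the loss of a node.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin rule)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes, wrapping around, so replicas never share a node.
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """A block stays readable while at least one replica sits on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

replicated = place_blocks(6, nodes, replication=3)
unreplicated = place_blocks(6, nodes, replication=1)

# With three copies, losing node1 costs nothing; with one copy, blocks are lost.
print(readable_blocks(replicated, {"node1"}))    # [0, 1, 2, 3, 4, 5]
print(readable_blocks(unreplicated, {"node1"}))  # [1, 2, 3, 5]
```

A real cluster additionally re-creates the missing replicas on the surviving nodes; the sketch leaves that step out.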

14 Hadoop Source:

15 Hadoop challenges: MapReduce programming is not a good match for all problems (good for simple information requests and problems that can be divided into independent units; not efficient for iterative and interactive analytic tasks). A widely acknowledged talent gap: it can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce; distribution providers are racing to put relational (SQL) technology on top of Hadoop. Data security issues: the Kerberos authentication protocol is a great step toward making Hadoop environments secure. Lacking tools for data quality and standardization. Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.

16 Hadoop important application domains: Digital Marketing Optimization; Data exploration and discovery (product and sales data for online shopping portals and stores); Fraud detection and prevention; Social networks and relationships in the network; Fraud detection in banking; Fraud detection for the telecom industry; Data retention (for retaining long-term data and for archiving purposes); Insurance; Healthcare

17 Hadoop-based commercial platforms: Cloudera; Amazon EMR; Hortonworks; MapR; IBM Open Platform; Microsoft HDInsight; Intel Distribution for Apache Hadoop; Datastax Enterprise Analytics; Teradata's Hadoop for Enterprise; Pivotal HD

18 Cloudera: one of the first commercial Hadoop-based platforms. Interesting (and free) download: QuickStarts for CDH, virtualized clusters for easy installation on your desktop; a single-node cluster that makes it easy to quickly get hands-on with CDH for testing, demo, and self-learning purposes; includes Cloudera Manager for managing the cluster; tutorial, sample data, and scripts for getting started included; deployed via Docker containers or VMs

19 What is Docker? A computer program that performs operating-system-level virtualization (aka containerization), developed by Docker, Inc. Designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package. Thanks to the container, the developer can rest assured that the application will run on any other Linux machine, regardless of any customized settings that machine might have that could differ from the machine used for writing and testing the code.

20 What is Docker? Primarily developed for Linux: it uses the resource isolation features of the Linux kernel and a union-capable file system (e.g. OverlayFS). Independent "containers" run within a single Linux instance. Docker is a bit like a virtual machine, but rather than creating a whole virtual operating system, it allows applications to use the same Linux kernel as the system that they're running on, and only requires applications to be shipped with things not already running on the host computer. There is no overhead of starting and maintaining virtual machines (VMs), which gives a significant performance boost and reduces the size of the application. A Windows version of Docker is also available.

21 Amazon EMR: a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. Other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink are also available. Possible interaction with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. Secure and reliable handling of a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics. Simple and predictable pricing: a per-second rate for every second used, with a one-minute minimum charge; a 10-node Hadoop cluster: $0.15 per hour.
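The pricing rule on this slide (per-second billing with a one-minute minimum) can be written out as a small calculation. This is only a sketch of the arithmetic: the function name is hypothetical, the $0.15 hourly rate is the slide's example figure for a 10-node cluster, and actual AWS prices and charging details may differ.

```python
def billed_cost(runtime_seconds, hourly_rate_usd=0.15, minimum_seconds=60):
    """Cost of one run under per-second billing with a minimum billed duration.

    hourly_rate_usd defaults to the slide's example figure; it is not a quote
    of current AWS pricing.
    """
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must be non-negative")
    # Anything shorter than the minimum is billed as the minimum.
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return hourly_rate_usd * billed_seconds / 3600


# A 30-second job is billed as a full minute.
print(f"{billed_cost(30):.4f}")    # 0.0025
# A 90-minute job is billed for exactly 5400 seconds.
print(f"{billed_cost(5400):.4f}")  # 0.2250
```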

22 Amazon EMR good to know: AWS Free Tier (12 Month Introductory Period)

23 Hortonworks: a leading innovator in the data industry, creating, distributing and supporting enterprise-ready open data platforms and modern data applications; 100% open-source software without any proprietary software. The Hortonworks Hadoop distribution is enterprise-ready, with the following features: centralized management and configuration of clusters; built-in security and data governance; centralized security administration. Hortonworks Sandbox: a virtual machine with Hadoop preconfigured; a set of hands-on tutorials to get you started with Hadoop; an environment to help you explore related projects in the Hadoop ecosystem like Apache Pig, Apache Hive, Apache HCatalog and Apache HBase.

24 MapR: provides access to a variety of data sources from a single computer cluster, including: big data workloads such as Apache Hadoop and Apache Spark; a distributed file system; a multi-model database management system; event stream processing, combining analytics in real time with operational applications. Its technology runs on both commodity hardware and public cloud computing services.

25 IBM Open Platform ache-hadoop native support for rolling upgrades for Hadoop services; support for long-running applications within YARN for enhanced reliability & security; heterogeneous storage in HDFS for in-memory and SSD in addition to HDD; Spark in-memory distributed compute engine; Java, Python & Scala languages; Apache Hadoop projects included: HDFS, YARN, MapReduce, Ambari, HBase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, ZooKeeper, OpenJDK, Knox, Slider; free IOP Quick Start Edition for non-production software:

26 Microsoft HDInsight: a fully-managed cloud service for easy, fast, and cost-effective processing of massive amounts of data; uses popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R & more; enables a broad range of scenarios such as ETL, Data Warehousing, Machine Learning, IoT and more

27 Intel Distribution for Apache Hadoop n-for-apache-hadoop-software-solutions.html a distribution of Hadoop with Intel's GraphBuilder and Analytics toolkit; 90-day trial version

28 Datastax Enterprise Analytics: a Big Data analytics platform based on the Apache Cassandra database management system, which runs on top of an Apache Hadoop installation; includes a proprietary solution for security management, data search, data monitoring and visualization; comes with a powerful integrated analytics system; multiple models supported: key-value, tabular, JSON/document and graph data formats; real-time processing possible

29 Teradata's Hadoop for Enterprise: pre-configured hardware, software and services to accelerate time to Hadoop production; deep integration of tools and services in the Hadoop ecosystem, specifically in the areas of data access, data movement, manageability, supportability and serviceability; extends the enterprise-ready Hadoop ecosystem with advanced professional services

30 Pivotal HD: an enterprise-capable, commercially supported distribution of Apache Hadoop 2.0; packages targeted to traditional Hadoop deployments; patches assuring the interoperability of the components; the advantages of big data analytics without the overhead and complexity of a project built from scratch; automatic parallelization of MapReduce jobs to handle data at scale, thereby eliminating the need for developers to write scalable and parallel algorithms; Pivotal HD Single Node VM available for free

31 Open source platforms and tools: MapReduce; GridGain; HPCC Systems; Apache Spark; Apache Storm; SAMOA

32 MapReduce: a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of: a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name); a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance. The model is a specialization of the split-apply-combine strategy for data analysis, inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms.
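The student example from this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps only, not Hadoop's API; the function names and the sample student list are made up for the example, and a real framework would run the map and reduce tasks on different machines.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit one (first_name, 1) pair per student record."""
    for first_name, _last_name in students:
        yield first_name, 1


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key (one 'queue' per first name)."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_phase(queues):
    """Reduce: summarize each queue; here, count the students in it."""
    return {name: sum(values) for name, values in queues.items()}


# Hypothetical input records: (first name, last name).
students = [
    ("Anna", "Nowak"),
    ("Jan", "Kowalski"),
    ("Anna", "Lis"),
    ("Piotr", "Mazur"),
    ("Jan", "Wilk"),
]

name_counts = reduce_phase(shuffle_phase(map_phase(students)))
print(name_counts)  # {'Anna': 2, 'Jan': 2, 'Piotr': 1}
```

Because each queue is reduced independently of the others, the reduce step can be spread over many workers, which is where the scalability of the model comes from.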

33 MapReduce: the key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 MPI reduce and scatter operations), but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine. A single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations. Optimizing the communication cost is essential to a good MapReduce algorithm. MapReduce libraries have been written in many programming languages, with different levels of optimization; a popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized.

34 GridGain: an in-memory computing platform built on Apache Ignite; can function as an in-memory data grid or be deployed as an in-memory transactional SQL database; combines the speed of in-memory computing with the durability of disk-based storage; used in financial services, fintech, software, healthcare, telecom, ecommerce, online services, retail, and more; free 30-day trial available

35 HPCC Systems: an alternative to Hadoop and other Big Data platforms; an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions; incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data; includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie); includes a data-centric declarative programming language for parallel data processing called ECL; virtual image with a pre-configured HPCC available for download

36 HPCC Systems

37 Apache Spark: an open-source cluster-computing framework; an interface for programming entire clusters with implicit data parallelism and fault tolerance; runs on Hadoop, Mesos, standalone, or in the cloud; can access diverse data sources including HDFS, Cassandra, HBase, and S3. Originally developed at the University of California, Berkeley's AMPLab; the codebase was later donated to the Apache Software Foundation, which has maintained it since. Application programming interfaces for Java, Python, Scala, and R; DataFrames with support for structured and semi-structured data; Spark SQL: a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, or Python
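"Implicit data parallelism" means the programmer describes what to do with the data and the framework decides how to split the work. The sketch below imitates that idea on one machine with Python's standard thread pool; it deliberately does not use the Spark API, and the function names and the partition count are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous partitions."""
    if not data:
        return []
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partition_sum_of_squares(partition):
    """Work done independently on one partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    """Process every partition in a worker, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(partition_sum_of_squares, partitions))
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)  # 338350
```

The caller never mentions partitions or workers beyond an optional hint, which is the property the slide refers to; a cluster framework applies the same pattern across machines and adds recovery from failed workers.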

38 Apache Spark Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk

39 Apache Storm: a free and open source distributed realtime computation system for processing unbounded streams of data; it is doing for realtime processing what Hadoop did for batch processing. Simple, can be used with any programming language. Many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL. Fast: a benchmark clocked it at over a million tuples processed per second per node. Scalable and fault-tolerant; easy to set up and operate; integrates with the queueing and database technologies you already use
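Processing an unbounded stream means handling each tuple as it arrives instead of waiting for a complete data set. Storm organizes such processing into sources ("spouts") and processing steps ("bolts"); the sketch below mimics that structure with plain Python generators. It is not Storm code: the function names and the two sample sentences are hypothetical, and everything runs in one process.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emit tuples from a source that could, in principle, never end."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: split each incoming sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Bolt: keep running counts and emit an update for every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Each stage pulls one tuple at a time, so results appear before the input ends.
updates = list(count_bolt(split_bolt(sentence_spout(["big data", "big streams"]))))
print(updates)  # [('big', 1), ('data', 1), ('big', 2), ('streams', 1)]
```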

40 SAMOA (Scalable Advanced Massive Online Analysis): an open source platform for mining big data streams; a collection of distributed streaming algorithms for the most common data mining and machine learning tasks (classification, clustering, regression); programming abstractions to develop new algorithms; a pluggable architecture that allows it to run on different distributed stream processing engines (Storm, S4, Samza); written in Java
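A streaming classifier has to learn from one example at a time and keep only a small summary of what it has seen. The sketch below shows that idea with an incremental nearest-centroid classifier in plain Python. It is not one of SAMOA's algorithms and does not use its (Java) API; the class name, method names and sample stream are hypothetical.

```python
class OnlineCentroidClassifier:
    """Nearest-centroid classifier updated one example at a time."""

    def __init__(self):
        self.centroids = {}  # label -> running mean of the feature vectors
        self.counts = {}     # label -> number of examples seen so far

    def learn_one(self, features, label):
        """Fold one labelled example into the running mean of its class."""
        n = self.counts.get(label, 0) + 1
        old = self.centroids.get(label, [0.0] * len(features))
        # Incremental mean: new = old + (x - old) / n, no past data is stored.
        self.centroids[label] = [c + (x - c) / n for c, x in zip(old, features)]
        self.counts[label] = n

    def predict_one(self, features):
        """Return the label whose centroid is closest, or None if untrained."""
        if not self.centroids:
            return None

        def squared_distance(label):
            return sum((x - c) ** 2 for x, c in zip(features, self.centroids[label]))

        return min(self.centroids, key=squared_distance)


# Hypothetical stream of (features, label) pairs, consumed in arrival order.
stream = [((1.0, 1.0), "a"), ((5.0, 5.0), "b"), ((1.2, 0.8), "a"), ((4.8, 5.2), "b")]

model = OnlineCentroidClassifier()
for features, label in stream:
    model.learn_one(features, label)

print(model.predict_one((1.1, 1.0)))  # a
print(model.predict_one((5.1, 4.9)))  # b
```

The memory used depends on the number of classes and features, not on the length of the stream, which is the property that makes this style of algorithm suitable for data that never stops arriving.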
