Seoul Elasticsearch Community Meetup


Data Collection and Visualization using Big Data: President Election 2017 in Korea
Seoul Elasticsearch Community Meetup, Gangnam, Korea, Aug 10 2017
Jongwook Woo, PhD, jwoo5@calstatela.edu
High-Performance Information Computing Center (HiPIC), California State University Los Angeles

Contents
- Myself
- Introduction to Big Data
- Architecture
- Demo

Myself
Experience:
- Since 2002: Professor at California State University Los Angeles
  - PhD in 2001: Computer Science and Engineering at USC
- Since Jan 2016: Co-Founder of The Big Link LLC and Wiken
- Since 1998: R&D consulting in Hollywood
  - Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others
  - Information search and integration with FAST, Lucene/Solr, Sphinx
  - Implements e-business applications using J2EE and middleware
- Since 2007: Exposed to Big Data at CitySearch.com
- 2012 - Present: Big Data academic partnerships for Big Data research and training
  - Amazon AWS, Microsoft Azure, IBM Bluemix
  - Databricks, Hadoop vendors

Myself
Experience (Cont'd):
- Bringing Big Data R&D and training to Korea since 2009
- Collaborating with LA city since 2016
  - Collect, search, and analyze city data
  - Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
- Sept 2013: Samsung Advanced Technology Training Institute
- Since 2008: Introducing Hadoop Big Data and education to universities and research centers
  - Yonsei, Gachon, DongEui
  - US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
  - Europe: Univ of Luxembourg

Experience in Big Data
Collaboration
- Council Member of IBM Spark Technology Center
- City of Los Angeles for OpenHub and Open Data
- Startup companies in Los Angeles
- External collaborator and advisor in Big Data
  - IMSC of USC
  - Pennsylvania State University
  - The Big Link, Softzen, Wiken in Korea
Grants and Awards
- Faculty Scholarship Winner of Teradata University Network 2017
- IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
Partnership
- Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

Two Cores in Big Data
- How to store Big Data
- How to compute Big Data
Google
- How to store Big Data: GFS
  - Distributed systems on inexpensive commodity computers
- How to compute Big Data: MapReduce
  - Parallel computing with inexpensive computers
- Own supercomputers
- Published papers in 2003, 2004
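To make the MapReduce idea concrete, the following is a minimal single-process sketch in Python of the map, shuffle, and reduce phases for a word count. It is an illustration only, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for this example, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only to show the grouping by key.
    sample = ["big data big compute", "data in parallel"]
    print(word_count(sample))
```

Each map call depends only on its own line and each reduce call only on its own key, which is what lets a framework spread the work over many inexpensive machines.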

Definition: Big Data
- Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
  - Hadoop and Spark
- Inexpensive supercomputer
  - More accessible to the public than traditional supercomputers
  - You can store data and run your applications in your university labs, small companies, research centers
- Others
  - Cloud computing Big Data services: Amazon AWS, IBM Bluemix, Microsoft Azure
  - NoSQL DB (Cassandra, MongoDB, Redis, HBase)
  - ElasticSearch

Spark
- In-memory data computing
  - Faster than Hadoop MapReduce
- Can integrate with Hadoop and its ecosystems
  - HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase
- New programming model with faster data sharing
- Good for
  - Iterative graph algorithms, machine learning
  - Interactive queries

ElasticSearch
- Full-text search and visualization server
  - Getting more popular than Solr
- ElasticSearch, Kibana, ES-Hadoop, Logstash
- Based on the Apache Lucene library
- Horizontally scalable

ElasticSearch
Elastic Stack
- 100% open source
- No enterprise edition
- All components have new versions starting with 5.0

ElasticSearch
ES-Hadoop
- Elasticsearch for Hadoop
- Exchanges data between Hadoop HDFS and ElasticSearch

Big Data Analysis Flow
- Data Collection
  - Batch API: Yelp, Google
  - Streaming: Twitter, Apache NiFi, Kafka, Storm
  - Open Data: Government
- Data Storage
  - HDFS, S3, Object Storage, NoSQL DB (Couchbase)
- Data Filtering
  - Hive, Pig
- Data Analysis and Science
  - Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau, ...)
- Data Visualization
  - Qlik, Datameer, Excel PowerView

Data Engineering
- Data Source
  - Twitter streaming API using the keywords "문재인" (Moon Jae-in), "moonriver365", "안철수" (Ahn Cheol-soo), "cheolsoo0919", "유승민" (Yoo Seong-min), "yooseongmin2017", "홍준표" (Hong Joon-pyo), "HongSkyangel808", "심상정" (Sim Sang-jung), "sangjungsim"
  - Roughly: April 28 2017 - May 11 2017
- Data Collection
  - Apache NiFi for streaming data
  - Supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic
- Data Storage
  - ElasticSearch
  - Hadoop HDFS at Azure

Data Engineering (Cont'd)
- Data Analysis and Prediction: in the future
  - Spark ML, Spark SQL, Hadoop Hive
- Data Visualization
  - Kibana in ElasticSearch

Apache NiFi
- NiFi 1.1.2: GetTwitter, PutElasticsearch5, PutHDFS
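The flow on this slide has one source feeding two destinations: tweets that match the tracked keywords go to both ElasticSearch and HDFS. The sketch below imitates that fan-out in plain Python. It does not use NiFi or any NiFi API: the names matches and run_flow, the in-memory lists standing in for the two stores, and the sample tweets are all hypothetical. The substring match is also a simplification of how the Twitter streaming API matches keywords.

```python
import json

# Subset of the tracking keywords listed in the deck.
KEYWORDS = ["문재인", "moonriver365", "안철수", "cheolsoo0919"]


def matches(tweet, keywords=KEYWORDS):
    """Return True if the tweet text contains any keyword (case-insensitive substring match)."""
    text = tweet.get("text", "").lower()
    return any(keyword.lower() in text for keyword in keywords)


def run_flow(tweets, sinks):
    """Deliver each matching tweet to every sink and return how many tweets were routed."""
    routed = 0
    for tweet in tweets:
        if not matches(tweet):
            continue
        for sink in sinks:
            sink(tweet)
        routed += 1
    return routed


# Hypothetical in-memory stand-ins for the ElasticSearch index and the HDFS file.
es_docs = []
hdfs_lines = []

# Hypothetical sample tweets, not real collected data.
sample_tweets = [
    {"id": 1, "text": "RT @moonriver365: hello"},
    {"id": 2, "text": "unrelated tweet"},
    {"id": 3, "text": "안철수 후보 연설"},
]

count = run_flow(
    sample_tweets,
    [es_docs.append, lambda t: hdfs_lines.append(json.dumps(t, ensure_ascii=False))],
)
print(count, len(es_docs), len(hdfs_lines))
```

Writing the same record to a search index and to a file store keeps the raw data available for later batch analysis while the index serves interactive dashboards.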

Hadoop Spark Cluster: HDInsight in Azure
vcores   Memory (GB)   Local SSD (GB)
4        28            200

ElasticSearch in HDInsight
- Did not launch the ElasticSearch service in Azure
- Instead, installed ES5 on the Linux head node of the HDInsight cluster
  - ElasticSearch 5.3.1
  - Kibana 5.3.2

Mapping to ES
- Temporal-spatial analysis
- For matching the Twitter date format to ES:

curl -XPUT localhost:9200/_template/elect17 -d '
{
  "template" : "elect17*",
  "settings" : { "number_of_shards" : 1 },
  "mappings" : {
    "default" : {
      "properties" : {
        "created_at" : {
          "type" : "date",
          "format" : "EEE MMM dd HH:mm:ss Z YYYY"
        },
        "coordinates" : {
          "properties" : {
            "coordinates" : { "type" : "geo_point" },
            "type" : { "type" : "string" }
          }
        },
        "user" : {
          "properties" : {
            "screen_name" : { "type" : "string", "index" : "not_analyzed" },
            "lang" : { "type" : "string", "index" : "not_analyzed" }
          }
        }
      }
    }
  }
}'
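The custom date format in the template is needed because Twitter writes created_at as text such as "Sun Apr 23 22:59:59 +0000 2017" rather than ISO 8601. As a small illustration, the Python sketch below parses that string with an equivalent strptime pattern and shows the coordinate order that the geo_point field receives. The helper names parse_created_at and to_geo_point and the sample coordinates are assumptions for this example, and the strptime pattern is a hand-written counterpart of the template's format, not something taken from the deck.

```python
from datetime import datetime

# Python counterpart of the template's "EEE MMM dd HH:mm:ss Z YYYY" pattern.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a Twitter created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_FORMAT)


def to_geo_point(coordinates):
    """Turn a Twitter [longitude, latitude] pair into an explicit lat/lon object."""
    lon, lat = coordinates
    return {"lat": lat, "lon": lon}


if __name__ == "__main__":
    parsed = parse_created_at("Sun Apr 23 22:59:59 +0000 2017")
    print(parsed.isoformat())  # 2017-04-23T22:59:59+00:00

    # Hypothetical point near Seoul, given in Twitter's longitude-first order.
    print(to_geo_point([126.978, 37.5665]))
```

Twitter's coordinates.coordinates array is longitude first, which is also the order Elasticsearch expects when a geo_point is supplied as an array, so the field can be indexed without reordering.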

K-Election 2017 (April 29 - May 9)

ES-Hadoop
Install ES-Hadoop:
# Download and unpack the ES-Hadoop 5.3.1 distribution
$ wget -P /tmp http://download.elastic.co/hadoop/elasticsearch-hadoop-5.3.1.zip
$ unzip /tmp/elasticsearch-hadoop-5.3.1.zip -d /tmp
# Copy the jar to the local /tmp directory and to /tmp on HDFS
$ cp /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp/elasticsearch-hadoop-5.3.1.jar
$ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop-5.3.1.jar /tmp
# Make the Spark connector jar available to the Spark 2 client
$ sudo cp elasticsearch-spark-20_2.11-5.3.1.jar /usr/hdp/current/spark2-client/

ES-Hadoop (Cont'd)
Add the ES-Hadoop library to Hive with one of the following:
$ hive
hive> add jar hdfs:///tmp/elasticsearch-hadoop-5.3.1.jar;
hive> add jar /tmp/elasticsearch-hadoop-5.3.1.jar;
hive> add jar file:///tmp/elasticsearch-hadoop-5.3.1.jar;
hive> list jar;
file:///tmp/elasticsearch-hadoop-5.3.1.jar

ES-Hadoop (Cont'd)
hive> select * from elect17_test LIMIT 10;
OK
856281525070909440 NULL NULL NULL NULL RT @sydbris: 이정도는우리문재인후보님이절대말씀하시지않겠지. " 넌내가유신반대투쟁하고민주화운동할때친구들이랑고대앞하숙방에모여서 xx 모의했냐?" Sun Apr 23 22:59:59 +0000 2017
856281524995407872 NULL NULL NULL NULL RT @choomiae: 존경하는시흥시민여러분!

Demo
- Azure Portal
  - Ubuntu VM
- ElasticSearch
- NiFi
- Kibana: April 29 - May 10
- Hive with ES-Hadoop
  - Test with the data from April 23 - April 24

Spark Big Data Training and R&D
HiPIC, California State University Los Angeles
Supported by
- Databricks and its cloud computing services
- Amazon AWS, IBM Bluemix, MS Azure
- Hortonworks, Cloudera
- Teradata
- ElasticSearch
- Qlik, Tableau

Databricks Partners

Training Hadoop and Spark Cloudera visits to interview

Training Hadoop on IBM Bluemix at California State Univ. Los Angeles

Conclusion
- K-Election 2017 in ES5 and HDInsight
- ES5
  - Easy to collect and visualize
- HDInsight
  - Data and predictive analysis possible

Questions?

References
1. Jongwook Woo and Yuhang Xu, Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
2. Jongwook Woo, DMKD-00150, Market Basket Analysis Algorithms with MapReduce, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct 28 2013, Volume 3, Issue 6, pp. 445-452, ISSN 1942-4795
3. Jongwook Woo, Big Data Trend and Open Data, UKC 2016, Dallas, TX, Aug 12 2016
4. Business Data Analysis LA at Databricks, HiPIC, Jongwook Woo, https://docs.databricks.com/spark/latest/training/cal-state-labiz-data-la.html
5. https://github.com/hipic/spark_mba, HiPIC of California State University Los Angeles
6. Hadoop, http://hadoop.apache.org
7. Databricks, http://www.databricks.com
8. DS320: DataStax Enterprise Analytics with Spark
9. Cloudera, http://www.cloudera.com
10. Hortonworks, http://www.hortonworks.com