Seoul Elasticsearch Community Meetup


1 HiPIC Data Collection and Visualization using Big Data: President Election 2017 in Korea
  Seoul Elasticsearch Community Meetup, Gangnam, Korea, Aug
  PhD, High-Performance Information Computing Center (HiPIC), California State University Los Angeles

2 Contents
  Myself
  Introduction to Big Data
  Architecture
  Demo

3 Myself
  Experience:
    Since 2002: Professor at California State University Los Angeles
    PhD in 2001: Computer Science and Engineering at USC
    Since Jan 2016: Co-Founder of The Big Link LLC and Wiken
    Since 1998: R&D consulting in Hollywood
      Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc.
      Information search and integration with FAST, Lucene/Solr, Sphinx
      Implemented e-business applications using J2EE and middleware
    Since 2007: Exposed to Big Data at CitySearch.com
    Present: Big Data Academic Partnerships
      For Big Data research and training
      Amazon AWS, Microsoft Azure, IBM Bluemix
      Databricks, Hadoop vendors

4 Myself
  Experience (Cont'd):
    Bringing Big Data R&D and training to Korea since 2009
    Collaborating with LA city since 2016
      Collect, search, and analyze city data
      Spark, Hadoop, ElasticSearch, Solr, Java, Cloudera
    Sept 2013: Samsung Advanced Technology Training Institute
    Since 2008: Introducing Hadoop Big Data and education to universities and research centers
      Yonsei, Gachon, DongEui
      US: USC, Pennsylvania State Univ, University of Maryland College Park, Univ of Bridgeport, Louisiana State Univ, California State Univ LB
      Europe: Univ of Luxembourg

5 Experience in Big Data
  Collaboration
    Council Member of IBM Spark Technology Center
    City of Los Angeles for OpenHub and Open Data
    Startup companies in Los Angeles
    External Collaborator and Advisor in Big Data
      IMSC of USC
      Pennsylvania State University
    The Big Link, Softzen, Wiken in Korea
  Grants and Awards
    Faculty Scholarship Winner of Teradata University Network 2017
    IBM Bluemix, Microsoft Windows Azure, Amazon AWS in Research and Education Grant
  Partnership
    Academic Education Partnership with Databricks, Tableau, Qlik, Cloudera, Hortonworks, SAS, Teradata

6 Contents
  Myself
  Introduction to Big Data
  Architecture
  Demo

7 Two Cores in Big Data
  How to store Big Data
  How to compute Big Data
  Google
    How to store Big Data: GFS
      Distributed systems on inexpensive commodity computers
    How to compute Big Data: MapReduce
      Parallel computing with inexpensive computers
    Own supercomputers
    Published papers in 2003, 2004
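The MapReduce idea named on this slide can be illustrated with a word count. The following is only a single-process sketch of the programming model, not Google's or Hadoop's implementation: the function names are made up for illustration, and a real framework would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big compute"]))
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across inexpensive machines without the functions knowing about each other.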

8 Definition: Big Data
  Inexpensive frameworks that are distributed parallel systems and that can store large-scale data and process it in parallel [1, 2]
    Hadoop and Spark
  Inexpensive supercomputer
    More accessible than traditional supercomputers
    You can store and process your applications in your university labs, small companies, research centers
  Others
    Cloud computing Big Data services: Amazon AWS, IBM Bluemix, Microsoft Azure
    NoSQL DB (Cassandra, MongoDB, Redis, HBase)
    ElasticSearch

9 Spark
  In-memory data computing
    Faster than Hadoop MapReduce
  Can integrate with Hadoop and its ecosystems
    HDFS, Amazon S3, HBase, Hive, Sequence files, Cassandra, ArcGIS, Couchbase
  New programming with faster data sharing
    Good for iterative graph algorithms, machine learning
    Interactive query

10 ElasticSearch
  Full-text search and visualization server
    Getting more popular than Solr
  ElasticSearch, Kibana, ES-Hadoop, Logstash
  Based on the Apache Lucene library
  Horizontally scalable

11 ElasticSearch
  Elastic Stack
    100% open source
    No enterprise edition
    All new versions with 5.0

12 ElasticSearch
  ES-Hadoop
    Elasticsearch for Hadoop
    Exchange data between Hadoop HDFS and ElasticSearch

13 Contents
  Myself
  Introduction to Big Data
  Architecture
  Demo

14 Big Data Analysis Flow
  Data Collection
    Batch API: Yelp, Google
    Streaming: Twitter, Apache NiFi, Kafka, Storm
    Open Data: Government
  Data Storage
    HDFS, S3, Object Storage, NoSQL DB (Couchbase)
  Data Filtering
    Hive, Pig
  Data Analysis and Science
    Hive, Pig, Spark, BI Tools (Datameer, Qlik, Tableau, ...)
  Data Visualization
    Qlik, Datameer, Excel PowerView

15 Data Engineering
  Data Source
    Twitter streaming API using the keywords
      "문재인", "moonriver365", "안철수", "cheolsoo0919", "유승민", "yooseongmin2017", "홍준표", "HongSkyangel808", "심상정", "sangjungsim"
      (candidate names Moon Jae-in, Ahn Cheol-soo, Yoo Seong-min, Hong Jun-pyo, and Sim Sang-jung, each followed by a Twitter handle)
    Roughly April to May
  Data Collection
    Apache NiFi for streaming data
      Supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic
  Data Storage
    ElasticSearch
    Hadoop HDFS at Azure
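To make the keyword-based collection concrete, here is a small sketch that checks which of the tracked keywords a tweet text mentions and tallies mentions over a batch. It is a simplification under stated assumptions: plain case-insensitive substring matching stands in for the Twitter streaming API's own matching rules, and the function names are hypothetical.

```python
from typing import Dict, List

# Candidate names and account handles, as listed on the slide.
KEYWORDS: List[str] = [
    "문재인", "moonriver365",
    "안철수", "cheolsoo0919",
    "유승민", "yooseongmin2017",
    "홍준표", "HongSkyangel808",
    "심상정", "sangjungsim",
]


def matched_keywords(text: str, keywords: List[str] = KEYWORDS) -> List[str]:
    """Return the tracked keywords that occur in the text, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def count_mentions(texts: List[str]) -> Dict[str, int]:
    """Count, per keyword, how many texts mention it at least once."""
    counts = {kw: 0 for kw in KEYWORDS}
    for text in texts:
        for kw in matched_keywords(text):
            counts[kw] += 1
    return counts
```

Substring matching is used because Korean tweet text often attaches particles directly to a name, so a whitespace tokenizer would miss many mentions.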

16 Data Engineering (Cont'd)
  Data Analysis and Prediction: in the future
    Spark ML, Spark SQL, Hadoop Hive
  Data Visualization
    Kibana in ElasticSearch

17 Apache NiFi
  NiFi-1.1.2: GetTwitter, PutElasticsearch5, PutHDFS

18 Hadoop Spark Cluster: HDInsight in Azure
  [Table of cluster nodes with columns vCores, Memory (GB), Local SSD (GB)]

19 ElasticSearch in HDInsight
  Did not launch the ElasticSearch service in Azure
  Instead, installed ES5 on the Linux head node of the HDInsight cluster
  ElasticSearch Kibana 5.3.2

20 Mapping to ES
  Temporal-Spatial Analysis
  For matching the Twitter date format to ES

    curl -XPUT localhost:9200/_template/elect17 -d '
    {
      "template" : "elect17*",
      "settings" : { "number_of_shards" : 1 },
      "mappings" : {
        "default" : {
          "properties" : {
            "created_at" : { "type" : "date", "format" : "EEE MMM dd HH:mm:ss Z YYYY" },

21 Mapping to ES (Cont'd)

            "coordinates" : {
              "properties" : {
                "coordinates" : { "type" : "geo_point" },
                "type" : { "type" : "string" }
              }
            },
            "user" : {
              "properties" : {
                "screen_name" : { "type" : "string", "index" : "not_analyzed" },

22 Mapping to ES (Cont'd)

                "lang" : { "type" : "string", "index" : "not_analyzed" }
              }
            }
          }
        }
      }
    }'
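The date pattern in the template corresponds to the way Twitter writes created_at. As a sanity check outside Elasticsearch, the same format can be parsed in Python; the sketch below also shows the [longitude, latitude] order that tweet coordinates use. The sample values and function names are made up for illustration, and %a/%b assume an English locale.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Python equivalent of the pattern "EEE MMM dd HH:mm:ss Z YYYY".
# %a and %b assume an English (C) locale for day and month names.
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value: str) -> datetime:
    """Parse a Twitter created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_DATE_FORMAT)


def to_geo_point(tweet: Dict) -> Optional[Dict[str, float]]:
    """Return the tweet's point as {"lat", "lon"}, or None without coordinates.

    Twitter stores the point in GeoJSON order: [longitude, latitude].
    """
    geo = tweet.get("coordinates")
    if not geo:
        return None
    lon, lat = geo["coordinates"]
    return {"lat": lat, "lon": lon}


if __name__ == "__main__":
    # Illustrative values only, not taken from the collected data.
    print(parse_created_at("Sun Apr 23 22:59:00 +0000 2017").isoformat())
    print(to_geo_point({"coordinates": {"type": "Point",
                                        "coordinates": [126.978, 37.5665]}}))
```

Elasticsearch reads a geo_point given as an array in the same [lon, lat] order, so the raw tweet array can be indexed unchanged; the object form with explicit lat and lon keys is shown only because it is harder to misread.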

23 K-Election 2017 (April 29 to May 9)

24 K-Election 2017 (April 29 to May 9)
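The Kibana charts from these two slides are not reproduced in this transcript. A daily tweet-count histogram of the kind Kibana draws can be requested from Elasticsearch 5.x with a date_histogram aggregation. The sketch below only builds the request body; the aggregation name tweets_per_day and the function name are hypothetical, while the created_at field and the elect17* index pattern come from the template on the mapping slides.

```python
import json
from typing import Any, Dict


def daily_tweet_histogram(start: str, end: str) -> Dict[str, Any]:
    """Build a search body counting tweets per day between start and end.

    start and end are "yyyy-MM-dd" strings; the range query's "format"
    parameter tells Elasticsearch how to read them, independently of the
    format used by the created_at field itself.
    """
    return {
        "size": 0,  # return only aggregation buckets, no individual tweets
        "query": {
            "range": {
                "created_at": {"gte": start, "lte": end, "format": "yyyy-MM-dd"}
            }
        },
        "aggs": {
            "tweets_per_day": {
                "date_histogram": {"field": "created_at", "interval": "day"}
            }
        },
    }


if __name__ == "__main__":
    # Body for a request such as: POST localhost:9200/elect17*/_search
    print(json.dumps(daily_tweet_histogram("2017-04-29", "2017-05-09"), indent=2))
```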

25 ES-Hadoop
  Install ES-Hadoop

    $ wget -P /tmp
    $ unzip /tmp/elasticsearch-hadoop zip -d /tmp
    $ cp /tmp/elasticsearch-hadoop-5.3.1/dist/elasticsearch-hadoop jar /tmp/elasticsearch-hadoop jar
    $ hdfs dfs -copyFromLocal /tmp/elasticsearch-hadoop /dist/elasticsearch-hadoop jar /tmp
    $ sudo cp elasticsearch-spark-20_ jar /usr/hdp/current/spark2-client/

26 ES-Hadoop (Cont'd)
  Add the ES-Hadoop library to Hive with one of the following:

    $ hive
    hive> add jar hdfs:///tmp/elasticsearch-hadoop jar
    hive> add jar /tmp/elasticsearch-hadoop jar
    hive> add jar file:///tmp/elasticsearch-hadoop jar
    hive> list jar;
    file:///tmp/elasticsearch-hadoop jar

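27 ES-Hadoop (Cont'd)

    hive> select * from elect17_test LIMIT 10;
    OK
    NULL NULL NULL NULL 이정도는우리문재인후보님이절대말씀하시지않겠지. " 넌내가유신반대투쟁하고민주화운동할때친구들이랑고대앞하숙방에모여서 xx 모의했냐?" Sun Apr 23 22:59:
    NULL NULL NULL NULL 존경하는시흥시민여러분!

  Tweet text in English:
    Row 1: Our candidate Moon Jae-in would surely never say something like this. "While I was fighting the Yushin regime and taking part in the democratization movement, were you gathering with friends in a boarding room in front of Korea University to plot xx?"
    Row 2: Dear citizens of Siheung!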
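The slides do not show how elect17_test was defined. With ES-Hadoop, a Hive table over an Elasticsearch index is typically declared as an external table stored by EsStorageHandler. The sketch below only generates such a statement as text. Everything specific in it is an assumption made for illustration: the column list, the elect17/default index and type, the node address, and the helper name.

```python
from typing import Dict


def es_hive_table_ddl(table: str, columns: Dict[str, str], resource: str,
                      nodes: str = "localhost:9200") -> str:
    """Render a Hive DDL statement for an Elasticsearch-backed external table.

    columns maps Hive column names to Hive types; resource is the
    Elasticsearch "index/type" that the table reads from.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'\n"
        f"TBLPROPERTIES('es.resource' = '{resource}', 'es.nodes' = '{nodes}');"
    )


if __name__ == "__main__":
    # Hypothetical columns; the real table definition is not in the slides.
    print(es_hive_table_ddl(
        "elect17_test",
        {"text": "STRING", "created_at": "STRING"},
        "elect17/default",
    ))
```

NULL columns like those in the query output can appear when Hive column names do not line up with the Elasticsearch field names, so the column list is the first thing worth checking in a definition like this.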

28 Contents
  Myself
  Introduction to Big Data
  Architecture
  Demo

29 Demo
  Azure Portal
  Ubuntu VM
  ElasticSearch
  NiFi
  Kibana: April 29 to May 10
  Hive with ES-Hadoop
    Test with the data from April 23 to April 24

30 Spark Big Data Training and R&D
  HiPIC, California State University Los Angeles
  Supported by
    Databricks and its cloud computing services
    Amazon AWS, IBM Bluemix, MS Azure
    Hortonworks, Cloudera
    Teradata
    ElasticSearch
    Qlik, Tableau

31 Databricks Partners

32 Training Hadoop and Spark
  Cloudera visits to interview

33 Training Hadoop on IBM Bluemix at California State Univ. Los Angeles

34 Conclusion
  K-Elect 2017 in ES5 and HDInsight
  ES5
    Easy to collect and visualize
  HDInsight
    Data analysis and predictive analysis possible

35 Question?

36 References
  1. Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing, and Yuhang Xu, The 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011), Las Vegas (July 18-21, 2011)
  2. DMKD-00150, Market Basket Analysis Algorithms with MapReduce, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Oct, Volume 3, Issue 6
  3. Big Data Trend and Open Data, UKC 2016, Dallas, TX, Aug

37 References (Cont'd)
  4. Business Data Analysis LA at Databricks, HiPIC of, Jongwook Woo HiPIC of California State University Los Angeles
  6. Hadoop
  7. Databricks
  8. DS320: DataStax Enterprise Analytics with Spark
  9. Cloudera
  10. Hortonworks
