New Challenges in Big Data: Technical Perspectives. Hwanjo Yu POSTECH

1 New Challenges in Big Data: Technical Perspectives Hwanjo Yu POSTECH

2 Over 1 Billion SNS users!!

3 Viral Marketing Word-of-Mouth Effect > TV advertising

4 Influence Maximization in Social Networks Find k people that maximize influence spread!

5 IPA: Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Network [ICDE 2013, best poster award]
1. 10x faster than PMIA (the state-of-the-art algorithm).
2. Uses much less memory than PMIA: IPA successfully produces results on graphs of millions of nodes using 4GB of memory, where PMIA fails with 24GB.
3. Accurately approximates influence spread: IPA's accuracy is close to that of Greedy solutions with 20k MC simulations and is higher than that of PMIA overall.
4. Can be applied to all IC-based models; PMIA cannot be applied to the CT-IC model.
5. Easily parallelized: parallel IPA speeds up as the number of CPU cores increases, and achieves more speed-up on larger data sets.
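For orientation, the Greedy baseline with Monte Carlo (MC) simulation that IPA is compared against can be sketched as follows. This is not IPA or PMIA; it is a minimal illustration of greedy seed selection under the Independent Cascade (IC) model, and the function names, the uniform activation probability `p`, and the adjacency-dict graph format are all assumptions made for this example.

```python
import random

def simulate_ic(graph, seeds, p, rng):
    """One Independent Cascade run; returns the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets one chance to activate each neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)

def estimate_spread(graph, seeds, p, runs, rng):
    """Average spread over `runs` Monte Carlo simulations."""
    return sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs)) / runs

def greedy_seeds(graph, k, p=0.1, runs=200, seed=0):
    """Pick k seeds, each time adding the node with the best estimated spread."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for cand in sorted(nodes - set(chosen)):
            spread = estimate_spread(graph, chosen + [cand], p, runs, rng)
            if spread > best_spread:
                best, best_spread = cand, spread
        chosen.append(best)
    return chosen

# Hypothetical toy graph: node 0 points to five followers.
star = {0: [1, 2, 3, 4, 5]}
print(greedy_seeds(star, k=1, p=1.0, runs=5))
```

Every candidate is re-simulated in every round, which is why this baseline is costly on large graphs and why faster approximations are of interest.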

6 Scaling up to Billion-Node Networks using MapReduce? Very hard! That something is easily parallelized does NOT mean it can be easily map-reduced. Big data processing vs. parallel data processing: how do they differ?

7 Big Data Analysis System
Enterprise DBMS (structured data): App; SQL; RDBMS, DW
Big data stack (structured + unstructured data): App; Hive, Pig, R; Hadoop, HBase; HDFS, Swift; scale-out cluster
Data volume, variety, velocity increase => storage (DAS, NAS, SAN) cost increases, analysis is hard (unstructured >> structured)

8 Big Data Analysis System (structured + unstructured data), by layer
  App
  High-level language: Hive, Pig, R
  DB or data access: Hadoop, HBase
  Distributed file system: HDFS, Swift
  Storage: scale-out cluster

9 Storage Trend
Centralized storage (SAN, NAS): storage network, RAID; proprietary, highly reliable HW => scale-up: expensive => fast data transfer
Distributed storage servers: network, distributed file system; commodity HW => scale-out: inexpensive => slow data transfer => need a new programming model!
Big data => need scalability

10 Trend Evolution
Data trend (Big Data)
Storage trend (distributed): inexpensive scale-out, but expensive data transfer!
Need a new programming model to minimize data transfer: move operations instead of data!
MapReduce by Google
Hadoop and many subprojects

11 MapReduce
Principles
  Run operations on the data nodes: move operations to the data.
  Minimize data transfer.
Design tips
  Lower the work of reduce: use combine if possible; compressing the map output helps decrease network overhead.
  Minimize iterations and broadcasting, so that information sharing is minimized.
  Use bulk reading: too many invocations of map may incur too many function calls.
  A straightforward extension of the parallel IPA algorithm produces too many iterations and heavy data transfer from map to reduce.
  Design the algorithm to have enough reduce functions: having only a single reduce will not speed it up.
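The "use combine if possible" tip can be illustrated with a small in-process simulation. This is a sketch in plain Python, not Hadoop API code; the function names and the word-count job are assumptions chosen for the example. It counts how many key-value pairs would cross from map to reduce with and without a map-side combiner.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]

def combine(pairs):
    """Combine: pre-aggregate on the map side to shrink the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(pairs, num_reducers):
    """Partition pairs by key; one {key: [values]} dict per reducer."""
    parts = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        parts[hash(key) % num_reducers][key].append(value)
    return parts

def reduce_phase(partition):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in partition.items()}

def run_job(splits, num_reducers=2, use_combiner=True):
    """Return (word counts, number of pairs sent from map to reduce)."""
    transferred = []
    for split in splits:
        pairs = map_phase(split)
        if use_combiner:
            pairs = combine(pairs)
        transferred.extend(pairs)
    result = {}
    for partition in shuffle(transferred, num_reducers):
        result.update(reduce_phase(partition))
    return result, len(transferred)

splits = [["a b a", "a"], ["b b c"]]
print(run_job(splits, use_combiner=True))
print(run_job(splits, use_combiner=False))
```

Both runs produce the same counts; only the number of transferred pairs differs. Using more than one reducer also reflects the last tip, since a single reduce would serialize the final aggregation.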

12 Big Data Subprojects
Big data programming frameworks
  MapReduce (batch): HDFS & Hadoop, Dryad
  MapReduce (iterative): HaLoop, Twister
  MapReduce (streaming): Storm (Twitter), S4 (Yahoo), InfoSphere Streams (IBM), HStreaming
NoSQL DBs: HBase (master, slaves), Cassandra (P2P, gossip, no master server), Dynamo (Amazon), MongoDB (for text)
Graph processing engines: Pregel, Giraph, Trinity, Neo4J, TurboGraph
IoT platforms (NoSQL DB + analytics solutions): Allseen, Predix

13 App Search Recommendation Social Network BI Bio Minimize Data Transfer Which platform? Generalization Feasible? Approximate? SSD-aware mining SW Platform HW Infra Big Data subprojects: MapReduce, NoSQL DB Storage Move CPU to Data Minimize Data Transfer Search, Recommendation,.. Text, Graph, Multimedia,.. Batch, Streaming SSD-aware platform Scalability Scale-out cost Energy efficiency Load balancing SSD where?

14 SSD vs. HDD [Figure: Terasort, HDD (32 nodes) vs. SSD (eMLC, 16 nodes); x-axis: data size (GB); y-axis: total time (sec)]

15 SSD: where?
Replacement model: replace HDD with SSD. Throw out the HDDs? Big data => expensive scale-out?
Caching model: use SSD as a cache between memory and HDD. Ratio of SSD to HDD? Data duplication?
Tiering model: put hot data (Tier-0) on SSD and cold data (Tier-1) on HDD. Data migration?
Distributed model: don't care about migration, don't care about ratio, no duplication, no need to throw out HDDs; load balancing by Hadoop.
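To make the tiering model's "data migration?" question concrete, here is a minimal sketch. The placement policy (most frequently accessed blocks go to SSD, up to a fixed capacity), the block names, and both function names are assumptions for illustration; the slides do not specify a policy.

```python
def assign_tiers(access_counts, ssd_capacity):
    """Place the most frequently accessed blocks on SSD, the rest on HDD."""
    # Sort by descending access count; break ties by block name for determinism.
    ranked = sorted(access_counts, key=lambda b: (-access_counts[b], b))
    hot = set(ranked[:ssd_capacity])
    return {b: ("SSD" if b in hot else "HDD") for b in access_counts}

def migrations(old_placement, new_placement):
    """Blocks whose tier changed between two placements (the migration cost)."""
    return sorted(b for b in new_placement
                  if old_placement.get(b) != new_placement[b])

# Hypothetical access counts observed in two consecutive periods.
period1 = assign_tiers({"a": 10, "b": 1, "c": 5}, ssd_capacity=2)
period2 = assign_tiers({"a": 1, "b": 9, "c": 5}, ssd_capacity=2)
print(period1)
print(migrations(period1, period2))
```

When the hot set shifts between periods, blocks have to move between tiers, which is the overhead the distributed model avoids.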

16 Reality
LinkedIn: developed the NoSQL database Voldemort, which uses SSDs.
Twitter: optimized MySQL for SSDs (e.g., page-flushing behavior, reduction in writes to disk).
Amazon: developed the SSD-based NoSQL database DynamoDB as a new service in AWS.
eBay: replaced its internal virtual storage layer with 100TB of SSDs (2011); saw a 50% reduction in rack space, a 78% drop in power consumption, and a 5x boost in I/O performance.
Microsoft: replaced the Bing Search runtime filesystem with Intel SSDs (2011); uses Intel SSDs in its key-value storage for Bing social search; Microsoft Research is working on a flash server farm called CORFU (Cluster Of Raw Flash Units).
Facebook: improved MySQL performance by adding Fusion-io as a caching layer.

17 SSD: what else? Fast random access Supporting Parallel I/O => Enabling Overlapped Processing between CPU and I/O

18 Overlapped Processing of Big Graph Analytics and Big Matrix Factorization
TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC, KDD
OPT: A New Framework for Overlapped and Parallel Triangulation in Large-scale Graphs, SIGMOD
Fast and Robust Parallel Matrix Factorization for Recommendation, KDD, 2015

19 Relevance Feedback Search

25 RefMed: Relevance Feedback Search Engine for PubMed (POSTECH Data Mining Lab.)

26 RefMed: Relevance Feedback Search Engine for PubMed (POSTECH Data Mining Lab.)

27 Research Issues and Related Publications
Accurately learning ranking functions
  Rank learning algorithms [H Yu and S Kim ACM CIKM 2010], [H Yu.. PAKDD 2009]
  Selective sampling [H Yu and S Kim IEEE ICDM 2010], [H Yu DAMI 2010]
Efficiently processing the ranking functions to return top-k quickly
  Indexing SVM ranking function [H Yu.. ACM SIGMOD 2011]
  Fast processing of ranking function for sequence data [WS Han.. H Yu, ACM SIGMOD 2011]
Seamlessly integrating into RDBMS [H Yu.. KDD 2012 (Demo)], [H Yu.. CIKM 2009], [H Yu.. BMC Bioinformatics 2010]
RefMED Search Engine =>

28 Blackbox Video Search [Knowledge Based Systems 2014]

30 GeoSearch : Georeferenced Video Search

31 GeoSearch : Georeferenced Video Search

32 GeoTree
GeoTree: indexing tree for georeferenced video search.
GeoTree is a kind of R-Tree that adopts MBTRs in its leaf nodes (MBTR: Minimum Bounding Tilted Rectangle).
MBTR enables fast query processing.
[Figure: mean processing time and standard deviation for MBR-Filter, R-Tree, and GeoTree]

33 Novel Recommendation [ICDM 2011]

34 Recommendation "These movies are awesome! Best movies of my life!" "Hmm... Then, how about this one?"

35 Novel Recommendation
Existing recommendation systems (e.g., the Netflix system) focus on accurately predicting purchases.
  They are evaluated on prediction accuracy.
  They tend to recommend popular items.
Recommending popular items, however, is not effective.
  Users likely already know the items and have likely already decided whether to purchase them, e.g., a recommendation to watch Star Wars or Titanic.
  High accuracy but low effectiveness.
What is effective recommendation?
  Recommend unexpected (or novel) items that could surprise users and affect their purchase decisions.

36 Popularity

37 Personal Popularity Tendency [Figure: two per-user panels (one labeled User 585); x-axis: box office gross (log scale); y-axis: weight]

38 Incorporating PPT difference into recommendation
Objective: maximize recommendation accuracy while minimizing PPT difference (balance between accuracy and novelty).
Transform to a bin-packing problem => greedy algorithm.

39 Accuracy, Coverage, Diversity

40 Timing When to Recommend [Information Sciences 2014]

41 Overlapped Processing of Big Graph Analytics and Big Matrix Factorization
TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC, KDD
OPT: A New Framework for Overlapped and Parallel Triangulation in Large-scale Graphs, SIGMOD
Fast and Robust Parallel Matrix Factorization for Recommendation, KDD, 2015

42 Ongoing Projects
High-level analytics
  Life Log Machine Learning
  Mining Minds (layered knowledge bases for high-level services)
  Mining unstructured data and personalized recommendation services
  Mining network logs for security (hacking or virus?)
  Cancer subtype detection and prediction
Processing platforms
  Big data processing platforms for video and multimedia data
  Big graph processing engines

43 POSTECH Clusters : 150 nodes * nodes of 12 cores, 24G mem, 3T HDD, Connected by 2G network nodes of 16 cores, 36G mem, 4T HDD, 200G SSD, Connected by 40G network

44 Facts on POSTECH
POSTECH ranked 23rd, 53rd, and 50th in the Times ranking of world universities in 2010, 2011, and 2012, respectively (1st in Korea, 6th in Asia).
1st among the world's universities under 50 years old.
POSTECH attains the largest research funds per faculty member among the universities in Korea.
Undergraduate students are from the top 1% of high school students in Korea.
Bilingual campus: English and Korean.

45 Thank You! Q/A
