Refactoring Earthquake-Tsunami Causality and Messaging via Big Data Analytics: The Transformative Potential of Credible Tweets


1 Refactoring Earthquake-Tsunami Causality and Messaging via Big Data Analytics: The Transformative Potential of Credible Tweets
L. I. Lumb (1,2) & J. R. Freemantle (3)
(1) York University, (2) Univa Corporation & (3) Independent
MCBDA 2016 (First Workshop), PVAMU, May 17, 2016

2 Agenda
Motivation
Traditional Data
Social-Networking Data
Graphs, Semantics & Machine Learning
Conclusions

3 Geist, E.L., Titov, V.V., and Synolakis, C.E., 2006, Tsunami: wave of change: Scientific American, v. 294, p

7 Motivation
Non-deterministic cause
Lead times
Availability of actionable observations
Communication of situation - advisories, warnings, etc.
Cause-effect relationship
Uncertainty inherent in any attempt to predict earthquakes
In situ measurements may reduce uncertainty
Energy transfer - inputs... coupling... outputs
Geometry - bathymetry and topography
Other factors - e.g., tides
Established effect
Far-field estimates of tsunami propagation (pre-computed) and coastal inundation (real-time) have proven to be extremely accurate... requires a distributed array of deep-ocean tsunami detection buoys + forecasting model


11 Lumb & Aldridge,


14 6Vs: Scientific vs. Social Networking Data
Dimension: GGP Scientific Data / Twitter SN Data
Volume: small, finite / BIG, infinite
Variety: semi-structured, restricted / unstructured, unrestricted except for IDs, hashtags & URLs (pages, images)
Velocity: slow, sampled / fast, streamed
Veracity: biases, noise & abnormalities
Validity: accuracy & correctness
Volatility: low (stationary, irreplaceable) / high? (mobile?, disposable?)

15 Machine Learning Pipeline
Karau et al., Learning Spark, O'Reilly, 2015

16 Deep Learning from Twitter?
Represent data
  Twitter data manually curated into ham and spam
  In-memory representation via Spark RDDs
Extract features
  Frequency-based usage via Spark MLlib HashingTF feature vectors
Develop model object
  Spark MLlib LogisticRegressionWithSGD used for classification
Evaluate model
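The four pipeline steps on slide 16 can be sketched without a cluster. The block below is a minimal pure-Python stand-in, not the authors' Spark code: `hashing_tf` imitates the hashing trick behind MLlib's HashingTF, and `train_logistic_sgd` imitates logistic regression trained by stochastic gradient descent. The feature-space size, step size, epoch count, and the four labelled tweets are all invented for illustration.

```python
import math
import zlib

NUM_FEATURES = 1000  # hypothetical size of the hashed feature space


def hashing_tf(text, num_features=NUM_FEATURES):
    """Extract features: hash each term to a bucket and count occurrences."""
    vector = {}
    for term in text.lower().split():
        # crc32 (not the built-in hash) keeps bucket indices stable across runs
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def score(model, features):
    """Estimated probability that a sparse feature vector is spam."""
    weights, bias = model
    return sigmoid(bias + sum(weights[i] * v for i, v in features.items()))


def train_logistic_sgd(examples, num_features=NUM_FEATURES, epochs=200, step=0.5):
    """Develop model: one gradient step per example, repeated for several epochs."""
    weights, bias = [0.0] * num_features, 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = score((weights, bias), features) - label
            bias -= step * error
            for i, v in features.items():
                weights[i] -= step * error * v
    return weights, bias


def predict(model, features):
    return 1 if score(model, features) >= 0.5 else 0


def accuracy(model, labelled):
    """Evaluate model: fraction of labelled texts classified correctly."""
    correct = sum(predict(model, hashing_tf(text)) == label for text, label in labelled)
    return correct / len(labelled)


# Represent data: invented toy tweets (not from the talk); 1 = spam, 0 = ham
LABELLED_TWEETS = [
    ("win a free prize click now", 1),
    ("free followers click this link", 1),
    ("magnitude 6 earthquake reported near kumamoto", 0),
    ("tsunami advisory issued after offshore earthquake", 0),
]

if __name__ == "__main__":
    training = [(hashing_tf(text), label) for text, label in LABELLED_TWEETS]
    model = train_logistic_sgd(training)
    print("training accuracy:", accuracy(model, LABELLED_TWEETS))
```

Accuracy on four training tweets says nothing about real performance; a meaningful Evaluate step would score a held-out set of curated tweets.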

18 Future Work
Machine Learning
Multiparameter credibility - TweetCred + ML + RDF/OWL GA
Cloud-native platform
Classification algorithms... with categories?
Training
Experiments
Larger data sets
Degrees of hammyness
Stop-word removal, stemming,...
Real-time streaming - data from Twitter
Containerization, dynamic scheduling and microservices
Other examples
Alberta wildfires
Industrial incidents
Hurricanes


20 Conclusions
Credible tweets could be transformative
Mission-critical Big Data complement to existing data sources and approaches
Current challenges/opportunities
Twitter Data
  Extraction - only 100 tweets at a time (!!!)
  Curation - manual (read: time-consuming!!!)
Emphasizing Machine Learning... appears encouraging, BUT...
  Graph Analytics... as well???
  Semantics... as well???

21 Q&A
L. I. Lumb (1,2) & J. R. Freemantle (3)
(1) ianlumb@yorku.ca, (2) ilumb@univa.com & (3) james.freemantle@rogers.com

22 jp/jma/en/2016_kumamoto_earthquake/2016_kumamoto_earthquake.html
Graph Analytics Problem

24 Perl script prototype
Acquires tweets with the keyword earthquake

    use strict;
    use warnings;
    use Net::Twitter::Lite::WithAPIv1_1;

    # Credentials are elided placeholders, as on the slide
    my $nt = Net::Twitter::Lite::WithAPIv1_1->new(
        consumer_key        => 'xxxx...xxxxxxx',
        consumer_secret     => 'xxxxxx...xxxxxxxxxx',
        access_token        => 'xxxxx...xxxxxxxxxxx',
        access_token_secret => 'xxxxx...xxxxxxxxxxx',
        ssl                 => 1,
    );

    # Search for tweets containing the keyword and print the text of each one
    my $result = $nt->search("earthquake");
    for my $status ( @{ $result->{statuses} } ) {
        print "$status->{text}\n";
    }
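The Conclusions slide notes that extraction yields only 100 tweets at a time. One common way around a per-request cap is to page backwards through results, asking each time for tweets older than the oldest one already seen; the Twitter v1.1 search endpoint exposes `count` and `max_id` parameters for this purpose. The sketch below shows only the cursoring logic, in Python rather than Perl, under stated assumptions: `fetch_page` is a hypothetical callable, and `make_fake_search` is an in-memory stand-in so the loop can run without credentials or network access.

```python
def collect_tweets(fetch_page, query, limit, page_size=100):
    """Gather up to `limit` tweets by requesting successively older pages."""
    collected = []
    max_id = None  # None means "start from the newest tweets"
    while len(collected) < limit:
        count = min(page_size, limit - len(collected))
        page = fetch_page(query, count, max_id)
        if not page:
            break  # no older results remain
        collected.extend(page)
        # Next request: only tweets strictly older than the oldest one seen
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected


def make_fake_search(total):
    """Hypothetical stand-in for a search endpoint; returns newest tweets first."""
    archive = [{"id": i, "text": f"earthquake report {i}"} for i in range(total, 0, -1)]

    def fetch_page(query, count, max_id):
        matches = [
            tweet for tweet in archive
            if query in tweet["text"] and (max_id is None or tweet["id"] <= max_id)
        ]
        return matches[:count]

    return fetch_page


if __name__ == "__main__":
    tweets = collect_tweets(make_fake_search(250), "earthquake", limit=1000)
    print(len(tweets), "tweets collected")
```

A real client would also have to respect rate limits and the limited look-back window of the search index, neither of which is modelled here.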

25 Resilient Distributed Datasets (RDDs)
Abstraction for in-memory computing
Fault-tolerant, parallel data structures
  Cluster-ready
Optionally persistent
Can be partitioned for optimal placement
Manipulated via operators
Zaharia et al., NSDI 2012
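The RDD properties on slide 25 can be illustrated with a toy, single-process class. This is a hypothetical teaching sketch, not Spark's API: `ToyRDD` and its methods are invented names that loosely mirror the ideas of partitioned data, lazily recorded operators (the lineage from which a lost partition could be recomputed), and optional persistence.

```python
class ToyRDD:
    """Hypothetical single-process imitation of an RDD; not Spark's API."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions   # source data, split into partitions
        self._lineage = tuple(lineage)  # recorded operators, applied lazily
        self._persistent = False
        self._cache = None

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        """Split items into contiguous partitions."""
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        return cls([items[i:i + size] for i in range(0, len(items), size)])

    def map(self, fn):
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def persist(self):
        """Keep computed partitions in memory after the first collect()."""
        self._persistent = True
        return self

    def _compute_partition(self, partition):
        # Replaying the lineage is also how a lost partition would be rebuilt
        rows = partition
        for kind, fn in self._lineage:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows

    def collect(self):
        """Run the recorded operators and return all rows in partition order."""
        if self._cache is not None:
            computed = self._cache
        else:
            computed = [self._compute_partition(p) for p in self._partitions]
            if self._persistent:
                self._cache = computed
        return [row for partition in computed for row in partition]


if __name__ == "__main__":
    squares = ToyRDD.parallelize(range(6), num_partitions=3).map(lambda x: x * x)
    print(squares.filter(lambda x: x % 2 == 0).collect())
```

In Spark itself, partitions live on different cluster nodes and the scheduler decides placement; the toy only shows why recording operators rather than results makes recomputation possible.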
