Social Media & Text Analysis

Size: px

Start display at page:

Download "Social Media & Text Analysis"

Carmel Lane
5 years ago
Views:

1 Social Media & Text Analysis lecture 5 - POS/NE Tagging CSE Ohio State University Instructor: Alan Ritter Website: socialmedia-class.org

2 NLP Pipeline (summary so far) classification (Naïve Bayes) Regular Expression Language Identification Tokenization Part-of- Speech (POS) Tagging Shallow Parsing (Chunking) Named Entity Recognition (NER) Stemming Normalization Alan Ritter socialmedia-class.org

3 NLP Pipeline (next) Language Identification Tokenization Part-of- Speech (POS) Tagging Shallow Parsing (Chunking) Named Entity Recognition (NER) Stemming Sequential Tagging Normalization Alan Ritter socialmedia-class.org

4 Challenge: Natural Language Processing Breaks 4

5 Challenge: Natural Language Processing Breaks LOCATION PERSON 4

6 Challenge: Natural Language Processing Breaks Stanford NER: LOCATION PERSON ~50% Drop Newswire Twitter

7 Part-of-Speech (POS) Tagging Cant MD wait VB for IN the DT ravens NNP game NN tomorrow NN : go VB ray NNP rice NNP!!!!!!!. Alan Ritter socialmedia-class.org

8 Alan Ritter socialmedia-class.org Penn Treebank POS Tags

9 Part-of-Speech (POS) Tagging Words often have more than one POS: - The back door = JJ - On my back = NN - Win the voters back = RB - Promised to back the bill = VB POS tagging problem is to determine the POS tag for a particular instance of a word. Alan Ritter socialmedia-class.org Source: adapted from Chris Manning

10 Twitter-specific Tags url address emoticon discourse marker symbols Alan Ritter socialmedia-class.org Source: Gimpel et al. Part-of-Speech Tagging for Twitter : Annotation, Features, and Experiments ACL 2011

11 Noisy Text: Challenges Lexical Variation (misspellings, abbreviations) `2m', `2ma', `2mar', `2mara', `2maro', `2marrow', `2mor', `2mora', `2moro', `2morow', `2morr', `2morro', `2morrow', `2moz', `2mr', `2mro', `2mrrw', `2mrw', `2mw', `tmmrw', `tmo', `tmoro', `tmorrow', `tmoz', `tmr', `tmro', `tmrow', `tmrrow', `tmrrw', `tmrw', `tmrww', `tmw', `tomaro', `tomarow', `tomarro', `tomarrow', `tomm', `tommarow', `tommarrow', `tommoro', `tommorow', `tommorrow', `tommorw', `tommrow', `tomo', `tomolo', `tomoro', `tomorow', `tomorro', `tomorrw', `tomoz', `tomrw', `tomz Unreliable Capitalization The Hobbit has FINALLY started filming! I cannot wait! Unique Grammar watchng american dad. 7 Alan Ritter socialmedia-class.org

12 Chunking Cant wait for the ravens game tomorrow go ray rice!!!!!!! VP PP NP NP VP NP Alan Ritter socialmedia-class.org

13 Chunking recovering phrases constructed by the part-of-speech tags a.k.a shallow (partial) parsing: - full parsing is expensive, and is not very robust - partial parsing can be much faster, more robust, yet sufficient for many applications - useful as input (features) for named entity recognition or full parser Alan Ritter socialmedia-class.org

14 Named Entity Recognition(NER) Cant wait for the ravens ORG game tomorrow go ray PER rice!!!!!!!. ORG: organization PER: person LOC: location Alan Ritter socialmedia-class.org

15 NER: Basic Classes Cant wait for the ravens ORG game tomorrow go ray PER rice!!!!!!!. ORG: organization PER: person LOC: location Alan Ritter socialmedia-class.org

16 Noisy Text: NLP breaks POS: Chunk: NER:

17 Noisy Text: NLP breaks POS: Chunk: NER:

18 Noisy Text: NLP breaks POS: Chunk: NER:

19 Noisy Text: NLP breaks POS: Chunk: NER:

20 Noisy Text: NLP breaks POS: Chunk: NER: Noisy Style

21 NER: Rich Classes Alan Ritter socialmedia-class.org Source: Strauss, Toma, Ritter, de Marneffe, Xu Results of the WNUT16 Named Entity Recognition Shared Task 2016)

22 NER: Genre Differences PER News Politicians, business leaders, journalists, celebrities Tweets Sportsmen, actors, TV personalities, celebrities, names of friends LOC Countries, cities, rivers, and other places related to current affairs Restaurants, bars, local landmarks/areas, cities, rarely countries ORG Public and private companies, government organisations Bands, internet companies, sports clubs Alan Ritter socialmedia-class.org Source: Kalina Bontcheva and Leon Derczynski Tutorial on Natural Language Processing for Social Media EACL 2014

23 Weakly Supervised NER Freebase / Wikipedia lists provide a source of supervision But these lists are highly ambiguous Example: China

24 Weakly Supervised NER Freebase / Wikipedia lists provide a source of supervision But these lists are highly ambiguous Example: China

25 Weakly Supervised NER Freebase / Wikipedia lists provide a source of supervision But these lists are highly ambiguous Example: China

26 Weakly Supervised NER Freebase / Wikipedia lists provide a source of supervision But these lists are highly ambiguous Example: China

27 Weakly Supervised NER Freebase / Wikipedia lists provide a source of supervision But these lists are highly ambiguous Example: China

28 Distant Supervision with Latent Variables [Ritter, et. al. EMNLP 2011] Latent variable model for Named Entity Categorization with constraints

29 [Ritter, et. al. EMNLP 2011] Apple Obama JFK On my way to JFK early in the JFK 's bomber jacket sells for JFK Airport s Pan Am Worldport Waiting at JFK for our ride When JFK threw first pitch on " "

30 [Ritter, et. al. EMNLP 2011] Apple Obama JFK On my way to JFK early in the JFK 's bomber jacket sells for JFK Airport s Pan Am Worldport Waiting at JFK for our ride When JFK threw first pitch on " " s 0.04 threw 0.02 jacket 0.01 waiting 0.04 ride 0.03 way 0.02 announced 0.04 new 0.03 release 0.02 PERSON FACILITY PRODUCT

31 [Ritter, et. al. EMNLP 2011] Apple Obama JFK On my way to JFK early in the JFK 's bomber jacket sells for JFK Airport s Pan Am Worldport Waiting at JFK for our ride When JFK threw first pitch on " " s 0.04 threw 0.02 jacket 0.01 waiting 0.04 ride 0.03 way 0.02 announced 0.04 new 0.03 release 0.02 PERSON FACILITY PRODUCT

32 [Ritter, et. al. EMNLP 2011] Apple Obama JFK On my way to JFK early in the JFK 's bomber jacket sells for JFK Airport s Pan Am Worldport Waiting at JFK for our ride When JFK threw first pitch on " " X s 0.04 threw 0.02 jacket 0.01 waiting 0.04 ride 0.03 way 0.02 announced 0.04 new 0.03 release 0.02 PERSON FACILITY PRODUCT

33 [Ritter, et. al. EMNLP 2011] JFK Apple Obama On my way to JFK early in the JFK 's bomber jacket sells for JFK Airport s Pan Am Worldport Waiting at JFK for our ride When JFK threw first pitch on " " X s 0.04 threw 0.02 jacket 0.01 waiting 0.04 ride 0.03 way 0.02 announced 0.04 new 0.03 release 0.02 PERSON FACILITY PRODUCT

34 [Ritter, et. al. EMNLP 2011] Example Type Lists

35 [Ritter, et. al. EMNLP 2011] Example Type Lists KKTNY = Kourtney and Kim Take New York RHOBH = Real Housewives of Beverly Hills

36 [Ritter, et. al. EMNLP 2011] Example Type Lists KKTNY = Kourtney and Kim Take New York RHOBH = Real Housewives of Beverly Hills

37 Twitter NER: [Ritter, et. al. EMNLP 2011] F1 Classification Results Majority Baseline Freebase Baseline Supervised Baseline DL-Cotrain LLDA (Collins and Singer 99)

38 Twitter NER: [Ritter, et. al. EMNLP 2011] F1 0.7 Classification Results 25% increase in F Majority Baseline Freebase Baseline Supervised Baseline DL-Cotrain LLDA (Collins and Singer 99)

39 Alan Ritter socialmedia-class.org Tool: twitter_nlp

40 Alan Ritter socialmedia-class.org Tool: twitter_nlp

41 Results of the WNUT16 Named Entity Recognition Shared Task Benjamin Strauss, Bethany Toma, Alan Ritter, Marie- Catherine de Marneffe and Wei Xu

42 Need for Shared Evaluations Fast Moving Area: Papers published in the same year use different datasets and evaluation methodology Performance still behind what we would like ~ F1 score (much lower than news) Explore new ideas & approaches

43 Related NER Evaluations MUC Newswire CONLL Newswire ACE Newswire Named Entity recognition and Linking (NEEL) Challenge Microblogs #Microposts workshop at WWW

44 Twitter NER Evaluation Summary Re-Run of 2015 Task 2 Subtasks Segmentation + 10 way classification Segmentation only (no classification)

45 Twitter NER Evaluation Summary Re-Run of 2015 Task 2 Subtasks Segmentation + 10 way classification Segmentation only (no classification) New test set annotated for 2016

46 Twitter NER Evaluation Summary Re-Run of 2015 Task 2 Subtasks Segmentation + 10 way classification Segmentation only (no classification) New test set annotated for Participating Teams

47 Data Training + Dev Data: All training, dev, test data from 2015 Training: 2,394 tweets, Dev: 1,420 tweets

48 Data Training + Dev Data: All training, dev, test data from 2015 Training: 2,394 tweets, Dev: 1,420 tweets Test Data 3,856 tweets Later Time Period No overlap in time period with Training/Dev data

49 2016 Test Data Annotation Simple Annotation Guidelines: Re-annotated 100 tweets from 2015 data Good agreement F-Score

2016 Test Data Annotation Simple Annotation Guidelines: http://bit.

50 2016 Test Data Annotation Simple Annotation Guidelines: Re-annotated 100 tweets from 2015 data Good agreement F-Score Frequent questions, discussion among the group BRAT annotation tool (

51 Annotation Interface

52 10 Participating Teams

53 Approaches All teams use some form of machine supervised ML Many LSTM-based approaches as compared to last year Unique approaches: CambridgeLTL, Talos

54 Results (10 types)

55 Results (No Types)

56 Domain-Specific Data Cybersecurity (350 Tweets) Gun Violence (500 Tweets)

57 Results (Cyber-Domain)

58 Results (Shooting-Domain)

59 Summary classification (Naïve Bayes) Regular Expression Language Identification Tokenization Part-of- Speech (POS) Tagging Shallow Parsing (Chunking) Named Entity Recognition (NER) Stemming Sequential Tagging Normalization Alan Ritter socialmedia-class.org

60 Alan Ritter socialmedia-class.org Presentation 2

GATE and Social Media: Normalisation and PoS-tagging

GATE and Social Media: Normalisation and PoS-tagging Leon Derczynski Kalina Bontcheva The University of Sheffield, 1995-2016 This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs