Lecture11b: NLP (Introduction)




CS540, 4/10/18

Announcements
  - Project #1
      - Group grades have been mailed out
      - Individual grades are posted on Canvas
          - similar to group grades
          - if teams didn't turn in columns, I assumed balanced, but next time
  - An extra week for Project #2
      - Having to make a teaming change
      - Code freeze is now April 17
      - Papers due April 24

Material borrowed (with permission) from James Pustejovsky & Marc Verhagen of Brandeis. Mistakes are undoubtably mine.

NLP
  - Natural Language Processing is a big topic
  - We will start with an overview of NLP for IR
      - Contains all the key components of NLP
      - We can dive into subtopics on demand
  [Diagram labels: Computational Linguistics, Retrieval, Data Mining, Artificial Intelligence]

Motivation of Bringing Structure to Text
  - The prevalence of unstructured data
  - Structures are useful for knowledge discovery
  - Too expensive to be structured by human: Automated & scalable
  - Up to 85% of all information is unstructured -- estimated by industry analysts
  - Vast majority of the CEOs expressed frustration over their organization's inability to glean insights from available data -- IBM study with 1500+ CEOs

Overload: A Critical Problem in Big Data Era
  - By 2020, information will double every 73 days -- G. Starkweather (Microsoft), 92
  - Unstructured or loosely structured data are prevalent

Example: Research Publications
  - Every year, hundreds of thousands of papers are published
  - Unstructured data: paper
  - Loosely structured entities: authors, venues
  [Diagram labels: papers, author, venue, growth]


Example: News Articles
  - Every day, >90,000 news articles are produced
  - Unstructured data: news content
  - Extracted entities: persons, locations, organizations, ...
  [Diagram labels: news, person, location, organization]

Example: Social Media
  - Every second, >150K tweets are sent out
  - Unstructured data: tweet content
  - Loosely structured entities: twitters, hashtags, URLs, ...
  [Diagram labels: tweets, twitter, hashtag, URL; examples: Darth Vader, #maythefourthbewithyou, The White House]

Useful Structure from Text: Phrases, Topics, Entities
  - Top 10 active politicians and phrases regarding healthcare issues?
  - Top 10 researchers and phrases in data mining and their specializations?
  [Diagram labels: Entities, entity, Topics (hierarchical), Phrases]

Pipeline of NLP Tools
  - Scraping (not covered here)
  - Sentence splitting
  - Tokenization
  - (Stemming)
  - Part-of-speech tagging
  - Shallow parsing
  - Named entity recognition
  - Syntactic parsing
  - (Semantic Role Labeling)

Natural Language Processing
  [Diagram: raw (unstructured) input passes through part-of-speech tagging, named entity recognition and deep syntactic parsing, supported by a lexicon and an ontology, to give annotated (structured) output.]
  - Example sentence: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.
  [Annotation labels shown on the example: parse tree nodes S, NP, VP, PP; part-of-speech tags NN, NNS, IN, VBZ, VBN, JJ; entity types protein_molecule, organic_compound, cell_line; relation: negative regulation]

Sentence splitting
  - Input:
      Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell proinflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection
  - Output, one sentence per line:
      Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
      However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
      Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

A heuristic rule for sentence splitting
  - sentence boundary = period + space(s) + capital letter
  - Regular expression in Perl:
      s/\. +([A-Z])/\.\n\1/g;

Errors
  - Input:
      IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).
  - Output (the rule wrongly splits after "e.g."):
      IL-33 is known to induce the production of Th2-associated cytokines (e.g.
      IL-5 and IL-13).
  - Two solutions:
      - Add more rules to handle exceptions
      - Machine learning

Tokenization
  - Convert a sentence into a sequence of tokens
      The protein is activated by IL2.
      → The protein is activated by IL2 .
  - Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters!

Tokenization Issues
  - separate possessive endings or abbreviated forms from preceding words:
      Mary's → Mary 's
      Mary's → Mary is
      Mary's → Mary has
  - separate punctuation marks and quotes from words:
      Mary. → Mary .
      "new" → " new "

Tokenization problems (K. Cohen, NAACL-2007)
  - Commas
      2,6-diaminohexanoic acid
      tricyclo( ,7)decanone
  - Four kinds of hyphens
      - Syntactic: Calcium-dependent, Hsp-60
      - Knocked-out gene: lush-- flies
      - Negation: -fever
      - Electric charge: Cl-

Indexing over
  [Diagram: Problem → Representation (search formulation) → Query (q_i); Objects → Representation (indexing) → Surrogate (D); Query and Surrogate meet in Comparison.]
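The "period + space(s) + capital letter" heuristic from the sentence splitting slide, and the first of the two suggested fixes (more rules for exceptions), can be sketched in Python. This is a minimal illustration rather than the lecture's code: the function names and the ABBREVIATIONS list are assumptions made for the example, and a real exception list would be much longer.

```python
import re

# Same rule as the Perl substitution: period + space(s) + capital letter.
BOUNDARY = re.compile(r"\. +(?=[A-Z])")

# Hypothetical exception list; a real system would need many more entries.
ABBREVIATIONS = ("e.g.", "i.e.", "et al.", "Fig.")


def split_naive(text):
    """Split at every period that is followed by space(s) and a capital letter."""
    return re.sub(r"\. +([A-Z])", r".\n\1", text).split("\n")


def split_with_exceptions(text):
    """Apply the same rule, but skip boundaries right after a listed abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start() + 1]  # include the period itself
        if candidate.endswith(ABBREVIATIONS):
            continue  # not a real boundary, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    sentences.append(text[start:].strip())
    return sentences


if __name__ == "__main__":
    sample = ("IL-33 is known to induce the production of Th2-associated "
              "cytokines (e.g. IL-5 and IL-13). The protein is activated by IL2.")
    print(split_naive(sample))            # breaks after "e.g." as on the Errors slide
    print(split_with_exceptions(sample))  # keeps the first sentence whole
```

Every new abbreviation needs a new entry in the list, which is why the slide offers machine learning as the second solution.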

[Figure: logical view of a document, from full text to index terms. Stages shown: structure recognition, accents and spacing etc., stopwords, noun groups, stemming, automatic or manual indexing. Outputs shown: structure, full text, index terms.]

Automatic Indexing
  - Choose from the terms in a document those which are most indicative of its content.
  - Contrast with full-text retrieval
  - For non-boolean retrieval include weights with terms (more later).

Normalizing terms
  - Should numbers, units ("km/h"), etc. be included?
  - Should "traffic" and "Traffic" be one term?
  - Should "compute", "computer", "computation", "computerisation" be all one term?
  - Stemming is the process of removing suffixes so that these are all mapped to "comput"

Word frequency characteristics
  - Zipf's Law: rank * frequency ≈ constant
    (Most frequent word twice as common as second most frequent, three times as common as the third most frequent, etc.)

Statistical Indexing - Basis
  - Frequent words are important content representation words.
  - except content-free function words like the, and, or, but, of, in, it, he, ...
  - middle-frequency words are the best for indexing documents. (?)

Basic Indexing Strategy
  1 list the unique words in the documents
  2 remove stopwords (about 250 for English)
  3 stem remaining words (improves recall)
  4 assign as index terms either
      A - all resulting terms or
      B - all but very rare terms (they won't retrieve much)
      C - terms that are most frequent in the doc.
      D - terms weighted highly by other measures
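The four steps of the Basic Indexing Strategy can be sketched in a few lines of Python. Everything named here is an assumption for illustration: STOPWORDS is a tiny stand-in for the roughly 250 English stopwords mentioned on the slide, SUFFIXES and crude_stem stand in for a real stemmer, and min_count is a hypothetical knob that approximates option B (drop very rare terms).

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the slide cites about 250 for English.
STOPWORDS = {"the", "and", "or", "but", "of", "in", "it", "he", "a", "is", "to"}

# Hypothetical suffix list standing in for a real stemmer such as Porter's.
SUFFIXES = ("erisation", "ation", "ers", "er", "ing", "es", "e", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def index_terms(text, min_count=1):
    """Steps 1-4: list words, drop stopwords, stem, keep terms seen >= min_count times."""
    words = re.findall(r"[a-z]+", text.lower())  # lowercasing merges traffic/Traffic
    stems = [crude_stem(w) for w in words if w not in STOPWORDS]
    counts = Counter(stems)                      # unique terms with their frequencies
    return {term: n for term, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    print(index_terms("Compute, computer, computation and computerisation"))
    print(index_terms("The traffic and the Traffic"))
```

The returned counts can double as simple term weights for non-boolean retrieval; options C and D would replace the min_count filter with a different selection rule.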

Surrogate/Query Comparison
  [Diagram: Problem → Representation (search formulation) → Query (q_i); Objects → Representation (indexing) → Surrogate (D); Query and Surrogate meet in Comparison.]

Implementation Details
  - Indexing results in records like
      Doc12: napoleon, france, revolution, emperor
    or (weighted terms)
      Doc12: napoleon-8, france-6, revolution-4, emperor-7
  - To find all documents about Napoleon would involve looking at every document's index record (possibly 1,000s or millions).
  - (Assume that Doc12 references another file which contains other details about the document.)

Implementation - Inverted Files
  - Instead the information is "inverted":
      napoleon : doc12, doc56, doc87, doc99
    or (weighted)
      napoleon : doc12-8, doc56-3, doc87-5, doc99-2
  - inverted file contains one record per index term
  - inverted file is organized so that a given index term can be found quickly.

Inverted File (on Tokens)
  - Inverted file: a list of the tokens in a set of documents and the documents in which they appear.

      Word      Document
      abacus    3, 22
      actor     2, 29
      aspen     5
      atoll

  - Stop words are removed before building the index.

Keywords and Controlled Vocabulary
  - Keyword: A term that is used to describe the subject matter in a document. It is sometimes called an index term.
  - Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer.
  - Controlled vocabulary: A list of words that can be used as keywords, e.g., in a medical system, a list of medical terms.
  - Inverted file (more complete definition): A list of the keywords that apply to a set of documents and the documents in which they appear.

Enhancements to Inverted Files
  - Location: The inverted file holds information about the location of each term within the document.
      Uses:
      - adjacency and near operators
      - user interface design -- highlight location of search term
  - Frequency: The inverted file includes the number of postings for each term.
      Uses:
      - term weighting
      - query processing optimization
      - user interface design
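A small Python sketch can make the inverted file and its two enhancements (location and frequency) concrete. It assumes whitespace tokenization with no punctuation handling, an illustrative STOPWORDS set, and hypothetical function names; none of these come from the slides.

```python
from collections import defaultdict

STOPWORDS = {"the", "of", "and", "a", "was", "in"}  # illustrative only


def build_inverted_file(docs):
    """Map each term to {doc_id: [token positions]}; docs maps doc_id -> text."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            if token in STOPWORDS:
                continue  # stop words are removed before building the index
            inverted[token].setdefault(doc_id, []).append(position)
    return inverted


def postings(inverted, term):
    """Return (doc_id, frequency) pairs for a term, sorted by doc_id."""
    return sorted((doc, len(pos)) for doc, pos in inverted.get(term, {}).items())


if __name__ == "__main__":
    docs = {
        12: "napoleon was emperor of france",
        56: "the revolution and napoleon",
        87: "napoleon napoleon",
    }
    inv = build_inverted_file(docs)
    print(postings(inv, "napoleon"))  # one record per index term, as on the slide
    print(inv["napoleon"][56])        # stored locations support adjacency operators
```

Looking up a term now touches one record instead of every document's index record, which is the point of inverting the file.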

Inverted File (Enhanced)
  [Table: columns Word, Postings, Document, Location; rows abacus, actor, aspen, atoll. The cell values did not survive extraction.]

Organization of Inverted Files
  [Diagram: an index (vocabulary) file listing terms (ant, bee, cat, dog, elk, fox, gnu, hog), each with a pointer to postings; a postings file holding the inverted lists; a documents file.]

Efficiency Criteria
  - Storage
      Inverted files are big, typically 10% to 100% the size of the collection of documents.
  - Update performance
      It must be possible, with a reasonable amount of computation, to:
      (a) Add a large batch of documents
      (b) Add a single document
  - Retrieval performance
      Retrieval must be fast enough to satisfy users and not use excessive resources.

Index File
  - If an index is held on disk, search time is dominated by the number of disk accesses.
  - Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory.

Postings File
  - Since inverted lists may be very long, it is important to match postings efficiently.
  - Usually, the inverted lists will be held on disk. Therefore algorithms for matching postings use sequential file processing.
  - For efficient matching, the inverted lists should all be sorted in the same sequence, usually alphabetic order, "lexicographic index".
  - Merging inverted lists is the most computationally intensive task in many information retrieval systems.

Efficiency and Query Languages
  - Some query options may require huge computation, e.g.,
  - Regular expressions
      If inverted files are stored in alphabetical order,
      - comp* can be processed efficiently
      - *comp cannot be processed efficiently
  - Boolean terms
      If A and B are search terms,
      - A or B can be processed by comparing two moderate sized lists
      - (not A) or (not B) requires two very large lists
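Two of the efficiency points above lend themselves to a short Python sketch: merging sorted inverted lists in a single sequential pass, and answering a prefix query such as comp* on an alphabetically sorted vocabulary. The function names are hypothetical, and the "\uffff" sentinel in prefix_terms is an assumption that works for ordinary text but is not a general-purpose upper bound.

```python
from bisect import bisect_left


def merge_or(a, b):
    """A or B: merge two sorted posting lists in one sequential pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_and(a, b):
    """A and B: keep only document ids present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def prefix_terms(sorted_vocab, prefix):
    """comp*: two binary searches bound a contiguous slice of the vocabulary."""
    lo = bisect_left(sorted_vocab, prefix)
    hi = bisect_left(sorted_vocab, prefix + "\uffff")  # assumed upper sentinel
    return sorted_vocab[lo:hi]


def suffix_terms(sorted_vocab, suffix):
    """*comp: alphabetical order does not help, so every term is checked."""
    return [term for term in sorted_vocab if term.endswith(suffix)]
```

Both merges read each list once from front to back, which suits lists held on disk. The prefix lookup costs two binary searches, while the suffix lookup scans the whole vocabulary, matching the contrast drawn between comp* and *comp.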

Lexeme, Lexicon & Lemma
  - Lexeme: Smallest unit of language which has a meaning (roughly dictionary entry), e.g. run
      - Takes various inflected word forms, e.g. runs, running, ran
      - conduct (verb) is a different lexeme from conduct (noun)
  - Lexicon: A finite set of lexemes (roughly dictionary)
  - Lemma: The canonical or basic form that represents the lexeme, e.g. run

Lemmatization
  - The process of mapping word forms to their lemmas, e.g. running → run
  - Typically done using morphological analysis
  - Often done in NLP to avoid data sparsity, but depending on the application sometimes it may be best to keep the word forms

Lemmatization is not Trivial
  - May depend on the context
      He found the ball → find
      He will found the Institute → found
  - Depends on the part of speech
      He conducted the orchestra → conduct (verb)

Stemming
  - The removal of the inflectional ending from words (strip off any affixes)
      Laughing, laugh, laughs, laughed → laugh
  - Problems
      - Can conflate semantically different words
      - Gallery and gall may both be stemmed to gall
  - A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.

Porter Stemmer
  - Lexicon free stemmer
  - Rewrite rules
      ATIONAL → ATE (e.g. relational, relate)
      FUL → ε (e.g. hopeful, hope)
      SSES → SS (e.g. caresses, caress)
  - Errors of Commission
      Organization → organ
      Policy → police
  - Errors of Omission
      Urgency (not stemmed to urgent)
      European (not stemmed to Europe)

Is stemming useful?
  - For IR, some improvement especially for smaller documents
  - Helps on average, but not a lot
  - Word sense disambiguation on query terms: business may be stemmed to busy, saw (the tool) to see
  - Most studies for stemming for IR done for English; may help more for other languages
  - The possibility of letting people interactively influence the stemming has not been studied much
  - Improved by using a dictionary
      - If stem is not in dictionary, use original word
      - Often called lemmatization when a dictionary is used
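The three rewrite rules quoted on the Porter Stemmer slide, together with the "if stem is not in dictionary, use original word" fallback, can be sketched as follows. This is not the Porter algorithm, which has many more rules plus conditions on the remaining stem; the function names and the sample dictionary are assumptions for illustration.

```python
# Three rewrite rules quoted on the Porter Stemmer slide; the real algorithm
# has many more rules and also checks properties of the remaining stem.
RULES = [("ational", "ate"), ("sses", "ss"), ("ful", "")]


def rewrite(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def stem_with_dictionary(word, dictionary):
    """Keep the rewritten form only if it is a known word, else the original."""
    stem = rewrite(word)
    return stem if stem in dictionary else word.lower()


if __name__ == "__main__":
    known_words = {"relate", "hope", "caress"}  # hypothetical dictionary
    for w in ("relational", "hopeful", "caresses", "awful"):
        print(w, rewrite(w), stem_with_dictionary(w, known_words))
```

With this sample dictionary, "awful" is rewritten to "aw" by the lexicon-free rules but kept as "awful" once the dictionary check is applied, the kind of error of commission the fallback is meant to catch.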
