University of Sheffield, NLP GATE: Bridging the Gap between Terminology and Linguistics


1 GATE: Bridging the Gap between Terminology and Linguistics
Diana Maynard
University of Sheffield, UK

2 Why do terminologists need GATE?
- Terminologists lack suitable tools to process their data:
  - Lots of in-house tools for doing individual things
  - Lack of common tools that can be used collaboratively and across different systems and domains
- Tools must be flexible, robust and able to adapt to different processing tasks and languages
- GATE and its components are a key tool in today's world of information and data overload
- They enable users to perform tasks such as document management, business intelligence, information retrieval, question answering, and knowledge indexing, modelling and conceptualisation

3 GATE can help terminologists:
- Save time and money on management of text and data from multiple sources
- Find hidden links scattered across huge volumes of diverse information
- Integrate structured data from a variety of sources
- Interlink text and data
- Collect information and extract new facts

4 A vision for text mining
- It is difficult to access unstructured information efficiently
- IE automates extraction of facts from text at reasonable accuracy and cost, increasing the value and utility of unstructured content
- Interlinking of text and data enables more efficient search, navigation and querying
- Text analysis is a matter of engineering: GATE offers practical solutions able to match specific requirements

5 Threat tracking application

6 Text mining and semantic annotation
- Extract structured data from text by:
  - Linking references to entities
  - Linking entities to their semantic descriptions
- Automatic semantic annotation based on IE technology
- Attaches metadata to documents, which can be used for searching and hyperlinking
- Adds value to content of libraries, enabling user interaction with content
- Enhanced capability for cross-referencing and dynamic document classification

7 Semantic Annotation

8 Semantic Annotation of Entities
- Recognition of the type of the entities in the text from a rich taxonomy of classes
- Reference to their semantic description
- Traditional NE recognition approach results in:
  <Person>Lama Ole Nydahl</Person>
- Semantic Annotation of NEs results in:
  <ReligiousPerson ID= > Lama Ole Nydahl </ReligiousPerson>


9 GATE: the Swiss Army Knife of NLP
- Has an attachment for almost every eventuality
- Some are hard to prise open
- Some are useful, but you might have to put up with a bit of clunkiness in practice
- Some will only be useful once in a lifetime, but you're glad to have them just in case
- There are many imitations, but nothing like the real thing

10 History of GATE
- early 1990s: you want me to write that all over again?
- : first GATE (and "large-scale IE") project
- 1996: GATE 1: Tcl/Tk, Perl, C++,
- : release of completely rewritten version 2, 100% Java
- 2009: mature ecosystem with established community
  - Tens of thousands of research users
  - 25,000 downloads per year
  - commercial users getting serious

11 GATE is very eco-friendly!

12 GATE commercial users
Typical commercial uses:
- dynamic search and indexing of repositories
- finding relations between elements in distributed repositories
- aggregating information from different text sources
- populating repositories
- fact finding from distributed knowledge sources
Typical users:
- Pharmaceutics, news, intelligence (business, competitor, government, etc.), manufacturing, telecommunications

21 So what exactly is GATE?
- An architecture: a macro-level organisational picture for HLT software systems
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture
- A development environment: for language engineers, computational linguists et al., a graphical development environment
- A community of users and contributors

22 Architectural principles
- Non-prescriptive, theory neutral (strength and weakness)
- Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Yale...)
- (Almost) everything is a component, and component sets are user-extendable
- (Almost) all operations are available both from API and GUI

23 In short
GATE includes:
- components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
- tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.
- various information extraction tools
- evaluation and benchmarking tools

24 Algorithms + Data + GUI = Applications
Every GATE component is one of three types:
- Language Resources (LRs), e.g. lexicons, corpora, ontologies
- Processing Resources (PRs), e.g. parsers, generators, taggers
- Visual Resources (VRs), i.e. visualisation and editing components
Algorithms are separated from the data, which means:
- the two can be developed independently by users with different expertise
- alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource

25 But isn't GATE just about IE?
- Many people think of GATE as an IE tool
- IE is its primary function, but it also does a lot more
- Pretty much any kind of linguistic processing can be done in GATE
- The only field we really don't cover is Machine Translation, but you could easily add components for that if you wanted
- More about the other functionality later, but now back to IE...

26 Two Approaches to IE
Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- obtain marginally better performance
- development could be very time consuming
- some changes may be hard to accommodate
Learning Systems
- use statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus

27 Named Entity Recognition
- Named Entity recognition is the cornerstone of IE
- Identification of proper names in texts, and classification into a set of predefined categories of interest
- Three universally accepted categories: person, location and organisation
- Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc.), addresses etc.
- Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.

28 ANNIE
- ANNIE is GATE's rule-based IE system
- It uses the language engineering approach (though we also have tools in GATE for ML)
- Distributed as part of GATE
- Uses a finite-state pattern-action rule language, JAPE (more on JAPE later...)
- ANNIE contains a reusable and easily extendable set of components:
  - generic preprocessing components for tokenisation, sentence splitting etc.
  - components for performing NE on general open domain text

29 ANNIE Modules

30 Unicode Tokeniser
- Bases tokenisation on Unicode character classes
- Language-independent tokenisation
- Declarative token specification language, e.g.:
  "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperinitial; kind=word
- Identifies words, numbers, spaces, different classes of punctuation, orthography
- Recognition is deliberately basic, so that more powerful tools (JAPE) can be used for finer distinctions; this also gives greater reuse possibilities
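As a rough illustration of tokenising by Unicode character class, here is a minimal Python sketch. It is not GATE's tokeniser or its rule language: the function names (`char_class`, `orthography`, `tokenise`) and the dictionary layout of a token are assumptions made for this example, and the class mapping is deliberately coarse.

```python
import unicodedata


def char_class(ch):
    """Map a character to a coarse token kind via its Unicode general category."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):          # any letter, in any script
        return "word"
    if cat == "Nd":                  # decimal digit
        return "number"
    if cat.startswith("Z") or ch.isspace():
        return "space"
    if cat.startswith("P"):
        return "punctuation"
    return "symbol"


def orthography(word):
    """Describe the capitalisation pattern of a word token."""
    if word.islower():
        return "lowercase"
    if word[0].isupper() and (len(word) == 1 or word[1:].islower()):
        return "upperInitial"
    if word.isupper():
        return "allCaps"
    return "mixedCaps"               # uncased scripts also fall through here


def tokenise(text):
    """Split text into tokens; letters, digits and spaces form runs,
    punctuation and symbols are kept as single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        kind = char_class(text[i])
        j = i + 1
        if kind in ("word", "number", "space"):
            while j < len(text) and char_class(text[j]) == kind:
                j += 1
        token = {"start": i, "end": j, "string": text[i:j], "kind": kind}
        if kind == "word":
            token["orthography"] = orthography(text[i:j])
        tokens.append(token)
        i = j
    return tokens


if __name__ == "__main__":
    for tok in tokenise("University of Sheffield, 2009"):
        print(tok)
```

Keeping the tokeniser this simple mirrors the design point on the slide: finer distinctions are left to later, more expressive stages.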

31 Gazetteer
- Set of lists compiled into Finite State Machines
- 60k entries in 80 types
- List entries are matched in the text as Lookup annotations
- Each list has some pre-defined features, which enable different kinds of matches to be identified
- Additional arbitrary features and values can be added to individual list entries
- Entries can be matched according to root forms, or more flexibly based on e.g. edit distance
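The idea of matching list entries in text as Lookup annotations can be sketched in a few lines of Python. This is an illustrative stand-in, not GATE's gazetteer: it uses a dictionary with greedy longest-match instead of compiled finite state machines, matches case-insensitively, and works on word indices instead of character offsets. The names `build_gazetteer` and `lookup` are hypothetical; `majorType` and `minorType` are the Lookup features used in the JAPE rules elsewhere in these slides.

```python
def build_gazetteer(lists):
    """Index entries by their lowercased word sequence.

    lists maps (majorType, minorType) to a list of entry strings.
    """
    index = {}
    for (major, minor), entries in lists.items():
        for entry in entries:
            key = tuple(entry.lower().split())
            index[key] = {"majorType": major, "minorType": minor}
    return index


def lookup(words, index):
    """Return Lookup annotations over a list of words (greedy longest match)."""
    max_len = max((len(key) for key in index), default=0)
    annotations = []
    i = 0
    while i < len(words):
        matched_end = None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in index:
                annotations.append(
                    {"type": "Lookup", "start": i, "end": i + n, "features": index[key]}
                )
                matched_end = i + n
                break
        i = matched_end if matched_end is not None else i + 1
    return annotations


if __name__ == "__main__":
    gazetteer = build_gazetteer({
        ("location", "city"): ["Sheffield", "New York"],
        ("change", "up"): ["up"],
    })
    for ann in lookup("Flights from New York to Sheffield".split(), gazetteer):
        print(ann)
```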

33 Limitations of gazetteers
- Gazetteer lists are designed for annotating simple, regular features
- Some flexibility is provided, but this is not enough for most tasks
- Recognising addresses using just a gazetteer would be impossible
- But combined with other linguistic pre-processing results, we have a whole lot of annotations and features
- POS tags, capitalisation, punctuation, lookup features, etc. can all be combined to form patterns suggesting more complex information
- Luckily, we have JAPE to take care of this

34 What is JAPE?
- a Jolly and Pleasant Experience
- Specially developed pattern matching language for GATE
- Each JAPE rule consists of:
  - LHS, which contains patterns to match
  - RHS, which details the annotations (and optionally features) to be created
- JAPE rules combine to create a phase
- Rule priority is based on pattern length, rule status and rule ordering
- Phases combine to create a grammar

35 Named Entity Grammars
- Hand-coded rules written in JAPE are applied to annotations to identify NEs
- Phases run sequentially and constitute a cascade of FSTs over annotations
- Annotations come from format analysis, tokeniser, splitter, POS tagger, morphological analysis, gazetteer etc.
- Because phases are sequential, annotations can be built up over a series of phases, as new information is gleaned
- Standard named entities: persons, locations, organisations, dates, addresses, money
- Basic NE grammars can be adapted for new applications, domains and languages

36 JAPE example
University of Sheffield

Rule: nameduniversity
(
  {Token.string == "University"}
  {Token.string == "of"}
  (
    {Lookup.minorType == city} |
    ({Token.category == NNP})+
  )
):orgname
-->
:orgname.organisation = {kind = "university", rule = "nameduniversity"}

Looks for the specific words "University of", followed by:
- a city name from the gazetteer, or
- one or more proper nouns
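To make the logic of the nameduniversity rule concrete, here is a small Python sketch that applies the same pattern to a list of token dictionaries. It is only an approximation of what a JAPE engine does: it assumes each Lookup covers exactly one token and is stored on the token as a `minorType` key, it takes the longer of the two alternatives when both match, and the function name `match_named_university` is hypothetical.

```python
def match_named_university(tokens):
    """Find 'University of <city | proper nouns>' spans in a token list.

    Each token is a dict with 'string' and 'category' (POS tag) keys, plus an
    optional 'minorType' key standing in for a gazetteer Lookup annotation.
    """
    results = []
    i = 0
    while i + 2 < len(tokens):
        if tokens[i]["string"] == "University" and tokens[i + 1]["string"] == "of":
            j = i + 2
            # Alternative 1: a single token marked as a city by the gazetteer.
            city_end = j + 1 if tokens[j].get("minorType") == "city" else j
            # Alternative 2: one or more proper nouns.
            nnp_end = j
            while nnp_end < len(tokens) and tokens[nnp_end].get("category") == "NNP":
                nnp_end += 1
            end = max(city_end, nnp_end)
            if end > j:
                results.append({
                    "type": "organisation",
                    "start": i,
                    "end": end,
                    "features": {"kind": "university", "rule": "nameduniversity"},
                })
                i = end
                continue
        i += 1
    return results


if __name__ == "__main__":
    sentence = [
        {"string": "The", "category": "DT"},
        {"string": "University", "category": "NNP"},
        {"string": "of", "category": "IN"},
        {"string": "Sheffield", "category": "NNP", "minorType": "city"},
        {"string": "is", "category": "VBZ"},
    ]
    print(match_named_university(sentence))
```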

37 Combining existing annotations
Associate a company with a share price, e.g. "Whitbread shares closed up 2p at 645p."

Phase: Shares
Input: Token Organization Lookup Money Percent
Options: control = appelt

Rule: ShareChange
(
  {Organization}
  ({Token})[0,3]
  {Lookup.majorType == "change"}
  ({Token})[0,3]
  ({Money} | {Percent})
):change
-->
:change.sharechange = {rule = "ShareChange"}

38 Orthomatcher
- Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown
- Matching rules are invoked between annotations of the same type, or between an existing annotation and an Unknown annotation
- The latter is the only case where an annotation type can be changed
- Lookup tables of aliases and exceptions (i.e. overriding of matching rules)
- Also PRs for pronominal and nominal coreference
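A toy version of orthographic coreference helps show how matching rules, aliases and exceptions interact. The sketch below is not the Orthomatcher's actual rule set: the single matching rule (a lone surname matches the last word of a longer name), the `TITLES` set, and the function names are all assumptions chosen for illustration.

```python
TITLES = {"mr", "mrs", "ms", "dr", "prof"}  # illustrative, not exhaustive


def normalise(name):
    """Lowercase a name, strip trailing full stops and drop titles."""
    words = [w.strip(".").lower() for w in name.split()]
    return [w for w in words if w and w not in TITLES]


def corefer(a, b, aliases=None, exceptions=None):
    """Decide whether two names of the same annotation type may corefer.

    aliases maps a name to another name it should always match;
    exceptions is a set of frozenset pairs that must never match.
    """
    aliases = aliases or {}
    exceptions = exceptions or set()
    if frozenset((a, b)) in exceptions:      # exceptions override every rule
        return False
    if aliases.get(a) == b or aliases.get(b) == a:
        return True
    x, y = normalise(a), normalise(b)
    if not x or not y:
        return False
    if x == y:
        return True
    shorter, longer = sorted((x, y), key=len)
    # A lone surname matches a longer name that ends with it.
    return len(shorter) == 1 and shorter[0] == longer[-1]


if __name__ == "__main__":
    print(corefer("Mr Brown", "James Brown"))
    print(corefer("James Brown", "Gordon Brown"))
```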

39 What about other languages?
- Since we're based in Sheffield, you can't blame us for developing GATE primarily for English
- But contrary to popular belief about the British, we don't hate all foreigners!
- And we have lots of capabilities for processing in other languages
- Currently systems for English, French, German, Romanian, Bulgarian, Russian, Cebuano, Hindi, Chinese, Arabic
- You have a POS tagger for Swahili? Just add it as a plugin and combine it with the existing tokeniser etc.

40 It's all Chinese to me...

41 Processing multiple languages
- If you have a language identifier PR, you can combine processing of texts in different languages in a single application
- The system will choose the right PRs for each document or document section
- A conditional application fires a PR only if some condition is met
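The conditional-application idea can be sketched as a pipeline of (condition, PR) pairs, where a language identifier runs first and later PRs fire only for matching documents. This Python sketch is a simplified model, not GATE's API: the document is a plain dictionary, the language identifier is a toy check for CJK characters, and all names (`identify_language`, `PIPELINE`, `run`) are hypothetical.

```python
def identify_language(doc):
    """Toy language identifier: any CJK ideograph means Chinese, else English."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in doc["text"])
    doc["features"]["lang"] = "zh" if has_cjk else "en"


def whitespace_tokeniser(doc):
    """Tokeniser for whitespace-delimited languages."""
    doc["tokens"] = doc["text"].split()


def character_tokeniser(doc):
    """Naive per-character tokeniser for text without word spacing."""
    doc["tokens"] = [ch for ch in doc["text"] if not ch.isspace()]


# Each entry pairs a condition on the document with the PR to run if it holds.
PIPELINE = [
    (lambda doc: True, identify_language),
    (lambda doc: doc["features"].get("lang") == "en", whitespace_tokeniser),
    (lambda doc: doc["features"].get("lang") == "zh", character_tokeniser),
]


def run(text, pipeline=PIPELINE):
    """Run every PR whose condition is met, in order, over a new document."""
    doc = {"text": text, "features": {}, "tokens": []}
    for condition, pr in pipeline:
        if condition(doc):
            pr(doc)
    return doc


if __name__ == "__main__":
    print(run("GATE is eco-friendly")["tokens"])
    print(run("我爱自然语言")["tokens"])
```

Because the conditions read features set by earlier PRs, adding support for another language only means appending one more (condition, PR) pair.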

42 Other plugins
- Parsers (Stanford, MiniPar, RASP, SUPPLE)
- More flexible gazetteers
- Specialised NE (Chemistry, Biomedicine, etc.)
- PRs for other languages, Alignment
- Lemmatisers, morphological analyser, NP and VP chunkers
- Machine Learning
- Evaluation toolkit including IAA
- IR, Google and Yahoo search engines, web crawlers
- WordNet
- Whole host of ontology-based tools

43 Alignment plugin

44 GATE in use We have dozens of applications, not all just research projects! A few examples...

45 Semantic Annotation Adding information to documents that is usable by machines to enable better presentation, navigation or searching, e.g. Perseus:

47 Indexing news at the BBC
- BBC Archives: 'Newsnight' archiving time is 8 hours per hour
- Automatic transcription to extract some potential indexing terms
  - Result: temporally precise, but very noisy data
- Partial solution: search the web, intranet, digital library for related pages, and process with IE/SA
  - Result: less noisy but temporally imprecise
- So we merge this information with the speech signal data
  - Result: works well for easy stuff (high precision, low recall)

49 Ontology linking at FAO
- FAO have sets of fisheries-related ontologies, e.g. gear, species, fishing areas
- There is no way to link between them using ontology alignment techniques, because we require information external to the ontology (e.g. that a fish lives in a particular area)
- NLP techniques make use of information from documents which provide this missing link
- There is not always an exact match between text and the ontology elements, e.g. Mummichogs vs. Fundulus heteroclitus
- Use techniques such as headword matching, noun phrase chunking, synonym and acronym finding, etc.
- Find relations in the text to link the entities together

50 Ontology linking at FAO
[Diagram: the ontology nodes Fishing Gear, Fishing Area, Species and Commodities, linked by the relations caught_by, found_in and basis_of]

51 Matching text descriptions
- Find NPs and terms; use OntoRootGazetteer to find morphological variants of ontology elements, perform headword and synonym matching etc.
  Example text: "Pelagic species, mainly fish and cephalopods, northern shrimp (also small crustaceans, krill)"
- Match text span to ontology instance, retaining URIs
- Create annotations and features, e.g.
  caught_by = {gear_type = midwater otter trawls, target_species = cephalopods}
- Convert to RDF triples
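The final step, turning a relation annotation into RDF, might look roughly like the following Python sketch. It is a sketch under stated assumptions: the namespace `http://example.org/fisheries#` is invented for this example, URIs are minted from labels (whereas the slide describes retaining the URIs of the matched ontology instances), the subject and object roles of the two features are assumed, and the output is a single N-Triples style line.

```python
BASE = "http://example.org/fisheries#"  # hypothetical namespace, not FAO's


def uri(label):
    """Mint a URI reference from a label by replacing spaces with underscores."""
    return "<" + BASE + label.strip().replace(" ", "_") + ">"


def annotation_to_triple(annotation, subject_feature, object_feature):
    """Serialise one relation annotation as an N-Triples style line.

    The annotation type is used as the predicate; the two named features
    supply the subject and object.
    """
    features = annotation["features"]
    subject = uri(features[subject_feature])
    predicate = uri(annotation["type"])
    obj = uri(features[object_feature])
    return f"{subject} {predicate} {obj} ."


if __name__ == "__main__":
    caught_by = {
        "type": "caught_by",
        "features": {
            "gear_type": "midwater otter trawls",
            "target_species": "cephalopods",
        },
    }
    print(annotation_to_triple(caught_by, "target_species", "gear_type"))
```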

53 Using ANNIC to view results

54 Outsmarting our competitors

55 If you can't beat 'em, join 'em
- UIMA
- OpenCalais
- LingPipe
All integrated into GATE as plugins

56 UIMA
- UIMA is an NL engineering platform developed by IBM
- Shares some functionality with GATE, but is complementary in most respects
- An interoperability layer has been developed to allow UIMA applications to be run within GATE, and vice versa, in order to combine elements of both
- UIMA's emphasis is on architectural support, including asynchronous scaleout (deploying many copies of an application in parallel)
- It provides a much narrower range of resources than GATE

57 OpenCalais
- Web service for semantic annotation of text
- The user submits a document to the web service, which returns entity and relation annotations in RDF, JSON or some other format
- Typically, users integrate OpenCalais annotation of their web pages to provide additional links and semantic functionality
- OpenCalais annotates both relations and entities, although the GATE plugin only supports entities

58 LingPipe
- Provides a set of IE and data mining tools, largely ML-based
- Has a set of models trained for particular tasks/corpora
- Limited ontology support: can connect entities found to databases and ontologies
- Advantage: ML models can suggest more than one output, ranked by confidence; the user can choose the number of suggestions generated
- Disadvantage: ML models only apply to specific tasks and domains

59 In summary...
- We like to think GATE is the best thing since sliced bread for most NLP and terminology tasks
- You can use it for plenty of other things too, don't let us stop you being creative!
- Incorporates a huge number of plugins, is easily extendable and highly customisable
- The only limit is your imagination...
- So if you're now convinced you can't live without GATE, there are two possibilities:
  - ask us to get involved with a project
  - try GATE yourself

60 Get your own hands dirty
- We run 3x yearly training courses in Sheffield and other selected locations
- Different tracks available
- GATE certification available

61 More info, contact details, demos, publications: Now it's time to nudge your neighbour if they are asleep... Or ask that burning question about GATE.
