Ontology-Guided Information Extraction from Pathology Reports: The SWPatho Project
David Schlangen, Universität Potsdam


Joint work with Manfred Stede, Elena Paslaru-Bontas, et al.

* Overview
Background of project
The task
The system
Digression: gently machine-aided ontology construction
Evaluation
Future Work

* Background of project
Charité Berlin: digital pathology
  retrieval of images (via textual description), statistics, quality control
FU Berlin: web technology
  reasoning with SW rule languages
Uni Potsdam: nice task
  use of ontologies in robust processing
[Diagram labels: Expert Knowledge, Service]

[Diagram labels: Service, Creation, supervises, supports]

Desiderata: retrieval of images (via textual description), statistics, quality control; indexing, NER, IE, annotation

* The task
Example document: 12456makro.xml
Identify concepts and relations.
[Diagram: concept nodes, including THING, linked by the relations "extracted_from" and "contains"; open items are marked "THING?" and "contains?"]

A few words about our corpus:
  very tersely formulated ("telegram style"), NP-heavy
  e.g., instead of "This is a lung with 10x20x30mm volume that contains some small traces of cancer cells" we would have "lung, 10x20x30mm, {with} traces of cancer cells"

Why elliptical? Because these are "answers" to obvious implicit questions:
  [...]: what do you see?
  microscopy: what do you see?
  critical [...]: what do you think this indicates?
(see (Schlangen & Lascarides, 2002; Schlangen 2003) on fragmental replies in dialogue)

* The system
Morphology:
  FS-based (weighted automata)
  ~[...] entries for nouns (German Dictionary Project)
  we added about [...] specific entries
  fairly deep analysis, decomposition of compounds, etc.

Morphology, example output:
  Gefäßanschnitte: 4 Analyse(n)
    Gefäß(N)#Anschnitt [NN Gender=masc Number=pl Case=acc]
    Gefäß(N)#Anschnitt [NN Gender=masc Number=pl Case=gen]
    Gefäß(N)#Anschnitt [NN Gender=masc Number=pl Case=nom]
    Gefäß(N)#Anschnitt [NN Gender=masc Number=sg Case=dat]
  mit: 3 Analyse(n)
    mit[adv]
    mit[appr]
    mit[ptkvz]
  leichter: 7 Analyse(n)
    leicht [ADJA Degree=pos Number=pl Case=gen Gender=* ADecl=strong]
    leicht [ADJA Degree=pos Number=sg Case=dat Gender=fem ADecl=strong]
    ...
    leicht [ADJC Degree=comp]
    leichter~n [VVIMP Number=sg]

POS-tagger / disambiguator:
  [...]-based
  trained on NEGRA corpus (newspaper text)
  identifies most likely path through analyses of [...]

Chunk parser:
  written in PROLOG
  simple chart parser
  "HPSG-inspired": feature geometry, feature principles
  produces a representation that encodes dependencies, e.g. for "mit nekrotisierenden Zellen.":
    <ep_ent type="" inst="tid28"/>
    <ep_ent type="zelle" inst="tid31"/>
    <ep_prop type="nekrotisier/vd" arg="tid31"/>
    <ep_prep type="mit" arg="tid28" arg2="tid31"/>
  uses some specific / constructions (e.g., for measure phrases, for handling certain idiomatic constructions, )
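The dependency representation shown for "mit nekrotisierenden Zellen" can be read back into entities, properties and relations with a few lines of Python. This is only a sketch of how a downstream module might consume such output. The `<chunk>` wrapper element and the function name `read_dependencies` are assumptions made for illustration; the four `ep_*` elements are copied from the slide, including the empty `type` attribute.

```python
import xml.etree.ElementTree as ET

# The four ep_* elements are taken from the slide. The <chunk> wrapper is an
# assumption, added only so that the fragment parses as a single XML document.
FRAGMENT = """
<chunk>
  <ep_ent type="" inst="tid28"/>
  <ep_ent type="zelle" inst="tid31"/>
  <ep_prop type="nekrotisier/vd" arg="tid31"/>
  <ep_prep type="mit" arg="tid28" arg2="tid31"/>
</chunk>
"""


def read_dependencies(xml_text):
    """Split a chunk-parser fragment into entities, properties and relations."""
    root = ET.fromstring(xml_text)
    # instance id -> lemma type (may be empty, as for tid28 on the slide)
    entities = {e.get("inst"): e.get("type") for e in root.iter("ep_ent")}
    # (property lemma, instance it modifies)
    properties = [(p.get("type"), p.get("arg")) for p in root.iter("ep_prop")]
    # (preposition, first argument, second argument)
    relations = [
        (r.get("type"), r.get("arg"), r.get("arg2")) for r in root.iter("ep_prep")
    ]
    return entities, properties, relations


entities, properties, relations = read_dependencies(FRAGMENT)
```

Here the preposition "mit" links the two entity instances, and the participle is attached as a property of the "zelle" instance, which is the dependency information the later lookup step works with.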

Lookup / Parse Disambiguation:
  connects lemmata to ontology: <ep_ent type="" inst="tid28" cid="
  disambiguation, main idea: use lookup success to distinguish between parses (the more that can be mapped, the better the parse / use the parse that "makes sense")

  foreach N: check in ontology; if unsuccessful: is it a compound noun? If yes, look up parts (e.g.: "nflügel" -> "", "Flügel", associated_with); if this is also unsuccessful, return T (owl:Thing).

  foreach ADJ (given N): look up ADJ & test whether N is in its [...]; if so, increase score for this parse.
    When can this disambiguate? Appositions!
    "Bronchusstück mit Entzündung, nekrotisierend" [piece of bronchus with inflammation, necrotizing]

  foreach P (given N1 and N2): look up frame for P, test whether N1 & N2 are of the right type; if so, increase score for this parse.
    When can this disambiguate? PP attachment ambiguity: N PP PP
    example: "mit" (with)
      has_part: Bronchus mit Alveolarzellen
      affected_by: Bronchus mit Entzündung (process)

Instantiator: connects individuals to document-related entities (sections of text, token IDs, etc.)
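The lookup and scoring steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's PROLOG implementation: the dictionaries `ONTOLOGY`, `CONCEPT_TYPE` and `PREP_FRAMES` are toy stand-ins, the type names "organ" and "cell" are invented for the example (only "process" appears on the slide), and all function names are hypothetical.

```python
THING = "owl:Thing"

# Toy stand-ins for the ontology (assumptions, not project data).
ONTOLOGY = {
    "Bronchus": "Bronchus",
    "Entzündung": "Entzündung",
    "Alveolarzelle": "Alveolarzelle",
    "Gefäß": "Gefäß",
}
CONCEPT_TYPE = {"Bronchus": "organ", "Alveolarzelle": "cell", "Entzündung": "process"}
# Frames for a preposition: (relation, required type of N1, required type of N2).
PREP_FRAMES = {
    "mit": [("has_part", "organ", "cell"), ("affected_by", "organ", "process")]
}


def lookup_noun(lemma, parts=()):
    """Look up a noun; fall back to its compound parts, then to owl:Thing."""
    if lemma in ONTOLOGY:
        return [ONTOLOGY[lemma]]
    found = [ONTOLOGY[p] for p in parts if p in ONTOLOGY]
    return found or [THING]


def prep_relation(prep, n1, n2):
    """Return the relation whose frame fits the types of n1 and n2, else None."""
    t1, t2 = CONCEPT_TYPE.get(n1), CONCEPT_TYPE.get(n2)
    for relation, want1, want2 in PREP_FRAMES.get(prep, []):
        if (t1, t2) == (want1, want2):
            return relation
    return None


def score_parse(parse):
    """Count how much of a parse can be mapped to the ontology."""
    score = 0
    for lemma, parts in parse["nouns"]:
        if lookup_noun(lemma, parts) != [THING]:
            score += 1
    for prep, n1, n2 in parse["preps"]:
        if prep_relation(prep, n1, n2) is not None:
            score += 1
    return score


def best_parse(parses):
    """Prefer the parse with the highest lookup score."""
    return max(parses, key=score_parse)


nouns = [("Bronchus", ()), ("Entzündung", ())]
parse_a = {"nouns": nouns, "preps": [("mit", "Bronchus", "Entzündung")]}
parse_b = {"nouns": nouns, "preps": [("mit", "Entzündung", "Bronchus")]}
chosen = best_parse([parse_a, parse_b])
```

In this toy setting `parse_a` wins because its "mit" attachment matches the `affected_by` frame, while `parse_b` matches no frame. The adjective check from the slide could be added to `score_parse` in the same way.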

* Digression: OntoSeed
"gently machine-aided ontology construction"
  term extraction via tf.idf (with a twist): [...] # hits google
  compound noun decomposition via simple clustering
  (ODBase 2005; WebS 2005)

* Evaluation
preliminary! modules are still being improved:
  grammar
  ontology
  frames for Ps

* Evaluation: morph, tag, parse
Morphology / POS-Tagger:
  accuracy: 93.7%
Chunk parser:
  avg. length of chunks: 2.78 tokens
  coverage: 68.2% of input
  chunks per gold NP: 1.61
  % of analyses that are correct structures: 88%
Lookup: nouns: (f-measure: 0.92)
[Chart labels: CIDs from Gold, partial match, full match]
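For readers unfamiliar with tf.idf, the term extraction step mentioned for OntoSeed can be illustrated with the plain textbook weighting. This sketch does not reproduce the "twist" mentioned on the slide, whose details are not recoverable here. The function name `tfidf_terms` and the toy token lists are assumptions.

```python
import math
from collections import Counter


def tfidf_terms(documents, top_n=3):
    """Rank the terms of each document by plain tf.idf.

    documents: list of token lists. Returns, per document, a list of up to
    top_n (term, score) pairs, highest score first.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    ranked = []
    for tokens in documents:
        if not tokens:
            ranked.append([])
            continue
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, ties broken alphabetically.
        ranked.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n])
    return ranked


# Toy reports (assumed tokens, already lemmatized).
reports = [["lunge", "tumor", "lunge"], ["lunge", "gewebe"]]
ranked = tfidf_terms(reports)
```

A term that occurs in every report, such as "lunge" in this toy input, gets a score of zero, so the ranking favors terms that are specific to individual reports.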

Lookup, coverage of ontology (nouns):
[Chart; values shown: 55%, 45%]

Lookup, "added value" (nouns):
[Chart; labels: found in ont, Thing w/ known prop, w/ any prop, just Thing, "Thing"; values shown: 18%, 45%, 31%, 6%]

Lookup: adjs:
[Chart; labels: from gold, from all ADJ]

Lookup, PP attachment & apposition attachment ambiguity:
[Charts; labels: onto. based, heuristics; no ambi, attach ambi; attach ambi, all cids known; some cids missing; values shown: 90,23 and 9,77; 90%, 6%, 4%]

* Conclusions
annotation / ontology population
tight integration with ontology:
  possible information gain through keeping unknown concepts in results (& as relata)
  in [...]: shows some promise (improvement over heuristics (but what about frequency info?))
  costly
  needs very detailed ontology

* Future Work
improve modules & ontology
notion of likelihood of reading
port to different [...] (tourism)
evaluation: does search actually outperform full text search?
user testing

*** The End! ***
Thank you for your attention!

Acknowledgments: funded by DFG; thanks to Bryan Jurish and Sebastian Maar for coding support
