Large-scale, Parallel Automatic Patent Annotation

Large-scale, Parallel Automatic Patent Annotation
Thomas Heitz & GATE Team
Computer Science Dept., NLP Group, Sheffield University
Patent Information Retrieval, October 2008

Overview: automatic patent annotation
Objectives:
Fully automatic method.
Scaling up without sacrificing computational performance or accuracy.
Methods:
Keyword-based queries: 10 degree, 20 degree Celsius, 18 F, etc.
Semantic-annotation-based queries: measurement.unit = degree Celsius, measurement.value = {10,30}; these will find the Fahrenheit equivalents as well.
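To illustrate why a semantic query can also find Fahrenheit equivalents, here is a minimal Python sketch. It assumes each measurement annotation carries a value and a unit, and that temperatures are normalised to degrees Celsius before comparison. The names `Measurement`, `to_celsius` and `matches_query` are hypothetical and not part of GATE, and reading {10,30} as the range 10 to 30 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A hypothetical measurement annotation: a scalar value plus a unit."""
    value: float
    unit: str  # "degree Celsius" or "degree Fahrenheit" in this sketch


def to_celsius(m: Measurement) -> float:
    """Normalise a temperature to degrees Celsius."""
    if m.unit == "degree Celsius":
        return m.value
    if m.unit == "degree Fahrenheit":
        return (m.value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported unit: {m.unit}")


def matches_query(m: Measurement, low_c: float, high_c: float) -> bool:
    """True if the measurement lies in [low_c, high_c] degrees Celsius."""
    return low_c <= to_celsius(m) <= high_c


annotations = [
    Measurement(20.0, "degree Celsius"),
    Measurement(68.0, "degree Fahrenheit"),  # equals 20 degrees Celsius
    Measurement(18.0, "degree Fahrenheit"),  # roughly -7.8 degrees Celsius
]
# Assumed reading of the query: unit = degree Celsius, value between 10 and 30.
hits = [m for m in annotations if matches_query(m, 10.0, 30.0)]
```

A keyword query for "20 degree Celsius" would miss the 68 F annotation, whereas the normalised comparison returns both it and the Celsius one.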

Overview: large-scale parallel information extraction
System characteristics:
Insufficient training data for learning, hence a rule-based system.
Robust and scalable, hence shallow IE (deep IE is used in PatExpert [16]).
Large volume of data, hence automatic and parallel processing.

Overview: results
Performance and quality:
Processed 1.3 million patents in 6 days with 12 parallel processes.
Strict precision and recall are greater than 90% for most annotations.
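The throughput implied by these figures can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the 12 processes ran continuously for the full 6 days and shared the load evenly, neither of which is stated on the slide.

```python
PATENTS = 1_300_000
DAYS = 6
PROCESSES = 12

wall_seconds = DAYS * 24 * 60 * 60           # 518,400 s of wall-clock time
patents_per_second = PATENTS / wall_seconds  # aggregate rate over all processes
# Average time one process spends on one patent, under the even-load assumption.
seconds_per_patent_per_process = wall_seconds * PROCESSES / PATENTS

print(f"{patents_per_second:.2f} patents/s overall")
print(f"{seconds_per_patent_per_process:.2f} s per patent per process")
```

Under those assumptions the system handles about 2.5 patents per second overall, or a little under 5 seconds per patent in each process.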

Contents
1 Task: patent annotation

1 Task: patent annotation
Contents: Patent data and structure, Section annotations, Reference annotations, Measurement annotations

Patent data and structure
Dataset from Matrixware:
American patents (USPTO): 1.3 million, 108 GB, average file size 85 KB.
European patents (EPO): 27 thousand, 780 MB, average file size 29 KB.
Structure in three main parts:
The first page containing bibliographical data and abstract, the description of the invention, the usage of the invention, the claim part and the bibliography part.

Figure: Section annotations (EPO)

Section annotations
Sections:
The BibliographicData, Abstract and Claims sections are pre-existing.
Heading annotations give the beginning of a section, if present.
Keywords are used to guess the section type.
There are about 20 section types.

Figure: Reference annotations (USPTO)

Reference annotations
References:
Claim, Example, Figure, Formula and Table references are quite straightforward, except for intervals like Fig. 1 to 3 and 5.
Patent references are a lot more difficult because of the variability of formats.
Literature references are harder still; for example, author names can take numerous formats: Warwel, S.; S. Warwel; Siegfried Warwel; etc.
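The interval case can be sketched with a regular expression in Python. This is not the JAPE grammar used in the system; the pattern, the list of clue words and the function names are assumptions chosen to show how "Fig. 1 to 3 and 5" expands to individual figure numbers.

```python
import re

# Hypothetical pattern: a clue word, then numbers joined by "to", "and" or commas.
REFERENCE = re.compile(
    r"\b(?P<kind>Figs?\.?|Figures?|Tables?|Claims?|Examples?)\s+"
    r"(?P<numbers>\d+(?:\s*(?:to|and|,)\s*\d+)*)",
    re.IGNORECASE,
)


def expand_numbers(text: str) -> list[int]:
    """Expand a number list such as '1 to 3 and 5' into [1, 2, 3, 5]."""
    result: list[int] = []
    for part in re.split(r"\s*(?:and|,)\s*", text):
        bounds = [int(n) for n in re.findall(r"\d+", part)]
        if len(bounds) == 2:   # an interval such as "1 to 3"
            result.extend(range(bounds[0], bounds[1] + 1))
        else:                  # a single number
            result.extend(bounds)
    return result


def find_references(text: str) -> list[tuple[str, list[int]]]:
    """Return (clue word, referenced numbers) for each reference found."""
    return [
        (m.group("kind"), expand_numbers(m.group("numbers")))
        for m in REFERENCE.finditer(text)
    ]
```

For example, `find_references("see Fig. 1 to 3 and 5 for details")` yields one reference with the numbers 1, 2, 3 and 5. Patent and literature references vary far more in format and would not be covered by a pattern this simple.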

Figure: Measurement annotations (EPO)

Measurement annotations
Measurements:
Most measurements comprise a scalarValue followed by a unit, e.g. 350 nm.
Two scalarValues, with or without a unit, can be contained in an interval, e.g. 150 to 350 nm.
A large number of measurement units exist, so we used an ontology populated from a database.
One-letter units are ambiguous.
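The two shapes described above, a scalar value with a unit and an interval of two values, can be sketched in Python as follows. The six-entry unit list is a tiny illustrative stand-in for the ontology-derived list, and the pattern and function names are assumptions rather than the system's actual rules.

```python
import re

UNITS = ["nm", "mm", "cm", "kg", "g", "m"]  # illustrative only
# Try longer units first so that "mm" is preferred over "m".
UNIT = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))
NUMBER = r"\d+(?:\.\d+)?"

MEASUREMENT = re.compile(
    rf"(?<![\w.])"                                    # do not start inside a word or number
    rf"(?:(?P<low>{NUMBER})\s*(?:{UNIT})?\s*to\s+)?"  # optional lower bound of an interval
    rf"(?P<value>{NUMBER})\s*(?P<unit>{UNIT})\b"      # scalar value followed by a unit
)


def find_measurements(text: str) -> list[tuple]:
    """Return ('scalar', value, unit) or ('interval', low, high, unit) tuples."""
    found = []
    for m in MEASUREMENT.finditer(text):
        value, unit = float(m.group("value")), m.group("unit")
        if m.group("low") is not None:
            found.append(("interval", float(m.group("low")), value, unit))
        else:
            found.append(("scalar", value, unit))
    return found
```

The one-letter units "g" and "m" show the ambiguity mentioned above: a pattern like this would also accept "5 m" where "m" is a variable name or an abbreviation rather than metres.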

Contents: GATE, Gazetteers, Rules, Application

GATE and ANNIE
GATE [5], the General Architecture for Text Engineering, is a framework providing support for a variety of language engineering tasks.
It includes a vanilla information extraction system, ANNIE.
The processing resources we use from ANNIE are a tokeniser, a completely customised gazetteer and finite state transduction grammars.

Gazetteers: reference and measurement unit gazetteers
The rules use clue words, such as Table followed by a number for table references.
We use gazetteers to annotate such clue words with all their inflections.
For references: 314 entries.
For measurement units: more than 30K entries (created automatically from a database).
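A gazetteer lookup of this kind can be sketched as a dictionary from inflected clue words to a reference type. The entries and the `lookup` function below are hypothetical and only illustrate the idea; they are not taken from the 314-entry reference gazetteer or from GATE's gazetteer implementation.

```python
# Hypothetical gazetteer: each inflected clue word maps to a reference type.
GAZETTEER = {
    "table": "Table", "tables": "Table",
    "figure": "Figure", "figures": "Figure", "fig.": "Figure", "figs.": "Figure",
    "claim": "Claim", "claims": "Claim",
}


def lookup(tokens: list[str]) -> list[tuple[int, str, str]]:
    """Return (index, token, type) for every token listed in the gazetteer."""
    return [
        (i, tok, GAZETTEER[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in GAZETTEER
    ]


tokens = "as shown in Table 2 and Figs. 3".split()
matches = lookup(tokens)
```

A rule can then look for a gazetteer match followed by a number, instead of listing every spelling of the clue word itself.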

Annotation rules: GATE JAPE
We use GATE JAPE rules, each of which consists of two parts: a left-hand side (LHS) and a right-hand side (RHS).
The LHS consists of an annotation pattern to be matched in the text.
The RHS declares the action to take when the pattern specified in the LHS is found in the document.

Annotation rules: finding a measurement
E.g. 350 nm.
In total, 30 rules are used for measurements.
Measurement annotation rule:

    Rule: Measurement
    (
      // LHS
      {Number} {Unit}
    ):match
    -->
      // RHS
      :match.measurement = {}

Annotation rules: finding a literature reference
E.g. see: Peacock, R. D. The Chemistry of Technetium and Rhenium Elsevier: Amsterdam, ...
... rules are used for references.
Literature annotation rule:

    Rule: Literature
    (
      // LHS
      {LiteratureContext}
      (
        {LiteratureStart} {LiteratureEnd}
      ):match
    ):match-with-context
    -->
      // RHS
      :match.literature = {}

Application pipeline

    Phase  GATE processing resource
    1      Section Finder
    2      English Tokeniser
    3      Patent-specific gazetteers
    4      Reference Finder
    5      Measurements Finder

Contents: Setup, Optimisation, Performance

Setup
Large Data Collider (LDC):
Our experiments were carried out on the IRF's LDC with Java (jrockit-r jdk ) with up to 12 processes.
The LDC is an SGI Altix 4700 system comprising 20 nodes, each with four 1.4 GHz Itanium cores and 18 GB RAM.
In comparison, we found processing 4x faster on an Intel Core 2 at 2.4 GHz.
Specific applications:
GATE batch mode: dispatches the files to process across several GATE applications and does not stop on error.
GATE benchmarking: generates time stamps for each resource and displays charts from them.
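The batch-mode behaviour described above (dispatch files to several workers, keep going when one file fails) can be sketched in Python as follows. This is not the GATE batch application: `annotate` is a placeholder for the real pipeline, the function names are hypothetical, and threads stand in for the separate processes to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def annotate(path: str) -> str:
    """Placeholder for running the annotation pipeline on one file."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"annotated {path}"


def run_batch(paths, workers=12):
    """Process every file; record failures instead of aborting the batch."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(annotate, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                done[path] = future.result()
            except Exception as exc:  # keep going on any per-file error
                failed[path] = str(exc)
    return done, failed


done, failed = run_batch(["a.xml", "b.bad", "c.xml"], workers=2)
```

Collecting failures in a separate dictionary means a single malformed patent does not stop a multi-day run, and the failed files can be inspected or retried afterwards.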

Optimisation: benchmarking and refactoring
Benchmarking of each processing resource.
Removal of unnecessary resources, such as the ANNIE morphological analyser and named entity recognition, to keep only the tokeniser.
Optimisation of the JAPE rules where the benchmarking detects abnormal execution times.
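Per-resource benchmarking of this kind can be sketched in Python by timing each pipeline stage separately and summing over documents. The two stages below are trivial stand-ins, not GATE resources, and `run_with_timings` is a hypothetical name; GATE's own benchmarking tool is not shown here.

```python
import time
from collections import defaultdict


def tokenise(doc):
    """Stand-in tokeniser: split the text on whitespace."""
    doc["tokens"] = doc["text"].split()


def find_numbers(doc):
    """Stand-in finder: collect the tokens made only of digits."""
    doc["numbers"] = [t for t in doc["tokens"] if t.isdigit()]


PIPELINE = [("tokeniser", tokenise), ("number finder", find_numbers)]


def run_with_timings(docs, pipeline=PIPELINE):
    """Run every resource on every document and sum the time spent in each."""
    timings = defaultdict(float)
    for doc in docs:
        for name, resource in pipeline:
            start = time.perf_counter()
            resource(doc)
            timings[name] += time.perf_counter() - start
    return dict(timings)


docs = [{"text": "a layer of 350 nm"}, {"text": "heated for 10 minutes"}]
timings = run_with_timings(docs)
# List the resources from slowest to fastest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds:.6f} s")
```

Sorting by accumulated time points directly at the resource, or the rule set, that deserves attention first.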

Figure: Performance, baseline vs. optimized

Contents: Patent Gold Standard, Evaluation on the Patent Gold Standard

Patent Gold Standard: creation of the Gold Standard
Selection of patents from two very different fields: mechanical engineering and biomedical technology.
Manual annotation of USPTO and EPO patents by more than 10 people, with several annotators for each patent.
In total: 51 patents, 2.5 million characters.

Statistics on Gold Standard

    Annotation type           USPTO  EPO
    Section.Abstract
    S.BackgroundArt
    S.BestMode                2      5
    S.BibliographicData
    S.Bibliography            0      8
    S.Claims                  23     0
    S.CrossReferenceToR.A.    6      1
    S.DetailedDescription
    S.DisclosureOfInvention   3      6
    S.DrawingDescription
    S.Effects                 1      2
    S.Examples
    S.PreferredEmbodiment     10     7
    S.PriorArt                4      6
    S.Sponsorship             2      0
    S.SummaryOfTheInvent
    S.TechnicalField
    S.UsageOfInvention        1      6
    Annotations/Doc

    Annotation type           USPTO  EPO
    Reference.Claim
    R.Example
    R.Figure
    R.Formula
    R.Literature
    R.Patent
    R.Table
    Annotations/Doc
    M.scalarValue
    Measurement.unit
    M.interval
    Annotations/Doc

Results on Gold Standard: micro-averaged precision (P.), recall (R.) and F1, for USPTO and EPO

    Annotation type
    S.BackgroundArt
    S.DrawingDescr
    Section.Examples
    S.SummaryOf
    S.TechnicalField
    Reference.Claim
    R.Example
    R.Figure
    R.Formula
    R.Literature
    R.Patent
    R.Table
    M.scalarValue
    Measurement.unit
    M.interval
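For reference, micro-averaged precision, recall and F1 pool the counts over all documents before computing the ratios. The Python sketch below assumes strict matching means that a predicted annotation counts as correct only if its span and type are identical to a gold annotation; the function name and the tuple layout are hypothetical.

```python
def micro_prf(documents):
    """documents: iterable of (gold, predicted) sets of (start, end, type)."""
    tp = fp = fn = 0
    for gold, predicted in documents:
        tp += len(gold & predicted)  # strict: identical span and type
        fp += len(predicted - gold)  # predicted but not in the gold standard
        fn += len(gold - predicted)  # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


docs = [
    # One exact match, one prediction whose span is off by one character.
    ({(0, 5, "M"), (10, 15, "M")}, {(0, 5, "M"), (10, 14, "M")}),
    # One exact match.
    ({(3, 8, "M")}, {(3, 8, "M")}),
]
precision, recall, f1 = micro_prf(docs)
```

In this toy input the pooled counts are 2 true positives, 1 false positive and 1 false negative, so all three scores come out at 2/3. Under strict matching the near miss counts as both a false positive and a false negative.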

Figure: Section annotation: Examples (EPO)

Figure: Reference annotation: Literature (USPTO)

Figure: Measurement annotation: interval (EPO)

Conclusion
Conclusion:
Fully automatic method that scales up (a million patents, 100 GB).
Quality close to that of human annotators.
Perspectives:
Machine learning from annotated patents.
Semantic queries with a Patent Ontology.
