jprocessing Documentation


Release 0.1
Pulkit Kathuria
Sep 17, 2017



Contents

1 Japanese NLP Library

1.1 Requirements
* Links
* Install
* History

1.2 Libraries and Modules
* Tokenize jtokenize.py
* Cabocha jcabocha.py
* Kanji / Katakana / Hiragana to Tokenized Romaji jconvert.py
* Longest Common String Japanese jprocessing.py
* Similarity between two sentences jprocessing.py

1.3 Edict Japanese Dictionary Search with Example sentences
* Sample Output Demo
* Edict dictionary and example sentences parser
* Charset
* Links
* edict_search.py
* edict_examples.py

1.4 Sentiment Analysis Japanese Text
* Wordnet files download links
* How to Use
* Japanese Word Polarity Score

1.5 Contacts


Requirements

Third Party Dependencies
* Cabocha Japanese Morphological parser

Python Dependencies
* Python 2.6.* or above

Links
* All code at jprocessing Repo GitHub
* Documentation and HomePage and Sphinx
* PyPi Python Package

    git clone git@github.com:kevincobain2000/jprocessing.git

Install

In Terminal

    bash$ python setup.py install

History
* 0.2
  * Sentiment Analysis of Japanese Text

* Morphologically Tokenize Japanese Sentence
* Kanji / Hiragana / Katakana to Romaji Converter
* Edict Dictionary Search - borrowed
* Edict Examples Search - incomplete
* Sentence Similarity between two JP Sentences
* Run Cabocha (ISO configured) in Python
* Longest Common String between Sentences
* Kanji to Katakana Pronunciation
* Hiragana, Katakana Chart Parser

Libraries and Modules

Tokenize jtokenize.py

In Python

    >>> from jnlp.jtokenize import jtokenize
    >>> input_sentence = u''
    >>> list_of_tokens = jtokenize(input_sentence)
    >>> print list_of_tokens
    >>> print '--'.join(list_of_tokens).encode('utf-8')

Returns:

    ... [u'\u79c1', u'\u306f', u'\u5f7c', u'\u3092', u'\uff15'...]

Katakana Pronunciation:

    >>> print '--'.join(jreads(input_sentence)).encode('utf-8')

Cabocha jcabocha.py

Run Cabocha with its original EUCJP or ISO configured encoding, from utf8 Python.

If Cabocha itself is configured as utf8, refer to Cabocha's own documentation.

    >>> from jnlp.jcabocha import cabocha
    >>> print cabocha(input_sentence).encode('utf-8')

Output:

    <sentence>
      <chunk id="0" link="8" rel="d" score=" " head="0" func="1">
        <tok id="0" read="" base="" pos="--" ctype="" cform="" ne="o"></tok>
        <tok id="1" read="" base="" pos="-" ctype="" cform="" ne="o"></tok>
      </chunk>
      <chunk id="1" link="2" rel="d" score=" " head="2" func="3">
        <tok id="2" read="" base="" pos="--" ctype="" cform="" ne="o"></tok>
        <tok id="3" read="" base="" pos="--" ctype="" cform="" ne="o"></tok>
      </chunk>
      <chunk id="2" link="8" rel="d" score=" " head="6" func="6">
        <tok id="4" read="" base="" pos="-" ctype="" cform="" ne="b-date"></tok>
        <tok id="5" read="" base="" pos="--" ctype="" cform="" ne="i-date"></tok>
        <tok id="6" read="" base="" pos="-" ctype="" cform="" ne="i-date"></tok>
        <tok id="7" read="" base="" pos="-" ctype="" cform="" ne="o"></tok>
      </chunk>

Kanji / Katakana / Hiragana to Tokenized Romaji jconvert.py

Uses data/katakanachart.txt and parses the chart. See katakanachart.

    >>> from jnlp.jconvert import *
    >>> input_sentence = u''
    >>> print ' '.join(tokenizedromaji(input_sentence))
    >>> print tokenizedromaji(input_sentence)
    ...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun hapyou si ta tenki gaikyou ni yoru to
    ...[u'kisyoutyou', u'ga', u'ni', u'ichi', u'nichi', u'gozen',...]

katakanachart.txt: katakanachartfile and hiraganachartfile

Longest Common String Japanese jprocessing.py

On English Strings

    >>> from jnlp.jprocessing import long_substr
    >>> a = 'Once upon a time in Italy'
    >>> b = 'There was a time in America'
    >>> print long_substr(a, b)

Output

    ...a time in

On Japanese Strings

    >>> a = u''
    >>> b = u''
    >>> print long_substr(a, b).encode('utf-8')

Output
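To make the idea behind long_substr concrete, here is a minimal Python 3 sketch of a longest-common-substring search using dynamic programming. It is an illustration only: the function name longest_common_substring is hypothetical, and this is not necessarily how jnlp.jprocessing implements long_substr. Because Python 3 strings are sequences of Unicode code points, the same code handles English and Japanese text; the Japanese sentences below are made-up examples, not the ones from the original documentation.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b.

    Hypothetical helper for illustration; not the library's long_substr.
    """
    best_len = 0   # length of the best match found so far
    best_end = 0   # index in `a` just past the end of the best match
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous characters
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# English strings from the example above
print(repr(longest_common_substring('Once upon a time in Italy',
                                    'There was a time in America')))

# Illustrative Japanese strings (assumed, not from the original text)
print(longest_common_substring('私は東京に住んでいます', '彼は東京に住んでいました'))
```

The match includes the surrounding spaces in the English case, so callers who want the bare phrase may strip the result.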

Similarity between two sentences jprocessing.py

Uses MinHash by checking the overlap.

English Strings

    >>> from jnlp.jprocessing import Similarities
    >>> s = Similarities()
    >>> a = 'There was'
    >>> b = 'There is'
    >>> print s.minhash(a,b)

Japanese Strings

    >>> from jnlp.jprocessing import *
    >>> a = u''
    >>> b = u''
    >>> print s.minhash(' '.join(jtokenize(a)), ' '.join(jtokenize(b)))
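The documentation says only that the similarity score uses MinHash over the token overlap, so the following Python 3 sketch shows a generic MinHash estimate of Jaccard similarity rather than the library's actual code. The names minhash_signature and minhash_similarity, the use of MD5 as the seeded hash, and the default of 64 hash functions are all assumptions made for illustration. Input is expected to be whitespace-separated tokens, which matches the way the Japanese example above joins the output of jtokenize with spaces.

```python
import hashlib


def _seeded_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed (assumed scheme)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes: int = 64):
    """One minimum hash value per seeded hash function."""
    unique = set(tokens)
    if not unique:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(t, seed) for t in unique)
            for seed in range(num_hashes)]


def minhash_similarity(a: str, b: str, num_hashes: int = 64) -> float:
    """Estimate the Jaccard similarity of the token sets of a and b.

    The fraction of signature positions that agree approximates
    |A & B| / |A | B|.
    """
    sig_a = minhash_signature(a.split(), num_hashes)
    sig_b = minhash_signature(b.split(), num_hashes)
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / num_hashes


print(minhash_similarity('There was', 'There is'))
```

More hash functions give a tighter estimate at the cost of more hashing work; the estimate is exact only in the limit.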


Edict Japanese Dictionary Search with Example sentences

Sample Output Demo

Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.

* Edict parser by Paul Goins, see edict_search.py
* Edict example sentences, parsed by query, by Pulkit Kathuria, see edict_examples.py
* Edict examples pickle files are provided, but the latest example files can be downloaded from the links provided.

Charset

Two files:

* utf8 Charset example file, if not using src/jnlp/data/edict_examples

  To convert EUCJP/ISO to utf8:

      iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8

* ISO edict_dictionary file

Outputs example sentences for a query in Japanese, only for ambiguous words.

Links

Latest Dictionary files can be downloaded here
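Where iconv is not available, the EUCJP to utf8 conversion described under Charset can be approximated in Python 3. This is a small sketch under stated assumptions: the function name convert_to_utf8 is hypothetical, the paths are placeholders exactly as in the iconv command, and the source file is assumed to be valid EUC-JP (Python's euc_jp codec).

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="euc_jp"):
    """Re-encode a text file to UTF-8 (hypothetical helper).

    Raises UnicodeDecodeError if the file is not valid in source_encoding,
    which is a useful signal that the charset assumption is wrong.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


# Usage with placeholder paths, mirroring the iconv example:
# convert_to_utf8("path/to/edict_examples", "path/to/save_with_utf-8")
```

Reading the whole file into memory is fine for dictionary-sized inputs; a streaming copy would be preferable for very large files.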

edict_search.py

Author: Paul Goins (License included)
Link to original:

For all entries of sense definitions

    >>> from jnlp.edict_search import *
    >>> query = u''
    >>> edict_path = 'src/jnlp/data/edict-yy-mm-dd'
    >>> kp = Parser(edict_path)
    >>> for i, entry in enumerate(kp.search(query)):
    ...     print entry.to_string().encode('utf-8')

edict_examples.py

Note: Only outputs the example sentences for ambiguous words (if the word has one or more senses).

Author: Pulkit Kathuria

    >>> from jnlp.edict_examples import *
    >>> query = u''
    >>> edict_path = 'src/jnlp/data/edict-yy-mm-dd'
    >>> edict_examples_path = 'src/jnlp/data/edict_examples'
    >>> search_with_example(edict_path, edict_examples_path, query)

Output

    Sense (1) to recognize;
      EX:01 **We appreciate his talent.

    Sense (2) to observe;
      EX:01 **We have detected an abnormality on your x-ray.

    Sense (3) to admit;
      EX:01 **Mother approved my plan.
      EX:02 **Mother will never approve of my marriage.
      EX:03 **Father will never approve of my marriage.
      EX:04 **He doesn't approve of women smoking
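For readers who have not seen the dictionary file that the parser reads, the following Python 3 sketch splits a single EDICT-style line into headword, reading, and glosses. It assumes the classic one-entry-per-line layout, `KANJI [KANA] /gloss/gloss/.../` (or `KANA /gloss/.../` for kana-only entries). The function name parse_edict_line and the sample lines are illustrative assumptions; this is not the code in edict_search.py.

```python
import re

# headword, optional [reading], then slash-delimited glosses
ENTRY_RE = re.compile(
    r"^(?P<head>\S+)\s+(?:\[(?P<reading>[^\]]+)\]\s+)?/(?P<glosses>.*)/\s*$"
)


def parse_edict_line(line: str):
    """Parse one EDICT-style line into a dict, or return None if it does not match.

    Hypothetical helper; assumes the classic EDICT line layout.
    """
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    glosses = [g for g in match.group("glosses").split("/") if g]
    return {
        "headword": match.group("head"),
        "reading": match.group("reading"),  # None for kana-only entries
        "glosses": glosses,
    }


# Made-up sample line in the assumed format
print(parse_edict_line("認める [みとめる] /(v1,vt) to recognize/to admit/"))
```

Part-of-speech markers such as `(v1,vt)` stay attached to the first gloss here; a fuller parser would split them out and group glosses by sense number.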

Sentiment Analysis Japanese Text

This section covers (1) Sentiment Analysis on Japanese text using Word Sense Disambiguation, Wordnet-jp (Japanese Word Net file name wnjpn-all.tab), and SentiWordnet (English SentiWordNet file name SentiWordNet_3.*.txt).

Wordnet files download links

How to Use

The following classifier is a baseline. It maps English to Japanese using Wordnet and classifies on the polarity score from SentiWordnet.

* (Adnouns, nouns, verbs,.. all included)
* No WSD module on Japanese Sentence
* Uses word as its common sense for polarity score

    >>> from jnlp.jsentiments import *
    >>> jp_wn = '../../../../data/wnjpn-all.tab'
    >>> en_swn = '../../../../data/sentiwordnet_3.0.0_ txt'
    >>> classifier = Sentiment()
    >>> classifier.train(en_swn, jp_wn)
    >>> text = u''
    >>> print classifier.baseline(text)
    ...Pos Score = Neg Score = Text is Positive
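The baseline described above can be pictured as a two-step lookup followed by a sum. The Python 3 sketch below is one possible reading of that description, not the code in jnlp.jsentiments: the function name baseline_polarity, the toy dictionaries, their keys, and the rule that ties count as Positive are all assumptions. The dictionary shapes (word to synset key, synset key to a positive/negative pair) follow the lookups shown in this chapter.

```python
def baseline_polarity(tokens, jp_wordnet, senti_scores):
    """Sum positive and negative scores over tokens and pick the larger.

    jp_wordnet:   Japanese word -> synset key
    senti_scores: synset key -> (positive score, negative score)
    Tokens missing from either dictionary contribute nothing.
    """
    pos_total = 0.0
    neg_total = 0.0
    for token in tokens:
        synset = jp_wordnet.get(token)
        if synset is None:
            continue
        pos, neg = senti_scores.get(synset, (0.0, 0.0))
        pos_total += pos
        neg_total += neg
    # Assumption: ties are reported as Positive
    label = "Positive" if pos_total >= neg_total else "Negative"
    return pos_total, neg_total, label


# Toy stand-ins, NOT real WordNet or SentiWordNet data
toy_jp_wordnet = {"良い": "toy-good", "悪い": "toy-bad"}
toy_senti_scores = {"toy-good": (0.75, 0.0), "toy-bad": (0.0, 0.625)}

tokens = ["これ", "は", "良い"]  # already tokenized, e.g. by jtokenize
print(baseline_polarity(tokens, toy_jp_wordnet, toy_senti_scores))
```

Because no word sense disambiguation is applied, every occurrence of a word receives the same score regardless of context, which is the main limitation the text attributes to this baseline.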

Japanese Word Polarity Score

    >>> from jnlp.jsentiments import *
    >>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
    >>> en_swn = '_dicts/sentiwordnet_3.0.0_ txt' #Path to SentiWordNet
    >>> classifier = Sentiment()
    >>> sentiwordnet, jpwordnet = classifier.train(en_swn, jp_wn)
    >>> positive_score = sentiwordnet[jpwordnet[u'']][0]
    >>> negative_score = sentiwordnet[jpwordnet[u'']][1]
    >>> print 'pos score = {0}, neg score = {1}'.format(positive_score, negative_score)
    ...pos score = 0.625, neg score =
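The two dictionaries returned by train are indexed as word to synset key and synset key to score pair. The Python 3 sketch below shows one way such dictionaries could be built from the two files, under loudly stated assumptions: SentiWordNet rows are taken to be tab-separated as POS, ID, PosScore, NegScore, terms, gloss, with comment lines starting with `#`; Japanese WordNet rows are taken to be tab-separated as synset (offset-pos), word, source; and the first synset listed for a word is used as a stand-in for its "common sense". The function names and the sample rows are hypothetical, and this is not the library's train method.

```python
def load_sentiwordnet(lines):
    """Map 'offset-pos' synset keys to (positive, negative) score pairs."""
    scores = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        pos_tag, offset, pos_score, neg_score = fields[:4]
        scores[f"{offset}-{pos_tag}"] = (float(pos_score), float(neg_score))
    return scores


def load_japanese_wordnet(lines):
    """Map each Japanese word to the first synset key listed for it."""
    word_to_synset = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        synset, word = fields[0], fields[1]
        # Assumption: keep only the first synset seen for a word
        word_to_synset.setdefault(word, synset)
    return word_to_synset


# Made-up rows in the assumed formats, NOT real data
swn_rows = [
    "# POS\tID\tPosScore\tNegScore\tSynsetTerms\tGloss\n",
    "a\t00000001\t0.625\t0\texample#1\ta made-up gloss\n",
]
wn_rows = ["00000001-a\t例\thand\n"]

sentiwordnet = load_sentiwordnet(swn_rows)
jpwordnet = load_japanese_wordnet(wn_rows)
positive_score, negative_score = sentiwordnet[jpwordnet["例"]]
print("pos score = {0}, neg score = {1}".format(positive_score, negative_score))
```

Both loaders accept any iterable of lines, so an open file object can be passed directly in place of the in-memory lists used here.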

Contacts

Author: pulkit[at]jaist.ac.jp (replace [at] with @)
