Representing Documents; Unit Testing II

Similar documents
A Simple Search Engine

Software Carpentry. Nicola Chiapolini. Physik-Institut University of Zurich. June 16, 2015

CSE : Python Programming. Decorators. Announcements. The decorator pattern. The decorator pattern. The decorator pattern

Agile development. Agile development work cycle. Reacting to bugs

CSE : Python Programming. Packages (Tutorial, Section 6.4) Announcements. Today. Packages: Concretely. Packages: Overview

CSC148 Intro. to Computer Science

Unit Testing and JUnit

Lotus IT Hub. Module-1: Python Foundation (Mandatory)

Requests Mock Documentation

Test-Driven Data Science: Writing Unit Tests for SASPy Python Data Processes

Testing My Patience. Automated Testing for Python Projects. Adam Lowry Portland Python Users Group

PYTHON CONTENT NOTE: Almost every task is explained with an example

Unit Tests. # Store the result of a boolean expression in a variable. >>> result = str(5)=='5'

CIS 121 Data Structures and Algorithms with Java Spring 2018

Intermediate Python 3.x

Programming Technologies for Web Resource Mining

Security Regression. Addressing Security Regression by Unit Testing. Christopher

CS Programming Languages: Python

PTN-202: Advanced Python Programming Course Description. Course Outline

CSCA08 Winter Week 12: Exceptions & Testing. Marzieh Ahmadzadeh, Brian Harrington University of Toronto Scarborough

REST in a Nutshell: A Mini Guide for Python Developers

Python Mock Tutorial Documentation

Exceptions & a Taste of Declarative Programming in SQL

JSONRPC Documentation

Abstract Data Types Chapter 1

A bit more on Testing

20.5. urllib Open arbitrary resources by URL

CNRS ANF PYTHON Objects everywhere

Assignment 6 Gale Shapley Python implementation Submit zip file to Canvas by 11:59 Saturday, September 23.

CSSE 220 Day 3. Check out UnitTesting and WordGames from SVN

Programming in Python 3

DSC 201: Data Analysis & Visualization

CIS 192: Lecture 9 Working with APIs

Babu Madhav Institute of Information Technology, UTU 2015

Unit 2: Python extras

Making Programs Fail. Andreas Zeller

Introduction to Python programming, II

About Python. Python Duration. Training Objectives. Training Pre - Requisites & Who Should Learn Python

Python: Short Overview and Recap

Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore

Continuous Integration INRIA

Specter Documentation

Python Essential Reference, Second Edition - Chapter 5: Control Flow Page 1 of 8

Assignment 5: Testing and Debugging

tapi Documentation Release 0.1 Jimmy John

doubles Documentation

a Very Short Introduction to AngularJS

CS 301: Recursion. The Art of Self Reference. Tyler Caraza-Harter

Packtools Documentation

Mastering Python Decorators

The WSGI Reference Library

GIS 4653/5653: Spatial Programming and GIS. More Python: Statements, Types, Functions, Modules, Classes

61A Lecture 25. Friday, October 28

Family Map Server Specification

Splitting the pattern into the model (this stores and manipulates the data and executes all business rules).

Using the YANG Development Kit (YDK) with Cisco IOS XE

NLTK Tutorial: Basics

CSC326 Meta Programming i. CSC326 Meta Programming

CMPSCI 187 / Spring 2015 Hangman

yarl Documentation Release Andrew Svetlov

Introduction to Python programming, II


OR, you can download the file nltk_data.zip from the class web site, using a URL given in class.

CIS192 Python Programming

Lecture #12: Quick: Exceptions and SQL

python-aspectlib Release 0.4.1

Quiz. a 1 a 2 a 3 a 4. v 1 v 2 v 3. Let a 1 = [1, 0, 1], a 2 = [2, 1, 0], a 3 = [10, 1, 2], a 4 = [0, 0, 1]. Compute the row-matrix-by-vector product

UNIT -II. Language-History and Versions Introduction JavaScript in Perspective-

Reladomo Test Resource

Rapid Development with Django and App Engine. Guido van Rossum May 28, 2008

flask-jwt Documentation

API Wrapper Documentation

Lecture 9: GUI and debugging

magneto Documentation

Unit testing with pytest and nose 1

Unit Testing and Python. Pat Pannuto / Marcus Darden

Javadoc. Computer Science and Engineering College of Engineering The Ohio State University. Lecture 7

Objects and Classes. Chapter 8

61A Lecture 2. Friday, August 28, 2015

django-crucrudile Documentation

GitHub-Flask Documentation

Django Synctool Documentation

Testing with JUnit 1

Introduction to programming Lecture 5

Analyzing Tweets: Introducing Text Analysis Entities, parsing, regular expressions, frequency analysis

Abstract Data Types. CS 234, Fall Types, Data Types Abstraction Abstract Data Types Preconditions, Postconditions ADT Examples

Python Unit Testing and CI Workflow. Tim Scott, Python Meetup 11/07/17

Python INTRODUCTION: Understanding the Open source Installation of python in Linux/windows. Understanding Interpreters * ipython.

flask-jwt-simple Documentation

Introduction to Python

PTN-105 Python programming. Course Outline. Prerequisite: basic Linux/UNIX and programming skills. Delivery Method: Instructor-led training (ILT)

sinon Documentation Release Kir Chou

CIS 121 Data Structures and Algorithms with Java Spring 2018

Object Oriented Programming. Week 1 Part 3 Writing Java with Eclipse and JUnit

TAIL RECURSION, SCOPE, AND PROJECT 4 11

Connexion Documentation

THE LAUNCHER. Patcher, updater, launcher for Unity. Documentation file. - assetstore.unity.com/publishers/19358

Comp151 Lab Documentation using Doxygen

SeU Certified Selenium Engineer (CSE) Syllabus

Parts of Speech, Named Entity Recognizer

Transcription:

Representing Documents; Unit Testing II Benjamin Roth CIS LMU Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 1 / 26

Documents and Word Statistics Often, documents are the units a natural language processing system starts with. Document: the basic organizational unit that is read in before further processing. Documents can be Tweets Wikipedia articles Product reviews Web pages... In the following we will look into how to represent documents how to write a basic search engine over documents Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 2 / 26

Representing Documents in Python Let s write a simple class for text documents. How to represent a document in python? What pieces of information do we want to store? Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 3 / 26

Representing Documents in Python How to represent a document in python? What pieces of information do we want to store? The raw text (string) of the document The tokenized text (list of strings) The token frequencies of the documents A unique identifier for each document... Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 4 / 26

Token frequencies How often did a particular word occur in a text? id:doc1 text: The raw text string of the document The tokenized text list of strings The token frequencies of the documents A unique identifier for each document Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 5 / 26

Token frequencies How often did a particular word occur in a text? id:doc1 text: The raw text string of the document The tokenized text list of strings The token frequencies of the documents A unique identifier for each document the : 5 of : 3 text, 2 document, 2 for, 1... Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 6 / 26

Token frequencies How often did a particular word occur in a text? id:doc1 text: The raw text string of the document The tokenized text list of strings The token frequencies of the documents A unique identifier for each document the : 5 of : 3 text, 2 document, 2 for, 1... This is an important summary information - we can measure similarity between documents by computing the overlap of their token frequency tables. (tfidf+cosine similarity) Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 7 / 26

A simple document class from nltk import FreqDist, word_tokenize class TextDocument: def init (self, text, identifier=none): """ Tokenizes a text and creates a document.""" # Store original version of text. self.text = text # Create dictionaries that maps tokenized, # lowercase words to their counts in the document. self.token_counts = # TODO self.id = identifier How to tokenize a Text? How to create a dictionary from words to counts? Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 8 / 26

A simple document class How to tokenize a Text? Split using regular expressions, e.g.: >>> input = "Dr. Strangelove is the U.S. President s advisor." >>> re.split(r \W+, input) [ Dr, Strangelove, is, the, U, S, President, \ s, advisor, ] Use nltk: >>> from nltk import word_tokenize >>> word_tokenize(input) [ Dr., Strangelove, is, the, U.S., President, \ " s", advisor,. ] Define a helper function: def normalized_tokens(text): """ Returns lower-cased tokens. >>> normalized_tokens(input) [ dr., strangelove, is, the, u.s., president, \ " s", advisor,. ]""" pass # TODO Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 9 / 26

A simple document class How to create a dictionary from words to counts? White board. Using dictionary comprehension? Using a for loop? Using the nltk frequency distribution (FreqDist)? check the documentation. Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 10 / 26

How to create a document Document can be created from different starting points... By setting text and id as strings. By reading plain text file. By reading compressed text file. By parsing XML. By requesting and parsing an HTML file.... However, only one constructor is possible in python. Arguments of the constructor: the basic elements which are common to all creation scenarios, and define the object (in our case text and document id) Similar to multiple constructors: Several different static class methods, that call the underlying base constructor. (This is a simple version of the so-called factory pattern) Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 11 / 26

Multiple static constructors class TextDocument: def init (self, text, identifier=none):... @classmethod def from_text_file(cls, filename): filename = os.path.abspath(filename) # TODO: read content of file into string # variable text. #... return cls(text, filename) @classmethod def from_http(cls, url, timeout_ms=100):... Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 12 / 26

Class methods The first argument (often named cls) of a function with the @classmethod function decorator, refers to the class itself (rather than the object). The constructor (or any other class method) can then be called from within that function using cls(...) What is the advantage of using... @classmethod def from_text_file(cls, filename): #... return cls(text, filename)... over using? @classmethod def from_text_file(cls, filename): #... return TextDocument(text, filename) Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 13 / 26

Brainstorming What are cases where it can make sense to use factory constructors (i.e. create instances using a method with the @classmethod decorator)? Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 14 / 26

Use cases for Factory Constructors If you create instances...... by reading from different sources. Examples: files, http, sql-database, mongodb, elastic Search index... by reading from different formats. Examples: xml, json, html... by parsing string options. Example: a=mytarclass(extract=true, verbose=true, gzip=true, \ use_archive_file=true) b=mytarclass.fromoptions("xzvf") (Can you guess what this class might do?)... where the same argument type is interpreted/parsed differently Example: a=mytime.fromtimex2("2017-08-01") b=mytime.fromgerman("1. August 2017")... Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 15 / 26

Next time: How to write the simple Search Engine Demo Questions? Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 16 / 26

Testing with the unittest module Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 17 / 26

Test-Driven Development (TDD): Recap Write tests first (, implement functionality later) Add to each test an empty implementation of the function (use the pass-statement) The tests initially all fail Then implement, one by one, the desired functionality Advantages: Define in advance what the expected input and outputs are Also think about important boundary cases (e.g. empty strings, empty sets, float(inf), 0, unexpected inputs, negative numbers) Gives you a measure of progress ( 65% of the functionality is implemented ) - this can be very motivating and useful! Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 18 / 26

The unittest module Similar to Java s JUnit framework. Most obvious difference to doctest: test cases are not defined inside of the module which has to be tested, but in a separate module just for testing. In that module... import unittest import the functionality you want to test define a class that inherits from unittest.testcase This class can be arbitrarily named, but XyzTest is standard, where Xyz is the name of the module to test. In XyzTest, write member functions that start with the prefix test... These member functions are automatically detected by the framework as tests. The tests functions contain assert-statements Use the assert-functions that are inherited from unittest.testcase (do not use the Python built-in assert here) Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 19 / 26

Different types of asserts Question:... what is the difference between a == b and a is b? Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 20 / 26

Example: using unittest test square.py import unittest from example_module import square class SquareTest(unittest.TestCase): def testcalculation(self): self.assertequal(square(0), 0) self.assertequal(square(-1), 1) self.assertequal(square(2), 4) Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 21 / 26

Example: running the tests initially test square.py $ python3 -m unittest -v test_square.py testcalculation (test_square.squaretest)... FAIL ====================================================================== FAIL: testcalculation (test_square.squaretest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/ben/tmp/test_square.py", line 6, in testcalculation self.assertequal(square(0), 0) AssertionError: None!= 0 ---------------------------------------------------------------------- Ran 1 test in 0.000s FAILED (failures=1) $ Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 22 / 26

Example: running the tests with implemented functionality $ python3 -m unittest -v test_square.py testcalculation (test_square.squaretest)... ok ---------------------------------------------------------------------- Ran 1 test in 0.000s OK $ Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 23 / 26

SetUp and Teardown setup and teardown are recognized and exectuted automatically before (after) the unit test are run (if they are implemented). setup: Establish pre-conditions that hold for several tests. Examples: Prepare inputs and outputs Establish network connection Read in data from file teardown (less frequently used): Code that must be executed after tests finished Example: Close network connection Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 24 / 26

Example using setup and teardown class SquareTest(unittest.TestCase): def setup(self): self.inputs_outputs = [(0,0),(-1,1),(2,4)] def testcalculation(self): for i,o in self.inputs_outputs: self.assertequal(square(i),o) def teardown(self): # Just as an example. self.inputs_outputs = None Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 25 / 26

Conclusion Test-driven development Using unittest module Also have a look at the online documentation! https://docs.python.org/3/library/unittest.html Questions? Benjamin Roth (CIS LMU) Representing Documents;Unit Testing II 26 / 26