Visualizing Translation Variation of Othello: A Survey of Text Visualization and Analysis Tools


Zhao Geng 1, Robert S. Laramee 1, Tom Cheesman 2, Stephan Thiel 3
1 Visual Computing Group, Swansea University, {cszg, r.s.laramee}@swansea.ac.uk
2 College of Arts and Humanities, Swansea University, UK, t.cheesman@swansea.ac.uk
and 3
April 30,

Abstract

Shakespeare is a global icon, and his plays have been translated into dozens of languages over about 300 years. There are also many re-translations into the same language; for example, there are more than 40 translations of Othello into German. Every translation is a different interpretation of the play. This large body of translations reflects changing cultures and expresses the individual thought of the authors. The translations connect different regions and offer a retrospective view of their histories. At the moment, researchers from Modern Languages are collecting a large number of translations of William Shakespeare's play, Othello. In recent years, since roughly 2005, we have witnessed a rapid increase in the number of off-the-shelf text visualization tools which can benefit this study. Here we set out to use existing text visualization techniques and tools in order to gain a better understanding of the various translations of Shakespeare's work. In particular, we would like to learn which content varies highly between translations and which content remains stable. We would also like to form hypotheses about the implications behind these variations.

2 Introduction

The goal of this project is to visualize the various translations of Shakespeare's work, Othello. The initial task is to identify and extract the non-semantic features from the original text of a document corpus. The non-semantic features refer to the number of words, tokens and patterns in the concordance.
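As a rough illustration of what extracting such non-semantic features can look like, the following minimal sketch counts tokens, distinct word types and per-word frequencies. The tokenizer (a regular expression over Unicode letters), the function names and the sample sentence are all assumptions made for illustration; the sample is invented and not taken from the Othello corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of Unicode letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def basic_stats(text):
    """Return the token count, the type (distinct word) count and word frequencies."""
    tokens = tokenize(text)
    freq = Counter(tokens)
    return {"tokens": len(tokens), "types": len(freq), "frequencies": freq}


# Invented sample sentence, used only to exercise the functions.
sample = "Sie liebte mich, und ich liebte sie."
stats = basic_stats(sample)
print(stats["tokens"], stats["types"], stats["frequencies"].most_common(2))
```

A real preprocessing step would likely need more care (hyphenation, historical spellings, speaker labels), which this sketch ignores.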
Text pre-processing facilitates the construction of text concordance, term relations, document relevance and other properties of interest. Based on the extracted information, various visualizations can be applied. In this document, we present the results of our survey of state-of-the-art techniques and free, off-the-shelf tools for text analysis and visualization. We conduct some experiments with a selection of tools using Shakespeare's Othello as an example. We investigate whether any of the freely available tools can provide clues to the variation among German translations of Othello. The rest of the paper is organized as follows: Section 3 reviews related work, Section 4 describes the data, Section 5 covers text preprocessing, Section 6 lists exploratory questions, Section 7 surveys freely available text visualization tools, Section 8 introduces our proposed visualization, and Section 9 concludes.

3 Related Work

Since 2005, we have observed a rapid increase in the number of text visualization prototypes being developed. As a result, various visual representations for text streams and documents have been proposed to present and explore text features effectively. In this section, we present some interesting and novel text visualizations which are able to present some of the extracted text attributes.

The ThemeRiver visualization, proposed by Havre, Hetzler, Whitney & Nowell (2002), depicts thematic variations over time within a large collection of documents. The thematic changes are shown in the context of a time line and corresponding external events. The focus on temporal thematic change within a context framework allows a user to discern patterns that suggest relationships or trends.

A Document Contrast Diagram, proposed by Clark (2008), is a visual summary of the content of two text documents that illustrates shared words, words that are unique to one document or the other, word frequency, relative size of the two documents, distribution of emotional tone within the documents, related words based on co-occurrence, and the most common word in each document segment. It uses the familiar bubble technique and colour to contrast topic usage in two bodies of text.

Parallel Tag Clouds, proposed by Collins, B.Viegas & Wattenberg (2009), combine parallel coordinates and tag clouds to provide a rich overview of a document collection. Each vertical axis represents a category. The words in each category are summarized in the form of a tag cloud along the vertical axis. When the user clicks on a word, its occurrences on the other vertical axes are connected. Several filters can be defined to reduce the amount of text displayed in each category. This frees screen space and improves the clarity of the visualization.

DocuBurst, proposed by Collins, Carpendale & Penn (2009), uses a radial, space-filling layout to depict document content by visualizing structured text. The structured text in this visualization refers to the is-kind-of or is-type-of relationship. For example, a robin or redbreast is a kind of bird, a bird is a kind of animal, an animal is a kind of organism or living thing, and a living thing is an entity. Such structured text forms a tree hierarchy, with entity as the root and robin or redbreast as a leaf. The root node of the DocuBurst visualization is shown as a circle. All other nodes are assigned to a sector of an annulus. The angular width of each sector is mapped to the number of leaves or children.

SparkClouds, proposed by Lee, Riche, Karlson & Carpendale (2010), integrate sparklines into a tag cloud to convey trends between multiple tag clouds.
The sparklines can be used to present the trend over time. A controlled study that compares SparkClouds with two traditional trend visualizations (multiple line graphs and stacked bar charts) as well as Parallel Tag Clouds shows that SparkClouds are more effective at showing trends over time.

ManiWordle, proposed by Koh, Lee, Kim & Seo (2010), provides flexible control so that the user can directly manipulate the original Wordle to change its layout, colour and so on.

Figure 1: This figure shows the time and place of the collected 47 translations in scatterplots. The x-axis depicts the publication date and the y-axis depicts the place of publication. The dot size corresponds to the year of publication: newer versions have larger dots.

4 Background Data Description

The domain experts from Modern Languages have collected more than 47 different German translations of Shakespeare's play, Othello. These translations were published between 1766 and 2006 in 5 different countries, as shown in Figure 1. The details of the authors and the title of each translation can be visualized with the Matrix Chart in ManyEyes (Viegas, Wattenberg, van Ham, Kriss & Mckeon (2007)), as shown in Figure 2. Apart from the main texts of Othello, the domain experts also list some important metadata, such as the originality score of each translation, which is manually calculated to give a relatively objective measurement of how similar each translation is to the original play. In the next few sections, we apply 23 different versions of the Othello translations to various available visualization tools.

Figure 2: This figure shows the details of our Othello collection using the Matrix Chart in ManyEyes developed by Viegas et al. (2007). The row is mapped to the publication country. The column is mapped to the authors and color represents the title of the book.

5 Text Preprocessing

Using the text preprocessing tools introduced in Section 5, we can collect a wide range of text attributes, such as word relationships, word frequency and sentence segmentation. The domain experts are particularly interested in how different segments or paragraphs vary across documents.

The software WordSmith, developed by WordSmith.org (1996), is able to generate various text attributes, such as word frequency, parts of speech and other statistical information. The outcome of the analysis includes a large amount of statistical data about word frequencies in the texts (both absolute values and values compared with other texts or with external corpora) and key word lists (words which occur unusually frequently in comparison with some kind of reference corpus). A screenshot of the software is shown in Figure 3.

Figure 3: This figure shows the interface of WordSmith developed by WordSmith.org (1996).

The software Concordance, developed by Watt (2009), is created for people who need in-depth language or text analysis. It provides a free trial for the user. Concordance is able to generate indexes and word lists, count word frequencies, compare different usages of a word, analyse keywords and publish the analysis results on the web. A screenshot of the software is shown in Figure 4.

Figure 4: This figure shows the interface of Concordance developed by Watt (2009).

6 Exploratory Specification

Some questions that users may ask of a digital resource on Shakespeare in Translation:
- Where, when, into which languages, and into which kinds of text has Othello been translated?
- How have translations influenced one another?
- How do the versions vary in general?
- How do the versions vary in particular?
- How do translations deal with any specific term or phrase that the domain experts are interested in?
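The question of how translations deal with a specific term is the kind of lookup that concordance tools answer with a keyword-in-context (KWIC) listing. The sketch below is a minimal, assumed implementation of that idea and does not reproduce the behaviour of WordSmith or Concordance; the function name `kwic`, its `width` parameter and the sample sentence are hypothetical.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left, keyword, right) triples with up to `width` context words per side."""
    tokens = re.findall(r"[^\W\d_]+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Invented sample sentence, not taken from the corpus.
sample = "Sie liebte mich, und ich liebte sie."
for left, word, right in kwic(sample, "liebte", width=2):
    print(f"{left:>12} | {word} | {right}")
```

Running the same query over each translation in turn would give one listing per document, which can then be compared side by side.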
7 State-of-the-art Text Visualization

In this section, we investigate text visualization tools which are free to the public. Our work can help modern language experts find the visualizations that most benefit the analysis of their collected Shakespeare translations. We experiment with these freely available tools on the 23 German translations of Othello's speech to the senate in Shakespeare's play Othello. The following list summarizes the tools and the visualizations we have tried:
- The ManyEyes (Viegas et al. (2007)) application with the following visualizations: Wordle, Tag Cloud, Phrase Net, Word Tree
- The TextArc (Paley (2002)) application with the following visualization: Text Arc
- The NameVoyager (Wattenberg (2005)) application with the following visualization: Stack Graph
- The Tagline Generator (Mehta (2006)): Cloud, Time Line Tag
- The Wordle (Jonathan Feinberg (2009)): Wordle

A TextArc (Paley (2002)) is a visual representation of an entire text on a single page. It is an advanced combination of an index, concordance, and summary of the text. Animation is provided to help the user keep track of the changing relationships between different words, phrases and sentences. In TextArc, the entire text is depicted as an ellipse. Each line is drawn on the outside of the ellipse, which preserves the typographic structure of the text. Words appear in the middle of the ellipse. A word with high frequency is displayed in a brighter color and a larger size. If a word is used more than once, it appears at the center of all of its mentions. TextArc only accepts texts from the TextArc library. Figure 5 shows the visualization of Shakespeare's play Othello generated by TextArc.

Figure 5: This figure shows the TextArc (Paley (2002)) visualization of Shakespeare's Othello in English. The entire text is depicted as an ellipse. Each line is drawn on the outside of the ellipse. Words are rendered in the middle of the ellipse.

The NameVoyager (Wattenberg (2005)), a web-based visualization of historical trends in baby naming, has proved remarkably popular. The method used to visualize the data is straightforward: given a set of name popularity time series, a set of stacked graphs is produced. However, this tool does not accept user-customized data sets.


Figure 6: This figure shows the visualizations of two German translations of Othello using the Tagline Generator (Mehta (2006)). By moving the scroll bar, the user is able to see the visualization of each individual document. We experimented with more than 20 German translations of the Othello play using this tool.

Figure 7: This figure shows the Tag Clouds (B.Scott. et al. (2008)) of Othello using ManyEyes (Viegas et al. (2007)). The left image depicts the tag cloud for every single word, whereas the right image shows the tag cloud of pairs of words starting with the letter b. ManyEyes does not provide text preprocessing options in the Tag Cloud, such as removing common words.

The Tagline Generator (Mehta (2006)) is a simple PHP codebase that lets the user generate chronological tag clouds from simple text data sources without manually tagging the data entries. Once the user has populated the data source and configured the generator, it creates a list of all the unique words that have been used and counts their frequency. Next it identifies the different variations of words and combines them under the most common variation using the Porter Stemming Algorithm. The size of a word indicates its frequency in the document. The brightness indicates the year of the document: newer documents are brighter. The Tagline Generator accepts its data as an XML file. Figure 6 shows the TagLine visualization of 23 German translations of Shakespeare's play, Othello.

ManyEyes (Viegas et al. (2007)) is a free website where anyone can upload, visualize, and discuss data. It is an experiment created by the Visual Communication Lab. ManyEyes does not read its input from files; instead it accepts any form of free text copied and pasted from any source. It provides a number of text visualizations, such as Tag Clouds, Phrase Net and Word Tree. Again, we apply our Othello data, which contains 23 German translations of the play, to the visualizations in this tool.
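The counting and variant-merging steps described for the Tagline Generator can be sketched roughly as follows. This is not the tool's PHP code. The Tagline Generator is described as using the Porter Stemming Algorithm, whereas this sketch swaps in a deliberately crude suffix stripper (`crude_stem`, a hypothetical helper with an invented suffix list) so that it stays self-contained; the token list is also invented.

```python
from collections import Counter, defaultdict

# Hypothetical, crude suffix list for illustration. This is NOT the Porter algorithm.
SUFFIXES = ("en", "er", "es", "e", "n", "s", "t")


def crude_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def merge_variants(tokens):
    """Group tokens by stem and label each group with its most frequent surface form."""
    counts = Counter(tokens)
    groups = defaultdict(Counter)
    for word, n in counts.items():
        groups[crude_stem(word)][word] = n
    merged = {}
    for variants in groups.values():
        label = variants.most_common(1)[0][0]
        merged[label] = sum(variants.values())
    return merged


# Invented token list, not taken from the corpus.
tokens = ["liebe", "liebe", "lieben", "liebt", "gefahr", "gefahr", "gefahren"]
print(merge_variants(tokens))
```

The merged counts would then drive the font size of each tag, with one cloud per document along the time line.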
The standard Tag Cloud (B.Scott. et al. (2008)) is a popular text visualization for depicting term frequencies. Tags are usually single words and are normally listed alphabetically, and the importance of each tag is shown with font size or color, as shown in Figure 7.

Word Tree (Wattenberg & B.Viegas (2008)) is a graphical version of the traditional keyword-in-context method, and enables rapid querying and exploration of bodies of text, as shown in Figure 8. It is a visual search tool for unstructured text, such as a book, article, speech or poem. It lets the user choose a word or phrase and shows all the different contexts in which that word or phrase appears. The contexts are arranged in a tree-like branching structure to reveal recurrent themes and phrases. The size of a word represents its frequency.

Figure 8: This image shows the Word Tree (Wattenberg & B.Viegas (2008)) of the Othello data using ManyEyes (Viegas et al. (2007)). When we input the word liebte, all of the sentences beginning with this word are shown. The size of a word represents its frequency.

Phrase Nets (van Ham et al. (2009)) illustrate the relationships between different words used in a text. They use a simple form of pattern matching to provide multiple views of the concepts contained in a book, speech, or poem. For example, given the connecting word "and", two words are connected in the network if they appear together in a phrase of the form "X and Y", as shown in Figure 9.

Figure 9: This image shows the Phrase Net (van Ham et al. (2009)) of our Othello data using ManyEyes (Viegas et al. (2007)). It depicts pairs of words separated by a space in the Othello play. The size of a word depicts its frequency.

Word Attribute / Visual Mapping
Figure 12: This table is a classification matrix in which the columns represent the visual mapping elements and the rows represent the text attributes. Each element of the matrix represents a visualization technique we have introduced in this paper.

TagCrowd (Steinbock (2008)) is a web application for visualizing word frequencies in any user-supplied text by creating a tag cloud or text cloud (B.Scott. et al. (2008)). The advantage of TagCrowd is that users can define the common words themselves, and these common words are automatically removed from the original text.
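The effect of a user-defined list of common words on a tag cloud can be illustrated with a short sketch. It is an assumption-laden approximation rather than TagCrowd's actual implementation: the stop-word set, the linear font-size mapping and its point-size range are all hypothetical choices, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical user-defined list of common German words.
STOP_WORDS = {"und", "ich", "sie", "der", "die", "das"}


def cloud_counts(text, stop_words, top_n=10):
    """Count words after dropping stop words; return the top_n (word, count) pairs."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    return Counter(kept).most_common(top_n)


def font_size(count, max_count, smallest=10, largest=40):
    """Map a count linearly to a point size; the size range is an arbitrary choice."""
    return smallest + (largest - smallest) * count / max_count


# Invented sample sentence, not taken from the corpus.
sample = "Sie liebte mich, und ich liebte sie."
counts = cloud_counts(sample, STOP_WORDS)
top = counts[0][1]
for word, n in counts:
    print(word, n, font_size(n, top))
```

Without the filter, function words such as "sie" would compete with content words for the largest sizes, which is the problem the removal step addresses.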
TagCrowd accepts the same kind of input text as ManyEyes. Figure 10 shows the Tag Cloud visualization of our Othello data sets, with the common German words removed.


Wordle (Jonathan Feinberg (2009)) is a tool for generating word clouds from text that the user provides. It accepts the same text format as ManyEyes. Wordles are more artistically arranged (and often vibrantly colored) versions of a text. They tend to be less directly insightful as visualizations, but often give a more personal feel to a document. The clouds give greater prominence to words that appear more frequently in the source text. The user can tweak the clouds with different fonts, layouts, and color schemes. Figure 11 shows the Wordle visualization of some passages from Othello, with the common German words removed.

Figure 10: This image shows the TagCrowd (Steinbock (2008)) visualization of a passage from Othello. The common German words, or stop lists, are manually defined and removed from the original text.

Figure 11: This image shows the Wordle visualization (Jonathan Feinberg (2009)) of a passage from the Othello data sets. The common German words are removed.

In this section, we also propose a classification matrix to categorize different visualization techniques. The columns of our matrix depict the encoded visual elements. Size, orientation, shape and color are used to encode an individual value, and lines, curves and trees are used to encode relationships and hierarchy. The rows of the matrix depict the text attributes, including individual word frequency and word relationships. An overview of the table is shown in Figure 12.

8 Proposed Visualization

In this section, we briefly introduce our proposed visualization for our Othello data set. Parallel coordinates, introduced by Inselberg (2009), is a widely used visualization technique for exploring large, multidimensional data sets. It is powerful in revealing a wide range of data characteristics, such as different data distributions and functional dependencies. The textual information of each document can be transformed into a vector.
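One simple way to turn a document into a vector is to count how often each word from a fixed keyword list occurs in it. The sketch below shows that idea under stated assumptions: the function name `tf_vectors`, the document names, the sample texts and the keyword list are all invented for illustration and are not the authors' implementation or data.

```python
import re
from collections import Counter


def tf_vectors(documents, keywords):
    """Map each document name to a list of raw keyword counts, in keyword order."""
    vectors = {}
    for name, text in documents.items():
        counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
        vectors[name] = [counts[k] for k in keywords]
    return vectors


# Invented documents and keywords, used only to exercise the function.
docs = {
    "translation_a": "Sie liebte mich, und ich liebte sie.",
    "translation_b": "Sie hat mich geliebt.",
}
keywords = ["liebte", "geliebt", "mich"]
vectors = tf_vectors(docs, keywords)
for name, vec in vectors.items():
    print(name, vec)
```

A count of zero means the keyword does not occur in that document. Reading the same position across all documents gives the values that a single keyword's polyline would pass through in a parallel coordinates plot with one axis per document.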
In our parallel coordinates, we encode the document dimensions as term frequencies. Domain experts from Arts and Humanities selected eight interesting translations according to their similarity score. For initial analysis, we chose a significant passage from the play, Othello's big speech to the Venetian Senate in Act 1, Scene 3: the longest single speech in the play (about 300 words in Shakespeare's text). Figure 13 shows an overview of our visualization. The column on the far left displays a list of selected keywords: these are the most frequently occurring significant words in the document corpus. The parallel coordinates present a focused view of keyword frequencies. Each document is represented by a vertical axis. In order to maintain a unified scale, the height of each vertical axis is made proportional to the range between that document's minimal and maximal word frequencies. Zero frequency simply means that a keyword does not occur in that document. The thickness of each vertical axis is mapped to the document's similarity with the others in terms of LSI score: a thicker line means a higher similarity value. The numbers of occurrences of each keyword in the documents are connected by a polyline. Each polyline is rendered in a different color to enable visual discrimination. The text boxes below the parallel coordinates provide context views for keywords selected by the user. Each text box represents an individual document and shows the entire sentences from the original text in which each selected keyword occurs. We also apply edge bundling to enhance visual clustering, and the user is able to control the curvature of the edges. At the minimum curvature, each curve becomes a straight line.

Figure 13: This image shows an overview of our visualization. The parallel coordinates illustrate a focus view of the term frequency. The text boxes below the parallel coordinates show the context views. They present the entire sentences from the original text in which each keyword appears.

9 Conclusion

We have surveyed a range of off-the-shelf, freely available information visualization tools for the visual analysis and investigation of the Othello data set. Although there are many options available, only a select few visualizations are useful for this particular application. Our study also serves as a useful resource for readers interested in gaining an overview of existing, free, state-of-the-art information visualization tools for text analysis. We also introduce our proposed visualization system designed for the analysis of the Othello data set.

References

B.Scott., G.Carl. & N.Miguel. (2008), Seeing Things in the Clouds: The Effect of Visual Features on Tag Cloud Selections, in HT '08: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, ACM, New York, NY, USA.
Clark, J. (2008), Document contrast diagrams.
Collins, C., B.Viegas, F. & Wattenberg, M. (2009), Parallel Tag Clouds to Explore and Analyze Faceted Text Corpora, in IEEE Symposium on Visual Analytics Science and Technology, Computer Society.
Collins, C., Carpendale, M. S. T. & Penn, G. (2009), DocuBurst: Visualizing Document Content using Language Structure, Computer Graphics Forum 28(3).
Havre, S., Hetzler, E., Whitney, P. & Nowell, L. (2002), ThemeRiver: Visualizing Thematic Changes in Large Document Collections, IEEE Transactions on Visualization and Computer Graphics 8(1).
Inselberg, A. (2009), Parallel Coordinates: Visual Multidimensional Geometry and Its Applications, Springer.
Jonathan Feinberg (2009), Wordle: Beautiful Word Clouds.

Koh, K., Lee, B., Kim, B. H. & Seo, J. (2010), ManiWordle: Providing Flexible Control over Wordle, IEEE Transactions on Visualization and Computer Graphics 16(6).
Lee, B., Riche, N. H., Karlson, A. K. & Carpendale, M. S. T. (2010), SparkClouds: Visualizing Trends in Tag Clouds, IEEE Transactions on Visualization and Computer Graphics 16(6).
Mehta, C. (2006), Tagline Generator - Timeline-based Tag Clouds.
Paley, W. B. (2002), TextArc: An Alternative Way to View Text.
Steinbock, D. (2008), TagCrowd: Joining the Crowd Together.
van Ham, F., Wattenberg, M. & Viégas, F. B. (2009), Mapping Text with Phrase Nets, IEEE Transactions on Visualization and Computer Graphics 15(6).
Viegas, F. B., Wattenberg, M., van Ham, F., Kriss, J. & Mckeon, M. (2007), ManyEyes: A Site for Visualization at Internet Scale, IEEE Transactions on Visualization and Computer Graphics 13(6).
Watt, R. J. C. (2009), Concordance.
Wattenberg, M. (2005), Baby Names, Visualization, and Social Data Analysis, in Proceedings of 2005 IEEE Symposium on Information Visualization (INFOVIS).
Wattenberg, M. & B.Viegas, F. (2008), The Word Tree, an Interactive Visual Concordance, IEEE Transactions on Visualization and Computer Graphics 14(6).
WordSmith.org (1996), WordSmith Tools.
