Evaluation of alignment methods for HTML parallel text 1

Similar documents
HTML: Parsing Library

HTML: Parsing Library

Oliver Pott HTML XML. new reference. Markt+Technik Verlag

HTML TAG SUMMARY HTML REFERENCE 18 TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES MOST TAGS

The [HTML] Element p. 61 The [HEAD] Element p. 62 The [TITLE] Element p. 63 The [BODY] Element p. 66 HTML Elements p. 66 Core Attributes p.

ROLE OF WEB BROWSING LAYOUT ENGINE EVALUATION IN DEVELOPMENT

Index. CSS directive, # (octothorpe), intrapage links, 26

"utf-8";

5-Sep-16 Copyright 2016 by GemTalk Systems LLC 1

Designing UI. Mine mine-cetinkaya-rundel

HTML Markup for Accessibility You Never Knew About

@namespace url( /* set default namespace to HTML */ /* bidi */

Canvas & Brush Reference. Source: stock.xchng, Maarten Uilenbroek

Wireframe :: tistory wireframe tistory.

QUICK REFERENCE GUIDE

COPYRIGHTED MATERIAL. Contents. Chapter 1: Creating Structured Documents 1

Chapter 2:- Introduction to XHTML. Compiled By:- Sanjay Patel Assistant Professor, SVBIT.

WML2.0 TUTORIAL. The XHTML Basic defined by the W3C is a proper subset of XHTML, which is a reformulation of HTML in XML.

Deccansoft Software Services

Certified HTML Designer VS-1027

Silk Test Object Recognition with the Classic Agent

Name Related Elements Type Default Depr. DTD Comment

SilkTest 2009 R2. Rules for Object Recognition

CPET 499/ITC 250 Web Systems. Topics

CHAPTER 7 USER INTERFACE MODEL

Stand-off annotation of web content as a legally safer alternative to bitext crawling for distribution 1

A HTML document has two sections 1) HEAD section and 2) BODY section A HTML file is saved with.html or.htm extension

Appendix A. XHTML 1.1 Module Reference

COPYRIGHTED MATERIAL. Contents. Introduction. Chapter 1: Structuring Documents for the Web 1

Structure Bars. Tag Bar

HTML Summary. All of the following are containers. Structure. Italics Bold. Line Break. Horizontal Rule. Non-break (hard) space.

Certified HTML5 Developer VS-1029

<page> 1 Document Summary Document Information <page> 2 Document Structure Text Formatting <page> 3 Links Images <page> 4

HTML Cheat Sheet for Beginners

Beginning Web Programming with HTML, XHTML, and CSS. Second Edition. Jon Duckett

HTML and CSS COURSE SYLLABUS

UNIT II Dynamic HTML and web designing

Symbols INDEX. !important rule, rule, , 146, , rule,

Table-Based Web Pages

Internet publishing HTML (XHTML) language. Petr Zámostný room: A-72a phone.:

Go.Web Style Guide. Oct. 16, Hackensack Ave Hackensack, NJ GoAmerica, Inc. All rights reserved.

Electronic Books. Lecture 6 Ing. Miloslav Nič Ph.D. letní semestr BI-XML Miloslav Nič, 2011

BACKGROUND. HTTP is a 2-phase protocol used by most web applications and all web browsers. The response is usually an HTML document

Winwap Technologies Oy. WinWAP Browser. Application Environment

HTML BEGINNING TAGS. HTML Structure <html> <head> <title> </title> </head> <body> Web page content </body> </html>

13.8 How to specify alternate text

Cascading Style Sheet

Web Development & Design Foundations with HTML5 & CSS3 Instructor Materials Chapter 2 Test Bank

Bye Bye. Warning! HTML5 The Future of HTML. Saturday, October 9, All of this may change! The spec is still in development!

HTML5. The Future of HTML. Washington University in St. Louis R. Scott Granneman

Chapter 1 Self Test. LATIHAN BAB 1. tjetjeprb{at}gmail{dot}com. webdesign/favorites.html :// / / / that houses that information. structure?

Creating and Setting Up the Initial Content

Programming of web-based systems Introduction to HTML5

Inline Elements Karl Kasischke WCC INP 150 Winter

GESTIONE DEL BACKGROUND. Programmazione Web 1

Continues the Technical Activities Originated in the WAP Forum

Tutorial 2 - HTML basics

Announcements. Paper due this Wednesday

Static Webpage Development

Advanced Web Programming C2. Basic Web Technologies

Tables & Lists. Organized Data. R. Scott Granneman. Jans Carton

Web Development & Design Foundations with HTML5 & CSS3 Instructor Materials Chapter 2 Test Bank

Internet Explorer HTML 4.01 Standards Support Document

Text and Layout. Digital Multimedia, 2nd edition Nigel Chapman & Jenny Chapman Chapter 11. This presentation 2004, MacAvon Media Productions

Basics of Web Design, 3 rd Edition Instructor Materials Chapter 2 Test Bank

Evaluation of Different Approaches to Training a Genre Classifier

ATL Transformation Example

Web Development and Design Foundations with HTML5 8th Edition

HTML. HTML is now very well standardized, but sites are still not getting it right. HTML tags adaptation

A Brief Introduction to HTML

Bitextor s participation in WMT 16: shared task on document alignment

1/6/ :28 AM Approved New Course (First Version) CS 50A Course Outline as of Fall 2014

CSI 3140 WWW Structures, Techniques and Standards. Markup Languages: XHTML 1.0

Web development using PHP & MySQL with HTML5, CSS, JavaScript

CS105 Course Reader Appendix A: HTML Reference

HTML. HTML Evolution

Web Designing HTML5 NOTES

Portia Documentation. Release Scrapinghub

Web Design and Application Development

CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0

COMP519: Web Programming Lecture 4: HTML (Part 3)

Government of Karnataka Department of Technical Education Bengaluru

IMS Question and Test Interoperability Information Model v2.0

recall: a Web page is a text document that contains additional formatting information in the HyperText Markup Language (HTML)

MSEP Partner Portal JSON Feed Standards

CSC 121 Computers and Scientific Thinking

WebMining: An unsupervised parallel corpora web retrieval system

XHTML Mobile Profile. Candidate Version Feb Open Mobile Alliance OMA-TS-XHTMLMP-V1_ C

Skill Area 323: Design and Develop Website. Multimedia and Web Design (MWD)

NAME: name a section of the page TARGET = "_blank" "_parent" "_self" "_top" window name which window the document should go in

HTML CS 4640 Programming Languages for Web Applications

2.1 Origins and Evolution of HTML

Summary 4/5. (contains info about the html)

Copyright 2012 Disruptive Innovations SAS - All rights reserved.

LING 408/508: Computational Techniques for Linguists. Lecture 14

WCAG2-AAA accessibility report for

1.264 Lecture 12. HTML Introduction to FrontPage

Html basics Course Outline

!Accessibility Issues Found

A Balanced Introduction to Computer Science, 3/E

Transcription:

Evaluation of alignment methods for HTML parallel text 1 Enrique Sánchez-Villamil, Susana Santos-Antón, Sergio Ortiz-Rojas, Mikel L. Forcada Transducens Group, Departament de Llenguatges i Sistemes Informàtics Universitat d Alacant, E-03071 Alacant (Spain) FinTAL Turku/Åbo August 23, 2006 1 Funded through Spanish Govt. grant TIC2003-08681-C02-01.

Contents 1 Introduction 2 3 4 5

Parallel texts on the Internet Parallel texts on the Internet The need for alignment Using document structure to align Corpus-based machine translation relies on the availability of large parallel text corpora. The Internet contains abundant parallel text that may be harvested for that purpose.

The need for alignment Parallel texts on the Internet The need for alignment Using document structure to align In particular, translation learning or computer-aided translation usually require aligned corpora (for instance, sentence-aligned corpora). Aligners usually require the left and right text to be previously segmented in sentences. Geometrical aligners use the relationships between the lengths of left and right sentences which are mutual translation (Brown et al. 1991, Gale & Church, 1991, 1993). Alignments may improve if linguistic (e.g. lexical) information is added to them.

Parallel texts on the Internet The need for alignment Using document structure to align Using document structure to align Our focus is on aligners which are linguistically independent (no linguistic information is required) but take advantage of document structure (HTML tag structure) Our main hypothesis: taking document structure into account improves alignment.

Geometric alignment Sentence splitting Geometric alignment: edit distance /1 L = (l 1, l 2,..., l L ), left text split in L segments. R = (r 1, r 2,..., r R ), left text split in R segments. R may be obtained from L through a sequence S = (s 1, s 2,..., s S ) of edit operations: segment insertions, segment deletions, or segment substitutions

Geometric alignment Sentence splitting Geometric alignment: edit distance /2 For L and R there is an edit operation sequence A(L, R) = S = (s1, s 2,..., s S ), which is optimal in the sense that S D(S ) = abs( l i r i ) i=1 is the minimum over all possible edit operation sequences.

Sentence splitting heuristics Geometric alignment Sentence splitting Sentence splitting is performed at!,? and.. For. s, validate using a threshold ( 0.2) and empirical scores: Characters before Characters after Points - number 0.5 - a blank +0.5 - a non-capital letter 0.2 - another. 0.5 - blank + capital letter +0.5 - a blank + non-capital letter 0.2 a capital letter - 0.5 a word of 3 characters or less - 0.5 a blank - +0.2 a or " character a or " character 0.5 another. - +0.4

Geometric alignment of (X)HTML texts Availability Geometric alignment of (X)HTML texts /1 Two baseline approaches (for comparison): Remover: remove tags, split in sentences, and align. Replacer: replace hr, br, p, li, ol, ul, tr, td, th and div by sentence boundaries, split, and align. These baselines will be used to evaluate a family of tag-driven alignment algorithms.

Geometric alignment of (X)HTML texts Availability Geometric alignment of (X)HTML texts /2 We use the following classification of (X)HTML tags: Structural tags: blockquote, body, caption, col, colgroup, dd, dir, div, dl, dt, h1 h6, head, hr, html, li, menu, noframes, noscript, ol, optgroup, option, p, q, select, table, tbody, td, tfoot, th, thead, tr, ul. Format tags: abbr, acronym, b, big, center, cite, code, dfn, em, font, i, pre, s, small, span, strike, strong, style, sub, sup, tt, u. Content tags: a, area, fieldset, form, iframe, img, input, isindex, label, legend, map, object, param, textarea, title. Irrelevant tags (will be removed): address, applet, base, basefont, bdo, br, button, del, ins, kbd, link, meta, samp, script, var.

Geometric alignment of (X)HTML texts Availability Geometric alignment of (X)HTML texts /3 How is tag information used to align? Forbidden alignments: Tag text segment (and vice versa) Structural tag tag of another class Format tag tag of another class Content tag different content tag Structural tag structural tag alignment expensive unless they are the same tag. Format tag format tag alignment cheap The cost of text alignments should be lower than tag tag costs.

Geometric alignment of (X)HTML texts Availability Geometric alignment of (X)HTML texts /4 Costs of edit operations (empirically adjusted): Insert <strct> <frmt> <cntnt> Text Delete - 1 0.75 1.25 0.01 l <strct> 1 1.5 1.75 H H <frmt> 0.75 1.75 0.4 H H <cntnt> 1.25 H H H H Text 0.01 r H H H H is large enough for that operation to be always avoided will be different in each variant of the algorithm (next slide)

Geometric alignment of (X)HTML texts Availability Geometric alignment of (X)HTML texts /5 Three variants of the tag-driven alignment: 2-in-1: Split texts in tags and text segments and then align. Uses = 0.015 (abs( l r )) (factor 0.015 empirically obtained) 2-steps-L: Split text by tags; align tags and text segments, split text segments in sentences and align. Uses = 0.015 (abs( l r )). 2-steps-AD: Same as 2-steps-L but uses = 0.01 D(A(l, r)) (the alignment distance between the text segments; factor 0.01 empirically determined).

Availability Introduction Geometric alignment of (X)HTML texts Availability An implementation of the tag-driven aligners described is available as open-source software (under the GPL license) from tag-aligner.sourceforge.net.

Corpora Introduction Corpora Reference alignment Metrics Results Three corpora: 86.1 MB of parallel HTML downloaded using Bitextor (http://www.sf.net/projects/bitextor/) from the bilingual es ca daily http://www.elperiodico.com ; [ easy : 2.14% sentences align to null]. a small fragment of the Quixote (196 kb: en, es) from a digital Library, http://www.cervantesvirtual.com/ [ medium : 19.08% sentences align to null] Help texts of the mirc program (96 kb: es, pt, it, ca, gl). [ hard : 26.42% sentences aligned to null]

Corpora Reference alignment Metrics Results Building a reference alignment Reference alignment obtained by post-editing the automatic alignment: Human editor corrects incorrect alignments into correct ones. Human editor seldom splits ( refines ) a correct alignment. Just in case, concatenation of reference alignments will be allowed during evaluation (not common).

Metrics Introduction Corpora Reference alignment Metrics Results precision = recall = F = 2 # correct alignments # proposed alignments # correct alignments # reference alignments recall precision recall + precision

Results: precision Corpora Reference alignment Metrics Results

Results: recall Corpora Reference alignment Metrics Results

Results: F-measure Corpora Reference alignment Metrics Results

Concluding remarks Concluding remarks Future work Aligners using (X)HTML tag information are clearly better than the basic geometric aligners...... at only a twofold increase in processing time. The best is the 2-in-1 aligner (the one that splits texts at sentence boundaries and tags and then aligns). As expected, results are worse for harder bitext corpora.

Future work Introduction Concluding remarks Future work Incorporating the same tag-based strategies into existing open-source computer-aided translation tools: the bitext2tmx text aligner (bitext2tmx.sf.net). the OmegaT computer-aided translation tool (omegat.sf.net) Extending the aligner to other XML-based formats (e.g., DocBook, OpenDocument). Task-oriented evaluation of automatically-generated TMX files in real computer-aided translation applications. Developers welcome at tag-aligner.sf.net!