HTML: Parsing Library

Similar documents
HTML: Parsing Library

Evaluation of alignment methods for HTML parallel text 1

Oliver Pott HTML XML. new reference. Markt+Technik Verlag

The [HTML] Element p. 61 The [HEAD] Element p. 62 The [TITLE] Element p. 63 The [BODY] Element p. 66 HTML Elements p. 66 Core Attributes p.

Index. CSS directive, # (octothorpe), intrapage links, 26

5-Sep-16 Copyright 2016 by GemTalk Systems LLC 1

HTML TAG SUMMARY HTML REFERENCE 18 TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES TAG/ATTRIBUTE DESCRIPTION PAGE REFERENCES MOST TAGS

ROLE OF WEB BROWSING LAYOUT ENGINE EVALUATION IN DEVELOPMENT

Designing UI. Mine mine-cetinkaya-rundel

QUICK REFERENCE GUIDE

"utf-8";

HTML Markup for Accessibility You Never Knew About

Canvas & Brush Reference. Source: stock.xchng, Maarten Uilenbroek

@namespace url( /* set default namespace to HTML */ /* bidi */

Wireframe :: tistory wireframe tistory.

COPYRIGHTED MATERIAL. Contents. Chapter 1: Creating Structured Documents 1

Chapter 2:- Introduction to XHTML. Compiled By:- Sanjay Patel Assistant Professor, SVBIT.

Name Related Elements Type Default Depr. DTD Comment

WML2.0 TUTORIAL. The XHTML Basic defined by the W3C is a proper subset of XHTML, which is a reformulation of HTML in XML.

Appendix A. XHTML 1.1 Module Reference

CPET 499/ITC 250 Web Systems. Topics

Certified HTML Designer VS-1027

SilkTest 2009 R2. Rules for Object Recognition

CHAPTER 7 USER INTERFACE MODEL

Silk Test Object Recognition with the Classic Agent

HTML Cheat Sheet for Beginners

<page> 1 Document Summary Document Information <page> 2 Document Structure Text Formatting <page> 3 Links Images <page> 4

COPYRIGHTED MATERIAL. Contents. Introduction. Chapter 1: Structuring Documents for the Web 1

HTML Summary. All of the following are containers. Structure. Italics Bold. Line Break. Horizontal Rule. Non-break (hard) space.

Deccansoft Software Services

Go.Web Style Guide. Oct. 16, Hackensack Ave Hackensack, NJ GoAmerica, Inc. All rights reserved.

Structure Bars. Tag Bar

Electronic Books. Lecture 6 Ing. Miloslav Nič Ph.D. letní semestr BI-XML Miloslav Nič, 2011

A HTML document has two sections 1) HEAD section and 2) BODY section A HTML file is saved with.html or.htm extension

Certified HTML5 Developer VS-1029

Internet publishing HTML (XHTML) language. Petr Zámostný room: A-72a phone.:

UNIT II Dynamic HTML and web designing

Winwap Technologies Oy. WinWAP Browser. Application Environment

BACKGROUND. HTTP is a 2-phase protocol used by most web applications and all web browsers. The response is usually an HTML document

Beginning Web Programming with HTML, XHTML, and CSS. Second Edition. Jon Duckett

HTML BEGINNING TAGS. HTML Structure <html> <head> <title> </title> </head> <body> Web page content </body> </html>

Cascading Style Sheet

Static Webpage Development

Bye Bye. Warning! HTML5 The Future of HTML. Saturday, October 9, All of this may change! The spec is still in development!

HTML5. The Future of HTML. Washington University in St. Louis R. Scott Granneman

Symbols INDEX. !important rule, rule, , 146, , rule,

Programming of web-based systems Introduction to HTML5

Inline Elements Karl Kasischke WCC INP 150 Winter

Table-Based Web Pages

13.8 How to specify alternate text

Advanced Web Programming C2. Basic Web Technologies

GESTIONE DEL BACKGROUND. Programmazione Web 1

HTML and CSS COURSE SYLLABUS

ATL Transformation Example

Continues the Technical Activities Originated in the WAP Forum

Internet Explorer HTML 4.01 Standards Support Document

Web development using PHP & MySQL with HTML5, CSS, JavaScript

CSI 3140 WWW Structures, Techniques and Standards. Markup Languages: XHTML 1.0

Chapter 1 Self Test. LATIHAN BAB 1. tjetjeprb{at}gmail{dot}com. webdesign/favorites.html :// / / / that houses that information. structure?

HTML. HTML is now very well standardized, but sites are still not getting it right. HTML tags adaptation

Creating and Setting Up the Initial Content

MSEP Partner Portal JSON Feed Standards

A Brief Introduction to HTML

Scripting for Multimedia LECTURE 1: INTRODUCING HTML5

Web Development & Design Foundations with HTML5 & CSS3 Instructor Materials Chapter 2 Test Bank

IMS Question and Test Interoperability Information Model v2.0

!Accessibility Issues Found

Announcements. Paper due this Wednesday

CHAPTER 2 MARKUP LANGUAGES: XHTML 1.0

Portia Documentation. Release Scrapinghub

Basics of Web Design, 3 rd Edition Instructor Materials Chapter 2 Test Bank

NAME: name a section of the page TARGET = "_blank" "_parent" "_self" "_top" window name which window the document should go in

Review of HTML. Chapter Pearson. Fundamentals of Web Development. Randy Connolly and Ricardo Hoar

XHTML Mobile Profile. Candidate Version Feb Open Mobile Alliance OMA-TS-XHTMLMP-V1_ C

2.1 Origins and Evolution of HTML

HTML 4.0 Specification

WCAG2-AAA accessibility report for

UNIT-02 Hyper Text Markup Language (HTML) UNIT-02/LECTURE-01 Introduction to Hyper Text Markup Language (HTML) About HTML: [RGPV/Dec 2013(4)]

Non element markup constructs

Tables & Lists. Organized Data. R. Scott Granneman. Jans Carton

Evaluation of Different Approaches to Training a Genre Classifier

Skill Area 323: Design and Develop Website. Multimedia and Web Design (MWD)

HTML. HTML Evolution

Style-Guidelines. Release 0.1

Web Development and Design Foundations with HTML5 8th Edition

Tutorial 2 - HTML basics

HTML CS 4640 Programming Languages for Web Applications

WCAG2-AA accessibility report for

WCAG2-AA accessibility report for

Summary 4/5. (contains info about the html)

WCAG2-AA accessibility report for

COMS W3101: SCRIPTING LANGUAGES: JAVASCRIPT (FALL 2017)

1/6/ :28 AM Approved New Course (First Version) CS 50A Course Outline as of Fall 2014

HTML & CSS: The Complete Reference, Fifth Edition

MOVING TOWARD XHTML 2

Government of Karnataka Department of Technical Education Bengaluru

COMP519: Web Programming Lecture 4: HTML (Part 3)

LA TROBE UNIVERSITY SEMESTER ONE EXAMINATION PERIOD CAMPUS AW BE BU MI SH ALLOWABLE MATERIALS

Web Development and HTML. Shan-Hung Wu CS, NTHU

Digital Asset Management 2. Introduction to Digital Media Format

Transcription:

HTML: Parsing Library Version 4.1.3 November 20, 2008 (require html) The html library provides functions to read html documents and structures to represent them. (read-xhtml port) html? port : input-port? (read-html port) html? port : input-port? Reads (X)HTML from a port, producing an html instance. (read-html-as-xml port) (listof content?) port : input-port? Reads HTML from a port, producing an xexpr compatible with the xml library (which defines content?). 1

1 Example (module html-example scheme ; Some of the symbols in html and xml conflict with ; each other and with scheme/base language, so we prefix ; to avoid namespace conflict. (require (prefix-in h: html) (prefix-in x: xml)) (define an-html (h:read-xhtml (open-input-string (string-append "<html><head><title>my title</title></head><body>" "<p>hello world</p><p><b>testing</b>!</p>" "</body></html>")))) ; extract-pcdata: html-content -> (listof string) ; Pulls out the pcdata strings from some-content. (define (extract-pcdata some-content) (cond [(x:pcdata? some-content) (list (x:pcdata-string some-content))] [(x:entity? some-content) (list)] [else (extract-pcdata-from-element some-content)])) ; extract-pcdata-from-element: html-element -> (listof string) ; Pulls out the pcdata strings from an-html-element. (define (extract-pcdata-from-element an-html-element) (match an-html-element [(struct h:html-full (content)) (apply append (map extract-pcdata content))] [(struct h:html-element (attributes)) ()])) (printf " s n" (extract-pcdata an-html))) > (require html-example) ("My title" "Hello world" "Testing" "!") 2

2 HTML Structures pcdata, entity, and attribute are defined in the xml documentation. A html-content is either html-element pcdata entity (struct html-element (attributes)) attributes : (listof attribute) Any of the structures below inherits from html-element. (struct (html-full struct:html-element) (content)) content : (listof html-content) Any html tag that may include content also inherits from html-full without adding any additional fields. (struct (html html-full) ()) A html is (make-html (listof attribute) (listof Contents-of-html)) A Contents-of-html is either body head (struct (div html-full) ()) A div is (make-div (listof attribute) (listof G2)) (struct (center html-full) ()) A center is (make-center (listof attribute) (listof G2)) 3

(struct (blockquote html-full) ()) A blockquote is (make-blockquote (listof attribute) G2) (struct (ins html-full) ()) An Ins is (make-ins (listof attribute) (listof G2)) (struct (del html-full) ()) A del is (make-del (listof attribute) (listof G2)) (struct (dd html-full) ()) A dd is (make-dd (listof attribute) (listof G2)) (struct (li html-full) ()) A li is (make-li (listof attribute) (listof G2)) (struct (th html-full) ()) A th is (make-th (listof attribute) (listof G2)) (struct (td html-full) ()) A td is (make-td (listof attribute) (listof G2)) (struct (iframe html-full) ()) An iframe is (make-iframe (listof attribute) (listof G2)) (struct (noframes html-full) ()) A noframes is (make-noframes (listof attribute) (listof G2)) (struct (noscript html-full) ()) 4

A noscript is (make-noscript (listof attribute) (listof G2)) (struct (style html-full) ()) A style is (make-style (listof attribute) (listof pcdata)) (struct (script html-full) ()) A script is (make-script (listof attribute) (listof pcdata)) (struct (basefont html-element) ()) A basefont is (make-basefont (listof attribute)) (struct (br html-element) ()) A br is (make-br (listof attribute)) (struct (area html-element) ()) An area is (make-area (listof attribute)) (struct (alink html-element) ()) A alink is (make-alink (listof attribute)) (struct (img html-element) ()) An img is (make-img (listof attribute)) (struct (param html-element) ()) A param is (make-param (listof attribute)) (struct (hr html-element) ()) A hr is (make-hr (listof attribute)) 5

(struct (input html-element) ()) An input is (make-input (listof attribute)) (struct (col html-element) ()) A col is (make-col (listof attribute)) (struct (isindex html-element) ()) An isindex is (make-isindex (listof attribute)) (struct (base html-element) ()) A base is (make-base (listof attribute)) (struct (meta html-element) ()) A meta is (make-meta (listof attribute)) (struct (option html-full) ()) An option is (make-option (listof attribute) (listof pcdata)) (struct (textarea html-full) ()) A textarea is (make-textarea (listof attribute) (listof pcdata)) (struct (title html-full) ()) A title is (make-title (listof attribute) (listof pcdata)) (struct (head html-full) ()) A head is (make-head (listof attribute) (listof Contents-of-head)) A Contents-of-head is either base 6

isindex link meta object script style title (struct (tr html-full) ()) A tr is (make-tr (listof attribute) (listof Contents-of-tr)) A Contents-of-tr is either td th (struct (colgroup html-full) ()) A colgroup is (make-colgroup (listof attribute) (listof col)) (struct (thead html-full) ()) A thead is (make-thead (listof attribute) (listof tr)) (struct (tfoot html-full) ()) A tfoot is (make-tfoot (listof attribute) (listof tr)) (struct (tbody html-full) ()) A tbody is (make-tbody (listof attribute) (listof tr)) (struct (tt html-full) ()) 7

A tt is (make-tt (listof attribute) (listof G5)) (struct (i html-full) ()) An i is (make-i (listof attribute) (listof G5)) (struct (b html-full) ()) A b is (make-b (listof attribute) (listof G5)) (struct (u html-full) ()) An u is (make-u (listof attribute) (listof G5)) (struct (s html-full) ()) A s is (make-s (listof attribute) (listof G5)) (struct (strike html-full) ()) A strike is (make-strike (listof attribute) (listof G5)) (struct (big html-full) ()) A big is (make-big (listof attribute) (listof G5)) (struct (small html-full) ()) A small is (make-small (listof attribute) (listof G5)) (struct (em html-full) ()) An em is (make-em (listof attribute) (listof G5)) (struct (strong html-full) ()) A strong is (make-strong (listof attribute) (listof G5)) 8

(struct (dfn html-full) ()) A dfn is (make-dfn (listof attribute) (listof G5)) (struct (code html-full) ()) A code is (make-code (listof attribute) (listof G5)) (struct (samp html-full) ()) A samp is (make-samp (listof attribute) (listof G5)) (struct (kbd html-full) ()) A kbd is (make-kbd (listof attribute) (listof G5)) (struct (var html-full) ()) A var is (make-var (listof attribute) (listof G5)) (struct (cite html-full) ()) A cite is (make-cite (listof attribute) (listof G5)) (struct (abbr html-full) ()) An abbr is (make-abbr (listof attribute) (listof G5)) (struct (acronym html-full) ()) An acronym is (make-acronym (listof attribute) (listof G5)) (struct (sub html-full) ()) A sub is (make-sub (listof attribute) (listof G5)) (struct (sup html-full) ()) A sup is (make-sup (listof attribute) (listof G5)) 9

(struct (span html-full) ()) A span is (make-span (listof attribute) (listof G5)) (struct (bdo html-full) ()) A bdo is (make-bdo (listof attribute) (listof G5)) (struct (font html-full) ()) A font is (make-font (listof attribute) (listof G5)) (struct (p html-full) ()) A p is (make-p (listof attribute) (listof G5)) (struct (h1 html-full) ()) A h1 is (make-h1 (listof attribute) (listof G5)) (struct (h2 html-full) ()) A h2 is (make-h2 (listof attribute) (listof G5)) (struct (h3 html-full) ()) A h3 is (make-h3 (listof attribute) (listof G5)) (struct (h4 html-full) ()) A h4 is (make-h4 (listof attribute) (listof G5)) (struct (h5 html-full) ()) A h5 is (make-h5 (listof attribute) (listof G5)) (struct (h6 html-full) ()) 10

A h6 is (make-h6 (listof attribute) (listof G5)) (struct (q html-full) ()) A q is (make-q (listof attribute) (listof G5)) (struct (dt html-full) ()) A dt is (make-dt (listof attribute) (listof G5)) (struct (legend html-full) ()) A legend is (make-legend (listof attribute) (listof G5)) (struct (caption html-full) ()) A caption is (make-caption (listof attribute) (listof G5)) (struct (table html-full) ()) A table is (make-table (listof attribute) (listof Contents-of-table)) A Contents-of-table is either caption col colgroup tbody tfoot thead (struct (button html-full) ()) A button is (make-button (listof attribute) (listof G4)) (struct (fieldset html-full) ()) 11

A fieldset is (make-fieldset (listof attribute) (listof Contents-offieldset)) A Contents-of-fieldset is either legend G2 (struct (optgroup html-full) ()) An optgroup is (make-optgroup (listof attribute) (listof option)) (struct (select html-full) ()) A select is (make-select (listof attribute) (listof Contents-of-select)) A Contents-of-select is either optgroup option (struct (label html-full) ()) A label is (make-label (listof attribute) (listof G6)) (struct (form html-full) ()) A form is (make-form (listof attribute) (listof G3)) (struct (ol html-full) ()) An ol is (make-ol (listof attribute) (listof li)) (struct (ul html-full) ()) An ul is (make-ul (listof attribute) (listof li)) 12

(struct (dir html-full) ()) A dir is (make-dir (listof attribute) (listof li)) (struct (menu html-full) ()) A menu is (make-menu (listof attribute) (listof li)) (struct (dl html-full) ()) A dl is (make-dl (listof attribute) (listof Contents-of-dl)) A Contents-of-dl is either dd dt (struct (pre html-full) ()) A pre is (make-pre (listof attribute) (listof Contents-of-pre)) A Contents-of-pre is either G9 G11 (struct (object html-full) ()) An object is (make-object (listof attribute) (listof Contents-of-objectapplet)) (struct (applet html-full) ()) An applet is (make-applet (listof attribute) (listof Contents-of-objectapplet)) A Contents-of-object-applet is either 13

param G2 (struct (map html-full) ()) A Map is (make-map (listof attribute) (listof Contents-of-map)) A Contents-of-map is either area fieldset form isindex G10 (struct (a html-full) ()) An a is (make-a (listof attribute) (listof Contents-of-a)) A Contents-of-a is either label G7 (struct (address html-full) ()) An address is (make-address (listof attribute) (listof Contents-ofaddress)) A Contents-of-address is either p G5 14

(struct (body html-full) ()) A body is (make-body (listof attribute) (listof Contents-of-body)) A Contents-of-body is either del ins G2 A G12 is either button iframe input select textarea A G11 is either a label G12 A G10 is either address blockquote center dir div dl h1 15

h2 h3 h4 h5 h6 hr menu noframes noscript ol p pre table ul A G9 is either abbr acronym b bdo br cite code dfn em i kbd map 16

pcdata q s samp script span strike strong tt u var A G8 is either applet basefont big font img object small sub sup G9 A G7 is either G8 G12 17

A G6 is either a G7 A G5 is either label G6 A G4 is either G8 G10 A G3 is either fieldset isindex G4 G11 A G2 is either form G3 18