An HTML5 Conformance Checker

HELSINKI UNIVERSITY OF TECHNOLOGY
Department of Computer Science and Engineering
Laboratory of Software Technology
Henri Sivonen
An HTML5 Conformance Checker
Master’s Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Technology.
Helsinki, 7 May 2007
Supervisor and Instructor: Professor Jorma Tarhio
© 2006–2007 Henri Sivonen
Digital versions of this thesis (including the source files) may be obtained from:
http://hsivonen.iki.fi/thesis/
This literary work (“Work”) is licensed under the Creative Commons Attribution-ShareAlike license version 2.5 or later (“License”). The license text is available from http://creativecommons.org/licenses/by-sa/2.5/.

You may have received the Work aggregated in a digital file or on a tangible medium together with a Creative Commons license badge graphic and/or the “wing” emblem of Helsinki University of Technology, and/or you may have received the Work in a digital file that contains embedded fonts. The license badge, the “wing” emblem and the embedded fonts are not part of the Work and are not covered by the License. For avoidance of doubt, when the Work or a Derivative Work is distributed as a file containing embedded fonts (e.g. PDF) or as a markup document accompanied by external style definitions and the markup document can be intelligibly rendered using the default style definitions of typical rendering software (e.g. semantic HTML or LaTeX using standard macro names), the embedded fonts or the accompanying external style definitions are not considered to be subject to the ShareAlike provision of the License by the Licensor and are not required to be licensed under the License.

The Licensor believes that the license badge, the “wing” emblem and the embedded fonts do not encumber exercising the rights granted under the License for distribution of verbatim copies of the Work. However, if you create a Derivative Work, please ensure that you are permitted to use the elements that are not covered by the License or delete the elements if in doubt.

Please refer to http://creativecommons.org/policies for information about the use of the license badge. If you have inquiries about the “wing” emblem, please contact Helsinki University of Technology.
If you create a Derivative Work, please make it clear that it is not the original version and that your modifications were not made by the original author. In addition, please do not call a Derivative Work a master’s thesis written at Helsinki University of Technology. Saying that the Derivative Work is based on a master’s thesis written at Helsinki University of Technology would be appreciated, though.
HELSINKI UNIVERSITY OF TECHNOLOGY
ABSTRACT OF MASTER’S THESIS

Author: Henri Sivonen
Department: Computer Science and Engineering
Major: Software Systems
Minor: Strategy and International Business
Title of the thesis: An HTML5 Conformance Checker
Number of pages: xiv + 108
Date: 7 May, 2007
Professorship: T-106 Software Technology
Supervisor: Professor Jorma Tarhio
Instructors: Professor Jorma Tarhio

The Web Hypertext Application Technology Working Group (WHATWG) is developing HTML5 and its parallel XML version, XHTML5, as successors for HTML 4.01 and XHTML 1.0. An (X)HTML5 conformance checker is expected to take the role that DTD-based validators have had with earlier (X)HTML. Conformance checking goes beyond the capabilities of DTDs. The WHATWG does not prescribe an implementation strategy for conformance checkers and does not endorse schema languages.

Although no schema language is adequate for describing the conformance requirements for (X)HTML5, a mainly RELAX NG-based implementation approach was chosen for this project nonetheless. In this project, the bulk of the (X)HTML5 language is described as a RELAX NG schema that is supported by a custom datatype library written in Java. A Schematron schema is used alongside RELAX NG for enforcing constraints for which RELAX NG is not suitable. The remaining requirements are enforced by custom code written in Java. For checking HTML5, which is a language on its own and is not an SGML or XML vocabulary, a special-purpose parser was developed so that the XML tools can work on XHTML5-like parse events.

The design of the system is discussed and found to be successful. The ease of expressing and changing the grammar is identified as the main benefit of RELAX NG. The inability to easily fine-tune error messages is identified as a drawback. Schematron is found to be more suitable than RELAX NG for expressing exclusions and referential integrity constraints. A checker for checking the integrity of HTML tables is presented as the main example of a non-schema-based checker implemented in Java.

Keywords: HTML5, conformance checking, HTML, validation, XHTML, XML, WHATWG, RELAX NG, Schematron, SAX, Web
TEKNILLINEN KORKEAKOULU (Helsinki University of Technology)
DIPLOMITYÖN TIIVISTELMÄ (Abstract of Master’s Thesis, translated from the Finnish)

Author: Henri Sivonen
Department: Computer Science and Engineering
Major: Software Systems
Minor: Strategy and International Business
Title of the thesis: HTML5-konformanssitarkistin (An HTML5 Conformance Checker)
Number of pages: xiv + 108
Date: 7 May 2007
Professorship: T-106 Software Technology
Supervisor: Professor Jorma Tarhio
Instructors: Professor Jorma Tarhio

The Web Hypertext Application Technology Working Group (WHATWG) is developing HTML5 and its parallel XML version, XHTML5, as successors to HTML 4.01 and XHTML 1.0. An (X)HTML5 conformance checker is expected to take the role that DTD-based validators have had for earlier (X)HTML. Conformance checking goes beyond the capabilities of DTDs. The WHATWG does not prescribe an implementation strategy for conformance checkers and does not endorse any schema languages.

Although no schema language is adequate for describing the conformance requirements of (X)HTML5, a mainly RELAX NG-based implementation approach was nevertheless chosen for this project. In this project, the bulk of the (X)HTML5 language is described as a RELAX NG schema that is supported by a custom datatype library written in Java. A Schematron schema is used alongside RELAX NG to enforce constraints for which RELAX NG is not suitable. The remaining constraints are enforced by custom Java code. For checking HTML5, which is an independent language and not an SGML or XML vocabulary, a purpose-built parser was developed so that the XML tools can listen to XHTML5-like parse events.

The design decisions of the system are discussed and found to be successful. The ease of expressing and changing the grammar is identified as the main advantage of RELAX NG. The inability to easily fine-tune error messages is identified as a drawback. Schematron is found to be more suitable than RELAX NG for expressing exclusions and referential integrity constraints. A checker for checking the integrity of HTML tables is presented as the main example of a non-schema-based checker implemented in Java.

Keywords: HTML5, conformance checking, HTML, validation, XHTML, XML, WHATWG, RELAX NG, Schematron, SAX, Web
Acknowledgements

This Master’s thesis has been written at the Laboratory of Software Technology of Helsinki University of Technology.

I want to thank Ian Hickson for all his work on HTML5, without which this thesis would not exist.

I wish to thank Elika Etemad (fantasai) for developing the core RELAX NG schema for HTML5, for letting me build upon the schema, for reviewing and commenting on my changes to the schema, and for reviewing drafts of this thesis.

I would also like to thank the Mozilla Foundation for funding this project and Frank Hecker of the Mozilla Foundation for supporting this project.

I want to thank James Clark for developing the Jing validation engine that the software developed in this project is based on.

My gratitude also goes to members of the #turska and #whatwg IRC channels as well as the members of the WHATWG mailing list.

I would like to thank YesLogic Pty. Ltd., SyncRO Soft Ltd. and Oskar Ojala for software that I used to make this thesis publishable.

I wish to thank my instructor and supervisor, Professor Jorma Tarhio.

Finally, I would like to thank my family.

Helsinki, 7 May 2007
Henri Sivonen
Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Methods
  1.4 Availability of the Software
  1.5 Organization of this Thesis
2 History of HTML Leading to HTML5
  2.1 Early HTML
    2.1.1 Initial HTML at CERN
    2.1.2 The IIIR Draft
    2.1.3 HTML+
    2.1.4 HTML 2.0
    2.1.5 HTML 3.0
    2.1.6 HTML 3.2
  2.2 Contemporary HTML
    2.2.1 HTML 4
    2.2.2 ISO HTML
    2.2.3 XHTML 1.0
    2.2.4 Modularization
  2.3 HTML5
    2.3.1 The Mozilla/Opera Joint Position Paper
    2.3.2 The WHATWG is Formed
    2.3.3 The WHATWG Specifications
3 Schema Languages
  3.1 DTDs
    3.1.1 Infoset Augmentation
    3.1.2 Datatyping
    3.1.3 Other Problems with DTDs
  3.2 W3C XML Schema
  3.3 Document Structure Description
  3.4 TREX, RELAX, XDuce and DDML
  3.5 RELAX NG
    3.5.1 Datatyping
    3.5.2 Compact Syntax
    3.5.3 Use in This Project
  3.6 Schematron
    3.6.1 Using RELAX NG and Schematron Together
    3.6.2 Use in This Project
4 Prior Work on Markup Checking
  4.1 The W3C Markup Validation Service
  4.2 WDG HTML Validator
  4.3 Page Valet
  4.4 The Schneegans XML Schema Validator
  4.5 Relaxed
  4.6 The Feed Validator
  4.7 Validome
5 Implementation
  5.1 The SAX API
  5.2 The HTML Parser
    5.2.1 HTML5 as an Alternative Infoset Serialization
    5.2.2 TagSoup
    5.2.3 Parser Design
    5.2.4 Minor Problems
  5.3 Front End
  5.4 Back End Design
  5.5 The Jing Validation Engine
  5.6 The RELAX NG Schema
    5.6.1 The General Schema Design
    5.6.2 Common Definitions
    5.6.3 Examples of Elements
  5.7 The HTML5 Datatype Library
    5.7.1 Dates
    5.7.2 IRIs
    5.7.3 Language Tags
    5.7.4 ECMAScript Regular Expressions
  5.8 The Schematron Schema
    5.8.1 Exclusions
    5.8.2 Required Ancestors
    5.8.3 Referential Integrity
  5.9 The Non-Schema-Based Checkers
    5.9.1 Table Integrity Checker
    5.9.2 Checking the Text Content of Specific Elements
    5.9.3 Checking for Significant Inline Content
    5.9.4 Unicode Normalization Checking
  5.10 Character Model Checking
6 Shortcomings
  6.1 Non-Ideal Error Messages
    6.1.1 Bimorphic Content Models
    6.1.2 Lack of Datatype Diagnostics
    6.1.3 Erroneous Source Is Not Shown
  6.2 Poor Localizability
  6.3 Opportunities for Optimization
    6.3.1 RELAX NG
    6.3.2 Schematron
7 Applicability in Other Contexts
  7.1 Auto-completion
  7.2 Content Management Systems
8 Future Work
  8.1 Open Up
  8.2 The HTML5 Parsing Algorithm
  8.3 Tracking the Specification
  8.4 RELAX NG Message Improvements
  8.5 Completion of the Datatype Library
  8.6 More Non-Schema-Based Checkers
  8.7 Assistance for Checking Human-Checkable Requirements
  8.8 Web Service
  8.9 Embedded MathML and SVG
  8.10 Showing the Erroneous Source Markup
9 Conclusions
  9.1 Correct Expectations
  9.2 Incorrect Expectations about RELAX NG
  9.3 Unexpected Discoveries about Schematron
  9.4 Overall Assessment
References
Appendix: Table Integrity Checker
Glossary

API: Application Programming Interface, a convention for accessing the functionality provided by a program module.
ASCC: Academia Sinica Computing Centre, the birthplace of Schematron.
ASCII: American Standard Code for Information Interchange, a 7-bit character encoding.
Character: An atomic semantic component of text.
Character encoding: A way to encode a sequence of code points as a sequence of bytes.
Character set: A collection of characters with a code point assigned to each character (strictly a coded character set).
Code point: An integer that identifies a character.
Code unit: A cluster of bits treated as a unit in a character encoding.
Conformance checking: The process of checking whether a document satisfies the conformance criteria given in the specification for the language the document is expressed in.
Conforming: Satisfying the conformance criteria given in the applicable specification.
CSS: Cascading Style Sheets, the style sheet language of the Web.
DDML: Document Definition Markup Language, a schema language for XML that was abandoned in favor of the XSD.
Document tree: The tree structure embodied in the syntax of an HTML or XML document which can be parsed from the syntax and represented as a concrete data structure.
DOM: Document Object Model, an API and the associated data model for representing HTML and XML document trees.
DSD: Document Structure Description, a schema language for XML.
DSDL: Document Schema Definition Language, a family of ISO standards that define schema languages for XML (including RELAX NG and ISO Schematron).
DTD: Document Type Definition, the built-in schema language of SGML and XML.
GML: Generalized Markup Language, IBM’s predecessor to SGML.
HTML: HyperText Markup Language, the language most Web pages are expressed in.
HTML5: HyperText Markup Language 5, a new version of HTML drafted by the WHATWG.
HTTP: Hypertext Transfer Protocol, the primary transfer protocol of the Web.
IANA: Internet Assigned Numbers Authority, an organization that maintains registries of Internet media types, language tags, IRI schemes, etc.
IBM: International Business Machines, a corporation.
IE: Internet Explorer, Microsoft’s Web browser.
IETF: Internet Engineering Task Force, a standardization organization for Internet technologies.
IIIR: Integration of Internet Information Resources, an IETF working group that published an early draft specification for HTML.
Infoset: The information structure embodied in an XML document, a formalization of the document tree.
IRI: Internationalized Resource Identifier, a network resource addressing scheme that allows Unicode characters (as opposed to mere ASCII).
ISO: International Organization for Standardization, a standardization organization.
JSP: JavaServer Pages, a way of mixing markup literals and Java code.
Markup language: A computer language that contains machine-readable annotations (markup) to human-readable text.
MSV: Sun Multi-Schema Validator, a validation engine for RELAX NG and some other schema languages for XML.
OASIS: Organization for the Advancement of Structured Information Standards, an XML-oriented standardization organization.
PDA: Personal Digital Assistant, a product class of portable computing devices.
PSVI: Post-Schema Validation Infoset, the infoset of an XML document augmented with datatype information as the result of XSD validation.
RELAX: Regular Language description for XML, a schema language for XML which together with TREX was used as the basis of RELAX NG.
RELAX NG: A grammar-based schema language for XML.
REST: Representational State Transfer, an architectural style for distributed systems embodied in the design of HTTP.
RFC: Request for Comments, a numbered memorandum published by the IETF.
SAX: Simple API for XML, a de facto standard parse event-based API through which an XML parser reports the infoset of the parsed XML document to an application as callbacks.
Schema: A formal definition (expressed in a schema language) that partitions the set of all possible XML documents into two disjoint sets so that each document is in either set: documents that are valid according to the schema and documents that are not valid according to the schema.
Schema language: A computer language for expressing a schema.
Schematron: An assertion-based schema language for XML.
SGML: Standard Generalized Markup Language, a syntax framework for defining markup languages.
SVG: Scalable Vector Graphics, an XML language for representing two-dimensional vector graphics.
TREX: Tree Regular Expressions for XML, a schema language for XML which together with RELAX was used as the basis of RELAX NG.
Unicode: The universal coded character set.
UTF-8: Unicode Transformation Format with 8-bit code units, the preferred character encoding for information interchange.
UTF-16: Unicode Transformation Format with 16-bit code units, the character encoding used for strings in Java.
UTF-32: Unicode Transformation Format with 32-bit code units, a character encoding where a single Unicode code point is always encoded as a single code unit.
W3C: World Wide Web Consortium, an industry consortium that publishes specifications for World Wide Web technologies.
Valid: Satisfying a schema.
Validation: The process of checking if a document is valid according to a schema.
WDG: Web Design Group, a group that offers materials and tools (a validator, in particular) related to HTML.
WHATWG: Web Hypertext Application Technology Working Group, a collaborative group of browser vendors, Web developers and other interested parties that works on the next generation of HTML.
WXS: W3C XML Schema, a schema language for XML.
XHTML: Extensible HyperText Markup Language, HTML reformulated as an XML language.
(X)HTML: (Extensible) HyperText Markup Language, a catch-all term for both HTML and XHTML.
XHTML5: XML serialization of HyperText Markup Language 5, the XML language defined in parallel with HTML5.
(X)HTML5: (XML serialization of) HyperText Markup Language 5, a catch-all term for both HTML5 and XHTML5.
XML: Extensible Markup Language, a syntax framework for defining markup languages, a stand-alone simplification of SGML.
XSD: XML Schema Definition, the filename extension for and, consequently, the common way to refer to W3C XML Schema.
XSLT: Extensible Stylesheet Language Transformations, a programming language designed for transforming XML documents into different XML documents.
Chapter 1
Introduction

The Web Hypertext Application Technology Working Group (WHATWG) is developing HTML5 and its parallel XML version, XHTML5, as successors for HTML 4.01 and XHTML 1.0. HTML5 and XHTML5 are defined by the combination of WHATWG’s Web Applications 1.0 [WebApps] and Web Forms 2.0 [WebForms2] specifications. To be successful, a new markup language not only needs support from browsers, it also needs tools that support authoring. Authoring-side tools include editors, content management systems and quality assurance tools for checking the correctness of markup. This thesis focuses on the last.
1.1 Motivation

Web authors tend to make mistakes when writing HTML. The vast majority of HTML documents on the Web are syntactically incorrect. A test of the HTML5 parsing algorithm on several billion documents spidered by Google indicated that 93% of documents had errors on the lowest levels of the syntax [Several]. (Documents in the remaining 7% may well have higher-level errors that are not found by the parsing algorithm and would require a full conformance checker to find.)

Even though most Web content is broken without hope of repair and browsers will do something with any input purporting to be HTML, it is still useful to provide a quality assurance tool for authors. Even if browsers adopt the well-defined error-recovering processing models of HTML5, authors generally do not make errors on purpose in order to elicit a particular error recovery response. Silent recovery from inadvertent mistakes, even if deterministic and well defined, may still confuse an author who did not mean to invoke error recovery. The issue becomes more apparent when an author uses a style sheet or a script that assumes the document to be correct. Therefore, it is worthwhile to provide a conformance checker that helps authors find their mistakes.
1.2 Objectives

The functional objective of the project described in this thesis was developing a partial (X)HTML5 conformance checker that is comprehensive enough to demonstrate that it can be taken to completion once (X)HTML5 itself has stabilized. The research goals were 1) finding out if a hybrid implementation based both on schemata and on custom code developed in a general-purpose programming language is feasible and 2) finding out if an XML toolchain can be successfully applied to checking the non-XML serialization of HTML5. Only markup checking without executing scripts is considered due to the halting problem [Computable].
1.3 Methods

For HTML 4.01 and XHTML 1.0, validators based on Document Type Definitions (DTDs), the built-in schema language of SGML and XML, have traditionally been used as the quality assurance tools for checking correctness even though they do not check for all machine-checkable conformance requirements. For (X)HTML5, a conformance checker is expected to take the role that DTD-based validators have had with earlier (X)HTML. Conformance checking goes beyond the capabilities of DTDs. The WHATWG does not prescribe an implementation strategy for conformance checkers and does not endorse schema languages. Not only are schema languages unendorsed, but they are also seen as being clearly inadequate. Therefore, a non-schema-based implementation strategy is implied. Yet, as an initial impression, abandoning schemata altogether just because they cannot be used for checking every machine-checkable constraint seems overly drastic. Hence, I chose a hybrid approach that uses schemata for what they are good for and uses a non-schema-based implementation strategy for what schemata are not good for.

I chose RELAX NG as the primary schema language, and Schematron as a supporting schema language. Using RELAX NG for document-oriented schemata (as opposed to databinding-oriented schemata) had gained acceptance as the best practice among users of XML schema languages. Schematron had gained popularity as a language for refining RELAX NG schemata. Elika Etemad had already started a project for developing a RELAX NG schema for HTML5 [HTML5RNG]. Moreover, I had already developed a service that allows Web users to validate XML documents against arbitrary RELAX NG and Schematron 1.5 schemata [ValidatorAbout]. I had developed the service in the Java programming language due to the excellent availability of XML tools for Java. I chose Etemad’s schema project and the service I had already developed as starting points for this thesis project. Since I had written my pre-existing software in Java, it followed that I would also write the new non-schema code in Java.

The parsed syntax tree for HTML5 and the parsed syntax tree for XML are very similar. Since reusable tools exist for XML, I decided to use XML tools and to map HTML5 documents to equivalent XHTML5 representations in the parser. To this end, I wrote an HTML parser (Section 5.2) that acts as if it were an XML parser parsing XHTML.
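
As a hedged illustration of what working on parse events means in practice, the following minimal Java sketch feeds a small XHTML document to the stock XML parser shipped with the JDK and records the SAX start-element events it reports. This is not the HTML parser developed in this project; the JDK parser merely stands in for it, and the class name SaxEventSketch and the method elementNames are hypothetical names introduced for this illustration. Under the approach described above, a consumer written against the same callback interface would receive equivalent events regardless of whether the input was XHTML read by an XML parser or HTML read by a parser that mimics one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of the software described in the thesis.
class SaxEventSketch {

    /** Returns the local names of all elements in the order their start events arrive. */
    static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();

        List<String> names = new ArrayList<>();
        // The handler plays the role of a downstream checker: it only sees events,
        // never the original bytes of the document.
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(localName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<head><title>t</title></head><body><p>Hi</p></body></html>";
        System.out.println(elementNames(xhtml)); // prints [html, head, title, body, p]
    }
}
```

Because the handler depends only on the event interface, the same code could in principle sit behind any event source that honors the SAX contract, which is the property the hybrid design relies on.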
1.4 Availability of the Software

The generic validation service that I used as the basis of the conformance checker is usable online at http://hsivonen.iki.fi/validator/.

The work I did in order to add HTML5 support to the generic validation service included:
• an HTML parser (Section 5.2)
• significant work on a RELAX NG schema for (X)HTML5 (Section 5.6)
• a Schematron schema complementing the RELAX NG schema (Section 5.8)
• a RELAX NG datatype library for HTML5 datatypes (Section 5.7)
• non-schema-based checkers for requirements that schemata cannot express (Section 5.9)

The software I developed is Free Software / Open Source. The source code may be obtained by following links from http://hsivonen.iki.fi/validator-about/.

The product of this thesis project is usable online at http://hsivonen.iki.fi/validator/html5/.
1.5 Organization of this Thesis

This thesis has two thematic parts. The first part (the next three chapters) reviews the context of this work. HTML5 is placed in historical context, schema languages for XML are reviewed and prior work on online markup checking services is reviewed. The second part (the last five chapters) focuses on the software implemented in this project. The implementation of the software, its shortcomings, and its applicability to other contexts are discussed. Finally, the need for future work is reviewed and the conclusions are given.
Chapter 2
History of HTML Leading to HTML5

This chapter reviews the history of Hypertext Markup Language (HTML) leading to HTML5.

HTML is, in principle, a semantic markup language. That is, it encodes, for example, that a particular piece of text is a heading as opposed to encoding the exact presentation. HTML has never been only about presentation and has never been only about encoding the profound semantics of text. The positioning of HTML somewhere in between these extremes has shifted in both directions with different versions.

Since one of the major changes in HTML5 is the way the specification deals with parsing and the stance the specification takes with respect to Standard Generalized Markup Language (SGML [ISO8879]), each version of HTML prior to HTML5 is summarized in terms of the key features introduced and in terms of the stated relationship to SGML or XML (Extensible Markup Language [XML]). SGML is a syntax framework for defining markup languages. XML is a simplification of SGML. SGML and XML define the parsing layer of markup language processing.
2.1 Early HTML

In this review, HTML versions prior to HTML 4 are considered “early”, as they are no longer in active use when new documents are created.

2.1.1 Initial HTML at CERN

Tim Berners-Lee invented the Web in 1989. He released the first version of his browser in 1990 [Raggett]. The system used HTML, but the language was not formally specified at first. Tim Berners-Lee designed HTML using ideas from SGML [Weaving]. However, HTML was not layered on top of the SGML standard but rather used a similar syntax without being a true application of SGML.

The element names available in HTML were largely taken from SGMLguid, an application of SGML used at CERN. SGMLguid, in turn, was similar to Waterloo SCRIPT GML [WaterlooGML], a GML language specified at the University of Waterloo. (GML [Generalized] was IBM’s predecessor to SGML.) [EarlyHistory] There are also similarities with the language given in the tutorial of the SGML standard [ISO8879].
2.1.2
The
IIIR
Draft
Tim
Berners-Lee
and
Dan
Connolly
wrote
an
Internet
Draft
specification
for
HTML
as
part
of
the
activity
of
the
Integration
of
Internet
Information
Resources
(
IIIR
)
working
group
of
the
Internet
Engineering
Task
Force
(
IETF
).
The
Internet
Draft
was published in June 1993.
[IIIR-HTML]
The draft said that HTML was defined in terms of SGML. However, the specification did not specify an HTML document as a conforming SGML document entity but instead said how to construct an SGML document from an HTML file [IIIR-HTML]. The draft also suggested that an HTML parser would not need to be a full SGML parser but a parser that only deals with the document instance after the DTD [IIIR-HTML]. W. Eliot Kimber, an SGML expert, challenged the purity of the drafted HTML approach in terms of SGML [ToBeDeleted]. The stated approach was changed in later specifications to make an HTML file directly an SGML document. However, browsers continued to use special-purpose parsers (as opposed to SGML parsers) as before. The mailing list discussions about the relationship of HTML to SGML are summarized in [Cascading].
The IIIR draft already included the IMG element for images. The P element was defined as an empty element that indicates paragraph breaks. As an interesting detail, the XMP, LISTING and PLAINTEXT elements for including verbatim text in HTML were considered obsolete as early as the IIIR draft (although they would still show up in the HTML5 parsing algorithm over a decade later [WebApps]). [IIIR-HTML]
The draft expired and did not reach the RFC status.
2.1.3 HTML+
Dave Raggett, one of the participants of the www-talk mailing list for discussing Web matters, visited Tim Berners-Lee at CERN to discuss further development face to face. Based on the discussion, Raggett drafted a new version of HTML called HTML+. [Raggett]
The draft specification for HTML+, published in late 1993, specifically stated that HTML+ was “based on the Standard Generalized Markup Language”. It also had a Document Type Definition (DTD), a formal grammar expressed using the built-in schema language (page 17) of SGML. In theory, having a DTD enabled the use of SGML parsers. However, the draft implied that there would be “HTML+ parsers” which would be different from “other SGML parsers”. HTML+ explicitly excluded SGML minimization features. It used the P element as a paragraph container but said that authors may think of the P tag as a paragraph separator. [HTMLplus]
HTML+ had a number of elements that never entered into actual usage, such as BYLINE, ONLINE, PRINTED, and ABSTRACT. As a curious detail, HTML+ included an element called IMAGE, which used the element content as the alternative text, a feature that would still be discussed over a decade later. HTML+ attempted to address the issue of mathematical formulae, but the coverage of types of formulae was not particularly comprehensive. [HTMLplus]
HTML+ defined markup for tables. The table markup is roughly what was later adopted in HTML 4. HTML+ also defined markup for forms similar to what was actually adopted in browsers. However, the field types also included types that were not adopted, such as URL, DATE and SCRIBBLE (for drawing). [HTMLplus]
Mainstream browsers never adopted HTML+. However, at the first World Wide Web conference it was agreed that the ideas from HTML+ should be carried forward. [Raggett]
2.1.4 HTML 2.0
Dan Connolly had advocated a cross-browser HTML specification at the first World Wide Web conference in early 1994. Subsequently, the IETF formed a working group to specify HTML. The working group, with Connolly in the lead, defined HTML 2.0 based on the then-current practice. The HTML 2.0 draft was published in July 1994. [Raggett]
The HTML 2.0 specification reached the RFC status in November 1995. The specification stated that “HTML is an application of ISO Standard 8879:1986 Information Processing Text and Office Systems; Standard Generalized Markup Language (SGML).” [RFC1866] However, it was too late to make browsers use SGML parsers. Instead, browsers continued to use special-purpose HTML parsers without standardized error recovery behavior. HTML 2.0 included a DTD, but the DTD was of no interest to browsers.
Unlike the elements proposed in HTML+, the elements of HTML 2.0 were (and still are) actually supported by browsers. HTML 2.0 included forms but did not include tables, which had been proposed in HTML+. Regardless, Netscape implemented tables in its browser in the HTML 2.0 era and made tables popular.
HTML 2.0 established that the document character set of HTML is ISO 10646 regardless of the character encoding used to transfer the document. (The character allocations in ISO 10646 track the allocations of Unicode [ISO10646] [Unicode].) However, the internationalization of HTML 2.0 was not fully addressed in the HTML 2.0 specification itself, and a standards track RFC that extended HTML 2.0 to address internationalization issues was published in January 1997 [RFC2070].
2.1.5 HTML 3.0
To keep the Web unified amidst product development by various competing vendors, an industry consortium called The World Wide Web Consortium (W3C) was founded in 1994 to develop specifications for the Web. [Weaving]
Dave Raggett, this time representing the W3C, edited a specification called HTML 3.0, which carried forward the ideas of HTML+ [Raggett]. To support the use of style sheets, HTML 3.0 introduced the STYLE element and the CLASS attribute, which lived on in HTML 4 [Raggett]. An HTML 3.0 draft was published through the IETF as an Internet Draft [HTML30].
Meanwhile, Netscape extended HTML on its own. In particular, its extensions included presentational features instead of adopting style sheets.
HTML 3.0 did not match what was being implemented in browsers. The draft specification was abandoned and never reached the RFC status [Raggett]. Even though HTML 3.0 as a whole was dropped, a specification for HTML tables (as an extension to HTML 2.0) was published as an experimental RFC. [RFC1942]
2.1.6 HTML 3.2
In November 1995, representatives of browser vendors and the W3C formed an HTML working group at the W3C. The following month, the IETF HTML working group was disbanded. [Raggett]
In January 1997, the W3C published the specification for HTML 3.2 as a Recommendation. Unlike HTML 3.0, HTML 3.2 documented actual practice that had grown as extensions to HTML 2.0. The specification itself stated: “HTML 3.2 aims to capture recommended practice as of early ’96 and as such to be used as a replacement for HTML 2.0 (RFC 1866).” [HTML32]
HTML 3.2 continued to say that HTML was an application of SGML: “HTML 3.2 is an SGML application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language. As an SGML application, the syntax of conforming HTML 3.2 documents is defined by the combination of the SGML declaration and the document type definition (DTD).” [HTML32]
However, even the specification itself admitted that SGML-compliance of user agents was not part of the actual practice as of early ’96 by noting: “Note that some user agents require attribute minimisation for the following attributes: COMPACT, ISMAP, CHECKED, NOWRAP, NOSHADE and NOHREF. These user agents don’t accept syntax such as COMPACT=COMPACT or ISMAP=ISMAP although this is legitimate according to the HTML 3.2 DTD.” [HTML32]
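The two spellings mentioned in the quotation can be illustrated with a minimal sketch. It uses the html.parser module of the Python standard library as a stand-in for a lenient special-purpose HTML tokenizer; it is neither an SGML parser nor one of the user agents the HTML 3.2 specification refers to, and the class and function names (AttributeCollector, collect) are made up for this illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records the attributes of every start tag the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; html.parser reports a
        # minimized attribute such as COMPACT with the value None.
        self.seen.append((tag, dict(attrs)))


def collect(markup):
    """Tokenize the markup and return the (tag, attributes) pairs seen."""
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


# Minimized form, which the user agents mentioned in the note required.
minimized = collect("<UL COMPACT><LI>item</UL>")
# Full form, legitimate according to the HTML 3.2 DTD.
full = collect("<UL COMPACT=COMPACT><LI>item</UL>")
```

This particular tokenizer accepts both forms and merely reports them differently; a user agent of the kind described in the note would have recognized only the first.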
In documenting the actual practice, HTML 3.2 included presentational features, such as the FONT element, that would later be deprecated.
2.2 Contemporary HTML
The versions of HTML discussed above are of historical interest and are not in active use for creating new documents. The versions in current use start with HTML 4.
2.2.1 HTML 4
HTML 4.0 was published as a W3C Recommendation in December 1997 [HTML40]. The specification formalized existing features that had been introduced by browser vendors but also introduced new features of its own. HTML 4.0 was revised without incrementing the version number, and the revision was published in April 1998 [HTML40rev]. Another revision called HTML 4.01 became a W3C Recommendation in December 1999 [HTML401].
Again, the specification said, “HTML 4 is an SGML application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language.” [HTML401] Yet, the specification acknowledged the reality that user agents in general are not conforming SGML systems: “SGML systems conforming to [ISO8879] are expected to recognize a number of features that aren’t widely supported by HTML user agents. We recommend that authors avoid using all of these features.” [HTML401]
HTML 4 deprecated presentational features such as the FONT element that had made its way to a W3C Recommendation less than a year before the first version of HTML 4. In principle, HTML 4 tried to backpedal on the point of presentational features to where HTML 2.0 had been, with the intent of leaving presentation to style sheets such as Cascading Style Sheets (CSS) [Cascading], which had been published as a W3C Recommendation [CSS1] the year before.
HTML 4 without the deprecated features was termed “Strict” and HTML 4 with the deprecated presentational features was termed “Transitional”. In practice, the deprecated features continue to be used nine years later even though CSS has been very successful both in terms of acceptance by Web authors and in terms of implementations.
HTML 4 included features for adding more structure to tables, for adding more structure to forms, and for marking up insertions and deletions. HTML 4 adopted the model proposed in the experimental RFC on HTML tables [RFC1942] dropping a few presentational attributes. Internationalization features, including support for bidirectional text (e.g. for Hebrew and Arabic), were adopted from the standards track RFC on the internationalization of HTML [RFC2070].
HTML 4 introduced the OBJECT element, which was supposed to eventually replace IMG, APPLET and the Netscape EMBED elements. EMBED did not fit together with an SGML DTD, because it could take arbitrary attributes. However, in practice, browsers continued to support EMBED, and even today browsers do not fully support OBJECT as designed.
HTML 4 formalized frames, which had been introduced by Netscape and were discredited [Frames] even before their inclusion in HTML 4. Additionally, HTML 4 included IFRAME from Microsoft.
2.2.2 ISO HTML
In 2000, ISO standardized its own version of HTML by referencing a subset of HTML 4.0 as defined by the W3C but also making changes other than merely subsetting in the DTD [ISO15445]. A technical corrigendum changed the references to HTML 4.01 [ISO15445TC1].
In practice, ISO HTML is only of curiosity value, since Web authors have largely ignored it.
2.2.3 XHTML 1.0
Extensible Markup Language (XML) 1.0 [XML] was published as a W3C Recommendation in February 1998 [AXML]. XML is a simplification of SGML that stands alone without making a normative reference to SGML. Since HTML was defined as an application of SGML and the W3C now had its own replacement for SGML, the W3C decided to swap the markup language framework from underneath HTML. The result was XHTML 1.0, a reformulation of HTML 4 in XML. XHTML 1.0 became a Recommendation in January 2000 [XHTML10].
XHTML 1.0 includes the features that were deprecated in HTML 4. That is, XHTML 1.0 has three versions just like HTML 4: Strict, Transitional and Frameset.
Appendix C. To be compatible with existing HTML user agents, the XHTML 1.0 specification included compatibility guidelines commonly known as “Appendix C”. Appendix C limits the syntactic sugar permitted by XML 1.0 so that an XHTML 1.0 document that adheres to Appendix C could be processed by existing HTML user agents if served using the text/html [RFC2854] media type. Appendix C relies on the fact that browsers do not actually process text/html as SGML.
Appendix C made it seem that XHTML 1.0 was succeeding by being adopted immediately. Obviously, since the browsers gained no new capabilities, using XHTML 1.0 could not actually deliver any true benefits over HTML 4 in user agents designed for HTML. No XML processor was involved despite XHTML 1.0 being a reformulation in XML. In fact, the HTML WG of the W3C gave an explicit opinion that browsers should not try to process documents served as text/html using a real XML processor [Sniffing].
There are experts close to the development of browser engines who have discredited the practice of serving XHTML as text/html, because authors are not actually invoking any new kind of processing but end up making documents that rely on error handling and would not work with the new kind of processing (e.g. [Harmful] and [Understanding]).
Processing as XML. Later on, Mozilla, Opera and Apple (three of the top four browser vendors after the demise of Netscape) took the XML nature of XHTML seriously and implemented support for XHTML 1.0 using a real XML processor. A real XML processor is used when the document is served using the application/xhtml+xml [RFC3236] media type (instead of text/html).
Serving pages as application/xhtml+xml has not become popular among Web authors for three reasons. First, since XHTML 1.0 is a reformulation of HTML 4 on top of another markup language framework, it (alone) does not enable new interesting things in the browser. This means there is not a compelling technical advantage to be gained from using XHTML 1.0 served as application/xhtml+xml over HTML 4.01 served as text/html. Second, the browser engine with the largest desktop market share (Trident, the engine of Microsoft’s Internet Explorer) still does not support application/xhtml+xml. (Browser market share is difficult to define and measure, but the global usage share of Internet Explorer is estimated to be above 80% in early 2007 [OneStat] [TheCounter].) Third, when a real XML processor is used, an error is reported if the document violates the well-formedness constraints of XML. Often, the document is not displayed at all if it violates these syntactic constraints. This means that a small authoring error breaks the document completely. In contrast, when content is served as text/html, browsers try to recover from markup errors.
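The third reason can be illustrated with a minimal sketch in Python. It assumes that xml.etree.ElementTree stands in for “a real XML processor” and that html.parser stands in for a lenient text/html tokenizer; neither is a browser engine, and the helper names (parses_as_xml, text_via_html_tokenizer) are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A small authoring error: the <b> element is never closed.
MARKUP = "<p>Hello <b>world</p>"


def parses_as_xml(markup):
    """Return True if the XML processor accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # A well-formedness violation is fatal: no document tree is produced.
        return False


class TextCollector(HTMLParser):
    """Collects character data and ignores tag mismatches."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_via_html_tokenizer(markup):
    """Return the text content recovered by the lenient tokenizer."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

With the unclosed element, parses_as_xml(MARKUP) is false and nothing could be displayed, whereas text_via_html_tokenizer(MARKUP) still yields the text of the paragraph. Real browsers recover in more elaborate ways than this tokenizer does.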
Moreover, there are subtle differences in the ways Cascading Style Sheets [CSS2] and the Document Object Model [DOM2] API exposed to JavaScript interact with text/html and application/xhtml+xml. Differences involve issues such as case-sensitivity and whether elements are in a namespace [MozFAQ]. In addition, document.write(), which allows scripts to insert data into the character stream being parsed, does not work in XML. In practice, scripts written naïvely for XHTML served as text/html do not work when the document is served as application/xhtml+xml.
2.2.4 Modularization
The W3C decided to abandon the development of the old non-XML HTML and to only develop XHTML. After the reformulation of HTML 4 in XML, which became XHTML 1.0, the W3C HTML working group proceeded to modularize XHTML.
Modularization meant dividing XHTML into logical parts such as Hypertext Module and Image Module and rewriting the previously monolithic DTD as multiple files following the logical module partitioning.
The rationale for the modularization was based on a view that one size of XHTML did not fit all client platforms. In the words of the specification itself: “This modularization provides a means for subsetting and extending XHTML, a feature needed for extending XHTML’s reach onto emerging platforms.” [M12N] The foremost “emerging platforms” were mobile phones, which were thought to be unable to host a browser for full HTML. The rationale for modularization implicitly assumes a walled garden-style world view of the owners of mobile phone networks where a client platform design can dictate a language subset used on the network. Such a view assumes a separate “Mobile Web”, because, quite obviously, the World Wide Web would still use full HTML or XHTML.
XHTML Basic. XHTML Basic, published in late 2000, defines a baseline for XHTML languages built on top of the Modularization. XHTML Basic is a subset of XHTML 1.0. The specification itself lists “mobile phones, PDAs, pagers, and settop boxes” as target devices. [XHTMLBasic]
XHTML 1.1. XHTML 1.1 [XHTML11], published in 2001, was the first (X)HTML specification since HTML 4 that introduced a new feature. XHTML 1.1 includes the XHTML modules that correspond to XHTML 1.0 Strict. Additionally, XHTML 1.1 includes the XHTML Ruby Annotation module [Ruby] for expressing a type of text annotation used in East Asia.
Microsoft’s Internet Explorer for Windows 5.0 (and later) supports a draft version of Ruby markup when used in text/html documents [RubyIE]. However, the most notable browsers that support application/xhtml+xml do not support Ruby. Therefore, XHTML 1.1 has failed to make a significant practical impact.
XHTML Mobile Profile. In 2001, WAP Forum, a consortium of mobile phone manufacturers, defined a superset of XHTML Basic called XHTML Mobile Profile [XHTML-MP]. The profile did not follow the prescribed XHTML module boundaries. The specification defined application/vnd.wap.xhtml+xml as the media type for XHTML Mobile Profile documents [XHTML-MP], but this media type has not been officially registered. The profile has not made a notable impact on the World Wide Web.
2.3 HTML5
The above review explains the context in which HTML5 was born.
The prior versions of HTML had officially been applications of SGML, but browsers were actually using special-purpose HTML parsers rather than SGML parsers. The SGML basis only gave guidance on what document tree was expected when a document was conforming. There was no realistic specification for parsing HTML when the input was erroneous (which it most often is). Browser vendors had to reverse engineer the behavior of the current market leader. This has caused interoperability problems.
Moreover, significant new features had not been introduced in years as the work had focused on reformulating the syntax as XML. Yet, documents purporting to use the reformulated XML syntax were still served as text/html, so browsers kept using the same special-purpose parsers as before. The usage of application/xhtml+xml had failed to take off.
There was demand for new features for HTML and demand for the recognition of the fact that text/html content was parsed neither as SGML nor as XML but had a syntax of its own.
2.3.1 The Mozilla/Opera Joint Position Paper
The balance of power in the W3C had shifted from traditional desktop browser vendors to various other interest groups such as makers of software for mobile walled gardens and developers of “rich client” technologies that could be deployed on intranets but that were not used by the general public on the Web. This had led to a situation where the focus was more on the “Semantic Web”, “Web Services” and “Mobile Web” than on what is usually considered “the Web”. As a result, the development of the Web itself had been neglected.
In June 2004, the W3C held a workshop on Web Applications and Compound Documents. The Mozilla Foundation and Opera Software, the two most active browser vendors in the W3C at the time, submitted a joint position paper noting the “rising threat of single-vendor solutions” and calling for seven principles to be followed in the design of Web Applications Technologies [JointPosition]. (At the time, Microsoft, a notable browser vendor itself, was pushing a single-vendor solution code named Avalon [MS-WebApps] and Apple was catching up having entered the market only recently.)
The first one of the seven principles in the Mozilla/Opera position paper was “Backwards compatibility, clear migration path” [JointPosition]. The transition from HTML 4 to XHTML 1.0 had not worked out smoothly as discussed earlier (page 10). In addition, XForms [XForms], the W3C’s successor for HTML forms, did not provide backwards compatibility or a clear migration path. Moreover, the HTML working group was working on XHTML 2.0 [XHTML20], which was designed to be incompatible with XHTML 1.x, even though the transition to XHTML 1.x served as application/xhtml+xml was not complete.
The position paper called for well-defined error handling, something that had never been addressed for HTML. The paper took a position in favor of graceful recovery (as in CSS [CSS2]) and against the Draconian error policy of XML.
The paper called for every feature to be backed by a practical use case and for the specification process to be open. This was in contrast with including features that are “nice to have” and making decisions on the W3C’s member-only mailing lists.
The paper took a position against device-specific profiles. This was in direct contrast with the Modularization of XHTML (page 11) [M12N] as well as mobile profiles of other W3C deliverables such as Scalable Vector Graphics (SVG [SVG]). The paper also took a position more favorable to scripting (JavaScript [JavaScript] in practice) than what has been the general line in the W3C.
The paper stated two design principles for compound documents (documents that mix different XML vocabularies): “Don’t overuse namespaces” and “Migration path”. The latter was related to the problems with the HTML to XHTML migration discussed above. The position paper was dismissive of schema languages.
The paper went on to list specific features that a Web application host environment should provide. It made several references to XBL, which has been a very politicized language (but is now on track to become a W3C Recommendation [XBL2]).
2.3.2 The WHATWG is Formed
The proposal presented by Opera Software and the Mozilla Foundation was not well received at the W3C. At the end of the second day of the workshop, a poll was held on the topic of the joint position paper: whether the W3C should develop extensions to HTML, CSS and the DOM as proposed. Of the 51 attendees of the workshop, 8 voted in favor of the motion and 11 voted against. When the motion was slightly reformulated, 14 voted against. [cdf-ws-minutes2]
Two days after the vote at the workshop, the Web Hypertext Application Technology Working Group (WHATWG) and its public mailing list were publicly announced. The group was described as “a loose, unofficial, and open collaboration of Web browser manufacturers and interested parties”. The stated intent was creating specifications for implementation in “mass-market Web browsers, in particular Safari, Mozilla, and Opera”. [WHAT-Ann]
The initial (invite-only) membership of the WHATWG consisted of individuals affiliated with Apple, Mozilla and Opera Software. (Ian Hickson, the editor of the WHATWG specifications, later moved to Google.) However, in the view of the Web held by the WHATWG, there is also a fourth mass-market browser: Microsoft’s Internet Explorer, the leader in market share. Microsoft has not been participating in the WHATWG despite having been invited. The publicly stated reason was that the WHATWG lacked a patent policy [Wilson]. Dean Edwards, an Internet Explorer expert not affiliated with Microsoft, joined the WHATWG later [NewMember].
Even though the group of WHATWG members is invite-only, anyone is allowed to join the WHATWG mailing list and contribute technically, which makes the process open. The editor acts as a benevolent dictator who writes the specifications taking into account the contributions. The WHATWG members “provide overall guidance” [WHAT-Charter], which means the power to impeach and replace the editor of the specifications.
Microsoft is not expected to implement the WHATWG specifications in Internet Explorer in the near term. Instead, the implementations of the WHATWG specifications for IE are expected to be built by teams not affiliated with Microsoft using the extensibility mechanisms provided by Microsoft in IE. [IEcompat]
I share the view of the Web that holds WebKit, Presto, Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox and IE, respectively) to be the most important browser engines.
2.3.3 The WHATWG Specifications
The WHATWG has two specifications in development and another two that are expected in the future. [WHAT-Charter]
The two specifications being developed are Web Forms 2.0 [WebForms2] and Web Applications 1.0 [WebApps]. Web Forms 2.0 is an update to HTML 4.01 forms. Web Applications 1.0 is a re-specification of HTML that both constrains and extends HTML. The language specified by Web Forms 2.0 and Web Applications 1.0 taken together is referred to as (X)HTML5. It is expected that Web Forms 2.0 will eventually be folded into the Web Applications 1.0 specification.
The two expected future specifications are Web Controls 1.0 for creating new widgets and CSS Rendering Object Model for defining programmatic access to the CSS rendering tree. [WHAT-Charter]
Web Forms 2.0. Web Forms 2.0 extends HTML forms with new features. The HTML forms as of HTML 4.01 are considered “Web Forms 1.0”. Web Forms 2.0 is not a standalone specification. Instead, it specifies updates to HTML 4.01 and the DOM. The choice of updates is based on what has been identified as common needs and what can be implemented as a script-based library for Internet Explorer [IEcompat].
The most obviously visible new features are new input field types. For example, there are new inputs for dates that can be implemented in browsers by popping up a platform-specific calendar widget. The new input types are backwards compatible in the sense that unknown input types degrade into text inputs in legacy browsers. Simple constraints on the values of the input field can be declared and checked by the browser without the form author having to resort to scripting. For complex restrictions, new integration points for scripts are provided.
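The degradation of unknown input types described above can be modelled with a minimal sketch in Python. It is an illustration of the fallback rule only, not code from any browser or from the Web Forms 2.0 specification; the function name effective_input_type and the constant LEGACY_INPUT_TYPES are made up here, and the set lists the input types defined in HTML 4.01.

```python
# Input types defined in HTML 4.01, i.e. the ones a legacy browser knows.
LEGACY_INPUT_TYPES = frozenset({
    "text", "password", "checkbox", "radio", "submit",
    "reset", "file", "hidden", "image", "button",
})


def effective_input_type(declared, supported=LEGACY_INPUT_TYPES):
    """Return the type a browser supporting `supported` would treat the
    field as: the declared type if it is known, otherwise a text input."""
    # A missing type attribute means a text input; the value is
    # compared case-insensitively.
    normalized = (declared or "text").strip().lower()
    return normalized if normalized in supported else "text"
```

Under this model, effective_input_type("date") is "text" for a legacy browser, so the form remains usable there, while a browser whose supported set includes "date" could show a calendar widget instead.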
In addition to the new input types, there is also a repetition model for adding and removing repeating sets of fields from the form without scripting. A new XML form submission format is introduced. The format can also be used for pre-loading values into the form fields.
Web