The Prototype ComparaGRID Ontology


Oct 21, 2013



1 The ComparaGRID Project
1.1 ComparaGRID – enabling GRID technologies for comparative genomics
1.2 Objectives of the ComparaGRID project
2 Ontologies
2.1 What is an Ontology?
2.2 What is OWL?
2.3 Tools for Ontology Development
3 The Design of the ComparaGRID Domain Ontology
3.1 The Role of the ComparaGRID Ontology
3.2 The Domain and Scope of the ComparaGRID Ontology
3.3 Modularising the Ontology
3.4 Designing the ‘Upper Ontology’
3.4.1 Concepts vs. Refinements
3.4.2 ActionRoles
3.4.3 Modularising Evidence
3.4.4 Names, IDs, Labels, Definitions and Metadata
4 The ComparaGRID Model of Genomic Mapping
4.1 The Major Abstract Mapping Concepts
4.2 Maps
4.3 Markers
4.4 Locations, Mappings and Positions
5 BiologicalEntities in the ComparaGRID Domain Model
6 ExperimentalEntities in the ComparaGRID Domain Model
7 TaxonomicEntities in the ComparaGRID Domain Model
8 Capturing Relationships between Concepts in the ComparaGRID Ontology
9 Glossary

1 The ComparaGRID Project

1.1 ComparaGRID – enabling GRID technologies for comparative genomics

Vast amounts of genomic mapping and sequence data are being deposited in
database resources across the Internet, from an ever-expanding range of
economically and scientifically important organisms. This data provides a huge
potential resource for information discovery. It could be particularly productive to
exploit these resources to assist in the genomic characterisation and mapping of
poorly characterised species using information from other species (either the well
characterised model organism species: Human, Mouse, Arabidopsis, Yeast etc.; or
by exploiting data from potentially more informative closely related species).

Integrating genomic data across species boundaries is critical to the successful
exploitation of previous research investment in this area. Systematic attempts at
data integration have to date been focussed on single species (e.g. annotating the
genome of one species using functional data from a second). Combining data
across species and across datasources by traditional ‘warehousing’ approaches
(according to some generalised schema) would fail to provide the flexibility to
explore the combined data according to multiple alternative views.

ComparaGRID seeks to develop a new GRID-based system to capture the details
of relationships between genomic data either within or across species in a way that
will enable complex ad-hoc queries to be run across distributed datasets. A
successful prototype system will demonstrate that the underlying raw data can be
combined and queried to draw maximum benefit from those data for all genomic
communities. The basic premise is to exploit the conservation of gene sequence
and gene order through evolutionary history, by identifying regions of conserved
synteny (orthologous gene placement) between the chromosomes of related
species. Hence genome-mapping data in one species should inform the genome
mapping of other species. However, this approach depends on the accurate
semantic interpretation and integration of data in disparate sources, across species.

1.2 Objectives of the ComparaGRID project

1. Domain Modelling

To define an ontology or controlled vocabulary for the domain of comparative
genomics, semantically describing the:

- Mapping Terminology and Relationships
(Markers, Linkage, Synteny etc.)
- Genetic and Molecular Concepts and Relationships
(Genes, Polymorphisms, Sequence Similarity etc.)
- Evolutionary Relationships
(Homology, Orthology etc.)
- Containment relationships
(part of etc.)
- Nomenclature Relationships
(Gene A in Species X = Gene B in Species Y)

2. System Architecture Design and Implementation

I. To develop drop-in data wrappers for primary and comparative data
sources (databases), across a number of animal, plant and microbial
species. These will represent/transport the mapping data in terms of the
defined ontology/terminology, facilitating integration of information.

II. To implement a Web/GRID middleware layer that will support
operations over the wrapped databases including the integration and
query of data by reference to the controlled vocabulary or ontology.

3. Functionality

I. To demonstrate practical applications based on those web services.

II. To address biologically relevant questions e.g. to assist in identifying
candidate genes underlying a Quantitative Trait Locus (QTL) in farm animals
or crop plant species.

III. To use existing comparative genomics knowledge to infer further
comparative observations and stimulate hypothesis-driven experiments
(for example, to predict gene presence and order on the chromosomes
of a poorly mapped species, using mapping data from other better
characterised species).

2 Ontologies
2.1 What is an Ontology?

In general terms an ontology is a formal specification of a domain of knowledge.
In the case of ComparaGRID this is the domain of comparative genomic mapping.
Informal ontologies may be little more than controlled vocabularies, which define
the labels and terms that can be used to describe and represent knowledge in that
domain. More formal ontologies capture more specifically and explicitly the
meaning or semantics of terms, and how they relate to each other.

While building a formal ontology does not necessarily look any different from
building a less formal one, there is normally some underlying mathematical or
logical representation. This allows you to constrain how terms may be used, so
that they are not used erroneously (e.g. a formal ontology might detect that
“bacterial protein localises in the nucleus” is biologically nonsensical, while
an informal one would not). It also allows automatic inferences to be drawn from
information that is not stated explicitly (e.g. from “protein localises in
nucleus” a formal ontology might infer that the protein must be eukaryotic).
Both of these features become more useful as the ontology grows.

A formal ontology is composed of defined concepts (or classes) – and defined
relationships (or properties) between defined concepts. Expressing these
definitions formally using subsumption (subtyping), axioms and restrictions (i.e.
rules and statements about how terms are defined and related) allows automated
logic reasoners to check that information represented in the ontology language is
‘consistent’ (i.e. does not break the constraints of the formalism) and to make
‘inferences’ about what other relationships between the concepts described must
be true as a consequence of the rules captured in the formalism.

Insert 1: Using Formal Ontologies for reasoning

Consider a Domain Ontology that defines a relationship that can be expressed
between two concepts, so we can represent knowledge such as:

Concept A is related to Concept B
& Concept B is related to Concept C

Can we deduce/reason anything about possible relationships between Concepts A
and C?

For example:

In species X

Gene A is syntenic with Gene B
& Gene B is syntenic with Gene C

If the ontology defines the relationship ‘is syntenic with’ to be Transitive
(i.e. it propagates) and Symmetric:

We can automatically deduce that the following must be true:

through transitivity: Gene A is syntenic with Gene C
through symmetry: Gene C is syntenic with Gene A
& Gene B is syntenic with Gene A
& Gene C is syntenic with Gene B
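
The deduction in Insert 1 can be reproduced with a few lines of code. The following is a minimal sketch in Python, not part of the ComparaGRID system: the function name `infer` and the gene identifiers are illustrative assumptions, and a real OWL reasoner works quite differently internally. The sketch simply applies the symmetry and transitivity rules repeatedly until no new statements appear.

```python
from itertools import product


def infer(asserted, symmetric=True, transitive=True):
    """Return the asserted pairs plus every pair implied by the chosen rules.

    Self-pairs such as (GeneA, GeneA) are left out for readability, although
    a relation that is both symmetric and transitive would also entail them.
    """
    facts = set(asserted)
    changed = True
    while changed:
        new = set()
        if symmetric:
            # (a, b) implies (b, a)
            new |= {(b, a) for a, b in facts}
        if transitive:
            # (a, b) and (b, d) imply (a, d)
            new |= {
                (a, d)
                for (a, b), (c, d) in product(facts, facts)
                if b == c and a != d
            }
        changed = not new <= facts
        facts |= new
    return facts


# The two statements asserted in Insert 1 (hypothetical identifiers).
asserted = {("GeneA", "GeneB"), ("GeneB", "GeneC")}
inferred = infer(asserted) - asserted

for left, right in sorted(inferred):
    print(f"{left} is syntenic with {right}")
```

Running the sketch prints the four statements listed in Insert 1 that were not asserted directly. Switching off either rule (for example `infer(asserted, symmetric=False)`) shows how much of the deduction depends on each property characteristic.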
2.2 What is OWL?

OWL (the Web Ontology Language) is a W3C Web standard for representing
ontologies, developed for the ‘Semantic Web’. OWL is designed for use by
applications that need to process the content of information instead of just
presenting information to humans.

OWL documents are represented using an XML mark-up language, where the
‘tags’ are used to structure and define the information content (semantics) of the
document. (By contrast HTML and XHTML are mark-up languages that allow
structuring of the information content of documents, mainly for presentation.)

“The Semantic Web is an extension of the current web in which information is
given well-defined meaning, better enabling computers and people to work in
cooperation” (Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web,
Scientific American, May 2001)

OWL has three increasingly expressive sub-languages: OWL Lite, OWL DL
(Description Logics), and OWL Full. The sub-languages differ in the language
features that each one permits. They are defined
such that a valid OWL Lite ontology is also a valid OWL DL and OWL Full
ontology. In practice, limiting the language features to OWL DL has the advantage
that we are able to use software reasoning (the inferencing and constraint
checking described earlier). OWL Full is currently more complex than reasoners
are able to cope with.

The construction of an OWL DL ontology provides vocabulary and formal
semantics allowing reasoning over the ontology, and over data that are instances
(or individuals) of concepts defined in the ontology. Computer programs can
reason over the data to test that it is consistent with the syntax and semantics of the
OWL specification, and to infer new relationships that must be true. In this fashion
inconsistent data can be exposed, and new information discovered automatically.

An OWL ontology is composed of Classes (or ‘concepts’) and Properties
(otherwise known as ‘relationships’, ‘slots’, ‘roles’, ‘attributes’). Classes are
organised in a ‘subsumption’ hierarchy, using the subclass relationship. This
relationship may variously be interpreted as ‘isA’, ‘isATypeOf’, ‘isASubclassOf’
BUT NOT ‘isAnInstanceOf’. For example, a <SportsCar> is a type of <Car>,
which is a type of <Vehicle>. However, <MySportsCar> is an instance of the
Class <SportsCar> (i.e. an individual of Type SportsCar); it is also an instance of
<Car> and <Vehicle> as a consequence of the subsumption hierarchy. OWL can
represent individuals such as <MySportsCar> as Individuals of a Class, which
must satisfy all the axioms and restrictions of that Class (and because OWL can be
reasoned over – individuals can be ‘classified’ on the basis of which axioms and
restrictions they satisfy).
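
The difference between the subclass relationship and the instance relationship can be illustrated with a small sketch. This is a minimal Python illustration using the vehicle example above, not an OWL API; the dictionary names `SUBCLASS_OF` and `INSTANCE_OF` and the helper functions are assumptions made for the example, and the hierarchy is restricted to a single parent per class.

```python
# Asserted subsumption hierarchy: each class maps to its direct superclass.
SUBCLASS_OF = {"SportsCar": "Car", "Car": "Vehicle"}

# Asserted type of each individual (kept separate from the class hierarchy).
INSTANCE_OF = {"MySportsCar": "SportsCar"}


def superclasses(cls):
    """Return every class that subsumes cls, nearest first."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def types_of(individual):
    """Return the asserted type plus the types implied by subsumption."""
    asserted = INSTANCE_OF[individual]
    return [asserted] + superclasses(asserted)


print(types_of("MySportsCar"))
```

The individual is recorded once, against its most specific class, and its membership of the more general classes follows from the hierarchy rather than being stated separately.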

OWL (Object) Properties are binary relations on individuals – i.e. they link two
individuals together (Datatype Properties may link an individual to a datatype
value, such as a number or text string). The Domain and Range of OWL Properties
can be defined on Classes, and Properties may also form a simple subsumption
hierarchy, so that a Property that is the subtype of another inherits its parent’s
domain and range, although these can be specialized or further restricted. Hence a
Property hasComponent might have the Domain <Vehicle> and the Range
<VehicleComponent>, whilst the more specialized property hasEngine might
have Domain <Vehicle> inherited from hasComponent, but Range <EngineType>,
where <EngineType> is a subclass of <VehicleComponent>. Properties can have
various characteristics: they may be ‘Symmetric’, ‘Transitive’ or ‘Functional’
(i.e. single valued); inverse pairs of Properties can also be defined (e.g.
hasComponent / isComponentOf).
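
The inheritance and specialisation of domain and range can be sketched as follows. This is an illustrative Python model, not an OWL library: the `Property` class, the `is_a` helper and the <SportsCar> class are assumptions made for the example. It also treats domain and range as checks on a statement, which is a simplification; an OWL reasoner would instead use them to infer the classes of the individuals involved.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical class hierarchy for the vehicle example.
SUBCLASS_OF = {"EngineType": "VehicleComponent", "SportsCar": "Vehicle"}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


@dataclass(frozen=True)
class Property:
    name: str
    domain: Optional[str] = None
    range: Optional[str] = None
    parent: Optional["Property"] = None

    def effective(self, attr):
        """Own value of 'domain' or 'range', else the nearest inherited one."""
        prop = self
        while prop is not None:
            value = getattr(prop, attr)
            if value is not None:
                return value
            prop = prop.parent
        return None

    def accepts(self, subject_cls, object_cls):
        """Check a statement's subject and object classes against domain/range."""
        return is_a(subject_cls, self.effective("domain")) and is_a(
            object_cls, self.effective("range")
        )


has_component = Property("hasComponent", domain="Vehicle", range="VehicleComponent")
# hasEngine inherits its domain and narrows the range.
has_engine = Property("hasEngine", range="EngineType", parent=has_component)

print(has_engine.effective("domain"))                     # inherited from hasComponent
print(has_engine.accepts("SportsCar", "EngineType"))
print(has_engine.accepts("SportsCar", "VehicleComponent"))
```

The last call is rejected because <VehicleComponent> is more general than the narrowed range <EngineType>, while the parent property hasComponent would accept the same pair of classes.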

Because OWL Properties are by their nature binary, they can only link two
individuals, and cannot be parameterised. It is possible, and more flexible, to
model Relationships as Classes. For example, we can define a Class <Homology>,
and allow <Homology> to satisfy the Domain of a Property relates with Range
<Gene>. In practice this allows a given instance of <Homology> to be related to
multiple <Gene> individuals, thus expressing an ‘n’-ary relationship. Similarly it
is possible to allow parameterized relationships if the relationship is represented as
a Class, which may have multiple allowed Properties (i.e. by satisfying the range
restriction of the Properties). An example of a parameterized relationship
represented as an OWL class with multiple properties could be:

<DNASequenceSimilarity> hasScore <number> hasMeasure <ComparisonAlgorithm>
hasUnit <UnitTypePartition> relates <DNASequence>

This can represent the relationship between two (or more) sequences,
parameterized with various score details. However, the actual details of any
recorded scores would be semantically opaque within OWL, as reasoning over
datatype values is poorly supported. Similarly, the logical consequences of
placing cardinality restrictions on properties are complex (e.g. asserting that
there must be 2 relates <DNASequence> properties in the above example).
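
The idea of representing a relationship as a class, so that it can link more than two individuals and carry parameters, can be sketched outside OWL. The following Python example is an illustration only: the field names mirror the properties named above, while the sequence identifiers, the algorithm name and the unit are hypothetical values. The minimum-cardinality check stands in for the kind of restriction discussed in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DNASequence:
    identifier: str


@dataclass(frozen=True)
class DNASequenceSimilarity:
    """A relationship modelled as a class so it can carry parameters."""

    relates: Tuple[DNASequence, ...]  # two or more related sequences
    has_score: float                  # opaque datatype value
    has_measure: str                  # comparison algorithm used
    has_unit: str                     # unit in which the score is expressed

    def __post_init__(self):
        # Stand-in for a minimum cardinality restriction on 'relates'.
        if len(self.relates) < 2:
            raise ValueError("a similarity must relate at least two sequences")


similarity = DNASequenceSimilarity(
    relates=(DNASequence("seq1"), DNASequence("seq2"), DNASequence("seq3")),
    has_score=87.5,
    has_measure="ExampleAlignmentAlgorithm",
    has_unit="percent identity",
)
print(len(similarity.relates), similarity.has_score)
```

A binary property could only have linked two of these sequences and could not have recorded the score, the algorithm or the unit alongside the link.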

2.3 Tools for Ontology Development

The creation and editing of large ontologies is facilitated by tools that provide a
graphical interface to assist the specification of the concepts and relationships in
the ontology. Protégé is an open-source development environment developed at
Stanford for ontologies and knowledge-based systems. The OWL Plug-in
developed at Stanford and Manchester is an extension of Protégé with support for
the Web Ontology Language (OWL).

Specifically, Protégé OWL, with various additional plug-in components, allows an
ontology to be built and tested in a user-friendly graphical fashion, and the
resultant ontology to be saved as a valid OWL-DL RDF/XML file. Protégé can
also connect to various external ‘reasoners’ (such as Racer and FaCT) that can
check the consistency of classes and individuals specified in the ontology, and
reason over these concepts and individuals to reclassify them according to the
ontological rules (i.e. infer new knowledge).

3 The Design of the ComparaGRID Domain Ontology
3.1 The Role of the ComparaGRID Ontology

We have built a prototype ontology expressed in OWL DL that seeks to capture
the concepts and relationships represented in the data found in genomic mapping
resources. The role of the ontology is to provide the semantic glue that will allow
data from one source to be meaningfully integrated with data from completely
different datasources. For example, mapping data represented in the Roslin ARK
database is stored in a bespoke database structure with implicit semantics (i.e. the
database schema). However, the objects represented in ARK (genes,
chromosomes, maps, positions etc.) are conceptually the same as those stored in
the human genome maps of ENSEMBL and NCBI, which have their own data
model and definitions. In order to relate objects from ARK to those in these other
data sources it is necessary to semantically define the objects, so that appropriate
concepts are compared. This can be achieved by mapping the data in each
datasource to a shared conceptual view – i.e. the ComparaGRID ontology. Hence
it should be possible to meaningfully compare gene maps from cattle and humans.

The mapping of multiple distributed datasources to a common shared conceptual
view is often achieved by mapping to a common transfer schema – where all the
data can be structured according to a defined global view. This allows data to be
queried, retrieved and exchanged through the common schema. However, such an
approach relies on the computer applications that seek to use the combined data
‘knowing’ and ‘interpreting’ the shared schema structure. The potential power of
using an OWL DL ontology to represent data for query and transfer is that the data
is intrinsically semantically defined. OWL expressions (i.e. Description Logics
expressions) can semantically query the data, and computer reasoner engines can
both check data for consistency (semantic truth according to the ontology) and
infer new relationships based on those assertions stored in the datasources.
Furthermore, an OWL based system should enable more flexibility and
evolvability than an XML schema based one.

Genomic data mapped to the OWL ontology is actually represented as an instance
of the ontology – allowing software developed to process OWL ontologies to
process the genomic data as an ontology. However, OWL-naïve applications will
also be able to use the data, using the OWL representation as an extremely well
defined transfer schema.

3.2 The Domain and Scope of the ComparaGRID Ontology

For a given application it is important to carefully consider the limits and extent of
the domain ontology, otherwise the detailed domain model can expand
exponentially to include concepts that are only peripherally important to the
domain in question. It is important therefore only to define the concepts and
relationships necessary to do the data manipulations that we require the system to
support. Furthermore, it is only necessary to define the concepts in terms relevant
to the application domain; exhaustive representation and definition of every
concept is neither feasible nor necessary. To some extent it is only necessary to
model ‘truth as we need to see it’, that is we only need to model the relationships
and concepts as they apply to our domain, without necessarily precisely or fully
representing the biological truth, i.e. certain subtleties may be irrelevant for our
purposes. However, if we wish the ComparaGRID domain model to be useful to
other applications and users, or we wish to integrate it with other domain models
(e.g. the Gene Ontology) we need to represent our domain according to a shared

The ARKdb database system provides comprehensive public
repositories for genome mapping data from farmed and other animal species.
The concepts necessary for comparative mapping include: Biological Entities
(Genes, Genomes, Chromosomes, Gene Products etc.), Evidential Entities
(Citations, Authorships, Experimental Techniques etc.), Mapping Entities (Maps,
Markers, Locations etc.), Experimental Entities (DNA molecules/sequences,
clones, libraries etc.), Bioinformatics Entities (representing database records of
protein and DNA sequences etc.), Taxonomic Entities (representing the
organisms being mapped). We will also represent concepts relevant to the system
architecture, such as the services provided, but these are not considered here.

Some of these concepts are defined in detail in the ontology; others are on the
periphery of its scope. For example we do not attempt to represent Phenotypes in
any qualitative way (phenotype and trait ontologies being complex problems being
tackled elsewhere). It will be useful for us to know that two objects are linked to
the same or similar phenotype – but we will not represent details of the phenotype
itself in the ontology at this stage. However, the Gene Ontology (GO) is a mature
and widely adopted representation of Gene Function, so we can choose to import
an OWL representation of GO into our ontology, or more simply reference GO
concepts by their GO ID, thus allowing us to leverage the concept definitions and
relationships in GO to represent Gene Function in ComparaGRID.

Other orthogonal ontologies have been created under the OBO umbrella, in the
wider domain of molecular genomics, which it may be useful to import or
reference in the ComparaGRID domain ontology. The Sequence (Annotation)
Ontology (SO) probably goes beyond the level of detail necessary for the initial
use cases planned for ComparaGRID, but is available if we need to describe
elements of sequence and gene structure in fine detail. On the other hand the
Evidence Code Ontology is useful as a direct plug-in that can be used to detail
the types of evidence that support mapping data and assertions. It will be possible
to extend the imported Evidence Code Ontology if we require more detailed and
relevant annotation for our domain.

Gene Ontology (GO): The Gene Ontology project provides a controlled vocabulary
to describe gene and gene product attributes in any organism.

OBO: Open Biomedical Ontologies is an umbrella web address for well-structured
controlled vocabularies for shared use across different biological and medical
domains.

Sequence Ontology (SO): aims to develop an ontology suitable for describing
biological sequences.

Evidence Code Ontology: A rich ontology for experimental and other evidence
statements.

3.3 Modularising the Ontology

As described above it is possible to import or reference separate ‘plug-in’
ontologies into a single ComparaGRID ontology (using XML namespaces). Using
this principle the ComparaGRID ontology has been designed as a set of separate
modules (Bioinformatics, Biology, Service, Relationship, Evidence, Structure
etc.). The primary module discussed in this document is the Biology Module.

However, it is worth noting that the process of ontology modularisation is
relatively untested and tool support for it (in Protégé) is in its infancy. This reflects
an underlying uncertainty about how modularisation is best achieved, and a lack of
standard design models. In this regard, ontological technology is quite a few years
behind most programming languages.

3.4 Designing the ‘Upper Ontology’

In an attempt to make a well-designed and maintainable ontology we are following
current best practice design patterns, according to the Manchester ‘school’. As we
develop, use and test our prototype ontology we may further develop or discard
these approaches as we determine their utility.
3.4.1 Concepts vs. Refinements
A number of design patterns have been proposed by which to organize the
structure of the ontology, with the aim of creating clean ‘untangled’ ontologies,
with related concepts grouped into discrete branches of the type hierarchy, and
with minimal asserted dual parentage of concepts. A fundamental distinction
between concepts in an ontology is between ‘Self-Standing Concepts’ (most
“things” in the physical and conceptual world) and ‘Refinements’ (value types and
values which partition conceptual spaces e.g. “small, medium, large”). The
ComparaGRID Ontology implements a modified version of this distinction where
Domain Objects are divided into Domain Concepts (all the physical and abstract
concepts that we are interested in representing) and Domain Concept Refinements
(to hold values, and value partitions, which might apply to Domain Concepts, and
some other concepts that do not appear to be central to our understanding of the
domain, but necessary to represent data) [See Figure 1]. Refinements currently
represented in the ontology are shown in Figures 2 and 3; if this separation is
successful it may be useful to move other partitions of value type into this

See for example “Patterns, Properties and Minimizing Commitment: Reconstruction of the
GALEN Upper Ontology in OWL”, Alan L Rector and Jeremy Rogers, Core Ontologies
Workshop (CORONT) in conjunction with the European Knowledge Acquisition Workshop
(EKAW-2004), Northampton, UK

3.4.2 ActionRoles
Another potentially useful separation of concepts in the ontology is the use of
ActionRoles. If the same concept fulfils several different roles – it appears
reasonable to place it in several different places in the type hierarchy. For example
a characterized DNA sequence can perform a variety of roles: it might be a
hybridization probe in some experiments, it might be the sequence of a gene, or it
might encode a protein product etc. This approach has been followed in many
previous ontologies, e.g. in GO and the Evidence Code Ontology concepts can have
multiple parents. Avoiding this pattern should prevent the ‘tangling’ of the
ontology and assist maintainability and modularisation. ActionRoles give the
possibility of separating classification on the basis of role from the subsumption
hierarchy, thus maintaining discrete axes or skeletons of type hierarchy. For
example, if a concept can participate in the property <DomainConcept>
hasActionRole <ActionRole>, and if there is a hierarchy of specific ActionRoles,
we can specify different Roles for a Concept in different instance individuals (e.g.
<Oligonucleotide> hasActionRole <HybridizationProbeRole> and
<Oligonucleotide> hasActionRole <PCRPrimerRole>; thus <Oligonucleotide> only
occurs in the type hierarchy once, as a subtype of <CharacterizedDNASequence>).
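
The ActionRole pattern can be sketched in a few lines. This is an illustrative Python model, not the ComparaGRID implementation: the `Individual` class, the individual names `oligo_1` and `oligo_2`, and the helper `with_role` are assumptions made for the example. The point it shows is that the concept has exactly one parent in the type hierarchy, while roles are attached to individuals through a separate property.

```python
from dataclasses import dataclass, field

# One parent per concept: roles are kept out of the type hierarchy.
SUBCLASS_OF = {
    "Oligonucleotide": "CharacterizedDNASequence",
    "HybridizationProbeRole": "ActionRole",
    "PCRPrimerRole": "ActionRole",
}


@dataclass
class Individual:
    name: str
    concept: str                                    # single asserted type
    action_roles: set = field(default_factory=set)  # hasActionRole values


individuals = [
    Individual("oligo_1", "Oligonucleotide", {"HybridizationProbeRole"}),
    Individual("oligo_2", "Oligonucleotide", {"PCRPrimerRole"}),
]


def with_role(items, role):
    """Names of the individuals that play the given ActionRole."""
    return [item.name for item in items if role in item.action_roles]


print(with_role(individuals, "PCRPrimerRole"))
print(SUBCLASS_OF["Oligonucleotide"])
```

Classification by role is still available as a query over the hasActionRole values, but it no longer requires <Oligonucleotide> to appear under several parents.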

3.4.3 Modularising Evidence
As described above (§) we can choose to break up our ontology into separate
modules, which can be saved as separate OWL files and imported into an
integrated ontology using XML namespace inclusion. Currently we are
considering modularising evidential concepts into a file ‘evidence.owl’ using the
import namespace ‘evd:’, but have temporarily retained the EvidentialEntity
Hierarchy in bio.owl [see Figure 5]. Note that Person, Citation and Institution are
all EvidentialEntities, but that they could be assigned different ActionRoles under
different circumstances (§). For example an individual Person might be
assigned either an ExperimenterRole or an AuthorRole, or an Institution might be
assigned these Roles or a Publisher or DataProviderRole.

The Evidence Code Ontology module (refactored from OBO, see section §) has
been imported using the obo: namespace in the evd: namespace. These ‘Codes’
describe the types of evidence qualitatively, and could form the basis for deciding
on the reliability of data according to code type. Note that OBO has used multiple
inheritance in its EvidenceCode type hierarchy, leading to a tangled ontology (e.g.
‘inferred_from_similarity’ has two parents).

3.4.4 Names, IDs, Labels, Definitions and Metadata
By default the Protégé OWL editor identifies concepts using a text label (in the
syntax of OWL RDF this is the rdf:ID attribute; this rdf:ID attribute is actually
used, qualified by the XML namespace, when cross-referencing between concepts).
Using a text string as the ID provides readable visualisation of the named
concepts in the display interface, but does restrict the characters that can be
used in ID names to alphabetic and numeric characters, with no spaces. The actual
proper spelling and format of the concept name can be saved more accurately in
the rdfs:label attribute, which could also be used to store multiple aliases or
language variants. Other ontologies use a numerical identifier for concepts,
which provides poor readability unless the Protégé display interface is changed.
For example, in order that the imported EvidenceCode ontology displays in a
readable manner, the numerical Evidence Code rdf:IDs have been replaced with a
text rendering of the rdfs:label attribute. GO also uses numerical identifiers.
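
The relationship between a free-text label and a restricted ID can be sketched as a simple conversion. This is an illustration in Python, not the procedure actually used for the ComparaGRID ontology: the function name `label_to_id` and the choice of a capitalised, run-together naming style are assumptions. The original label would be kept alongside the ID (for example in rdfs:label), since the conversion discards spacing and punctuation.

```python
import re


def label_to_id(label):
    """Build an ID from a label using only letters and digits, no spaces."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    # Capitalise the first character of each word and keep the rest unchanged.
    return "".join(word[:1].upper() + word[1:] for word in words)


for label in ["inferred from similarity", "inferred_from_similarity", "PCR primer role"]:
    print(f"{label!r} -> {label_to_id(label)}")
```

Note that two differently formatted labels can produce the same ID, which is one reason to keep the label as the authoritative human-readable name.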

ComparaGRID aims to provide a clear unambiguous textual definition of all the
concept terms in the ontology, i.e. each class and property should have a definition
which is currently stored as an rdfs:comment. These definitions have been gleaned
from online dictionary and glossary resources, or provided by ComparaGRID.
These definitions are not complete and might benefit from community input.
Information (metadata) about electronic resources is often provided using a set of
fields known as the Dublin Core metadata set. ComparaGRID has imported an
OWL representation of Dublin Core with the namespace ‘dc:’ which allows these
fields to be used as annotation properties equivalent to the rdfs:label and rdf:ID for
the description of classes. Available fields to describe any class include the
repeatable fields dc:creator, dc:source, dc:publisher, dc:description, dc:date
and dc:contributor. It might be more appropriate to store metadata and
definitions in these Dublin Core fields.

Figure 1: The Separation of Domain Objects into Concepts and Refinements
Note: ‘owl:’ and ‘bio:’ are namespace qualifiers and reflect that the concepts are
defined in the OWL or BIO ontology modules.

The Dublin Core Metadata Initiative is an open forum engaged in the
development of interoperable online metadata standards that support a broad range of purposes and
business models.

Figure 2: Domain Concept Refinements (Detail 1)

Figure 3: Domain Concept Refinements (Detail 2: Value Partitions)

Figure 4: Action Roles created to prevent ‘tangling’ of the ontology

Figure 5: Modularisation of Evidence
4 The ComparaGRID Model of Genomic Mapping

In order to provide an inclusive and extensible model of genomic mapping we
have derived a highly abstracted model of mapping, which allows us to represent
all different types of mapping data with precise semantics. The model is
necessarily complex so that we avoid misrepresentation of observational data
through over-simplification.
4.1 The Major Abstract Mapping Concepts

Figure 6: The Major Abstract Mapping Concepts

Map: The abstract (typically linear) representation of an informational
macromolecule or chromosome etc., allowing the positioning of identifiable
markers along the length of the map. An ordering of markers on a chromosome. A
physical, linkage or RH map of a chromosome, with absolute positions and
distances for physical maps, and distances based on recombination or break point
frequency for linkage and RH maps.

Marker: The experimentally observable entity, phenotype or protocol that allows
the inheritance and mapping of genetic entities to be assayed. A marker is the
label or identifier for some assayable feature that can be positioned on a Map.
The Marker can be considered a proxy for the actual underlying feature
(DomainConcept) being mapped.

Location: The relationship between a Position (ordinal or co-ordinate) and a Map;
i.e. a position/site on a map.

Position: The coordinates of a Location on a map. Includes Coordinate and purely
Ordinal Positions, and may be associated with a ScaleUnit.

Mapping: The association between a Marker (or set of Markers) and a Location.


Figure 7: The ComparaGRID Model of Genomic Mapping. OWL Classes are shown in
boxes, and properties are labelled arrows. The hasValue property is a datatype
property recording a floating-point number; it has subtype properties
hasStartValue, hasEndValue and hasMidPointValue for greater specificity. The
marker Association property can relate to any DomainConcept – although
semantically redundant it may be useful to subtype this relationship according to
the type of object linked (Phenotype, Gene etc.).
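
The five abstract mapping concepts and the way they connect can be sketched as plain data structures. The following Python example is a simplified illustration of the model, not the ontology itself: the class fields, the helper `markers_on` and all map, marker and gene names are hypothetical, and a single `value` field stands in for the hasValue family of properties.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Map:
    name: str
    map_of: str                        # the BiologicalEntity being mapped
    scale_unit: Optional[str] = None


@dataclass(frozen=True)
class Position:
    value: float                       # stands in for the hasValue property
    scale_unit: Optional[str] = None


@dataclass(frozen=True)
class Location:
    on_map: Map                        # a Location ties a Position to a Map
    position: Position


@dataclass(frozen=True)
class Marker:
    name: str
    associated_with: Optional[str] = None  # underlying DomainConcept, if known


@dataclass(frozen=True)
class Mapping:
    markers: Tuple[Marker, ...]        # a Marker or set of Markers
    location: Location


def markers_on(target, mappings):
    """Names of all Markers reachable from a Map via Mapping and Location."""
    return sorted(
        {m.name for mp in mappings if mp.location.on_map == target for m in mp.markers}
    )


linkage = Map("linkage_map_1", map_of="Chromosome1", scale_unit="cM")
rh = Map("rh_map_1", map_of="Chromosome1", scale_unit="cR")
mappings = [
    Mapping((Marker("M1", "GeneA"),), Location(linkage, Position(12.5, "cM"))),
    Mapping((Marker("M2"), Marker("M3")), Location(linkage, Position(40.0, "cM"))),
    Mapping((Marker("M4"),), Location(rh, Position(3.0, "cR"))),
]
print(markers_on(linkage, mappings))
```

Keeping Location and Mapping as separate objects is what allows the same Marker to be placed on several Maps, and several Markers to share one Location, without changing either the Marker or the Map.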

4.2 Maps

Figure 8 shows the Map type hierarchy in the ComparaGRID ontology. Each Map
is the mapOf some BiologicalEntity that can be classified as
RepresentableByAMap, and will have some ScaleUnits.

Maps can be classified according to their abstract type, which is determined by the
type of ScaleUnits used: ScaleMaps have ScaleUnits, OrdinalMaps have Ordinal
(or Scale) Units whilst UnorderedMaps have neither ScaleUnits nor OrdinalUnits.
An alternative classification of Maps reflects their experimental derivation; there
are three main types of Experimental Map: Physical, Genetic and Cytogenetic. We
may also need to represent Integrated maps (that have mixtures of physical,
genetic or cytogenetic markers) and comparative maps that integrate data from
more than one species.
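
The abstract classification above can be sketched as a small rule, assuming a map's units are simply tagged as scale or ordinal. The names UnitKind and abstract_map_types are illustrative and do not appear in the ontology; the sketch also assumes that a map with ScaleUnits counts as an OrdinalMap, following the '(or Scale)' wording.

```python
from enum import Enum


class UnitKind(Enum):
    """Hypothetical tags for the kinds of unit a Map may carry."""
    SCALE = "ScaleUnit"      # e.g. base pairs or centimorgans
    ORDINAL = "OrdinalUnit"  # rank order only


def abstract_map_types(units):
    """Return the abstract Map types implied by the units a map uses."""
    units = set(units)
    types = set()
    if UnitKind.SCALE in units:
        types.add("ScaleMap")
    # OrdinalMaps have Ordinal (or Scale) Units.
    if units & {UnitKind.SCALE, UnitKind.ORDINAL}:
        types.add("OrdinalMap")
    # UnorderedMaps have neither kind of unit.
    if not types:
        types.add("UnorderedMap")
    return types
```

In the ontology itself this classification would be left to a reasoner working from the asserted unit properties; the function only mirrors that logic.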

Maps can be thought of as sets of all the Locations that are on(that)Map. Each
Location has a Position on a Map. Mappings associate a Marker or set of Markers
with a Location and each of these Locations consists of one or more Markers and a
Positioning. Therefore a Map can also be considered to be composed of the set of
all the Markers that lie on that Map through the Mapping and Location
Relationships. Markers can be associated with various genetic, phenotypic or
cytogenetic entities that underlie the observable marker inheritance; therefore the
map is also composed of this set of entities. Because Maps are the mapOf some
Biological Entity, these sets of marker and genetic entities etc. can also be related
as components of the Biological Entity that is being mapped (e.g. a specific

In our model, actual DNA (and protein) sequences (i.e. the base sequence
represented as a string of nucleotides ‘ACGT…’) are represented as Physical
maps. Nucleotides play the Role of Markers in this model, and are Mapped to a
Location that has a Position on the Map. This extremely abstract representation of
sequences allows us to integrate and compare maps that are based on DNA
sequence with those based on non-physical measures, such as genetic
recombination frequency. This integration will be necessary to meaningfully
compare, for example, a chicken genetic map with the human genome map.
However, we need to consider whether this highly abstract representation of
sequences is overly complex for making sequence-to-sequence relationships, which
might be captured more directly. Consider an assertion such as ‘Sequence1 is a
subsequence of Sequence2, from bases x to y’. To represent this in our abstract
model would require Sequence1 to play a Marker role (to be associated with a
Marker), and be Mapped to a Location on a Map of Sequence2, with Position
(x,y). Clearly a simpler and more direct additional representation may be desirable
for sequence-to-sequence mapping.
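
To make the cost of the abstraction concrete, the following sketch expands a direct subsequence assertion into the Marker, Mapping, Location and Position chain described above. Subsequence, to_abstract_model, the dictionary keys 'associatedWith' and 'map', and the 'MapOf(...)' label are all hypothetical; only hasStartValue and hasEndValue are property names taken from the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subsequence:
    """Direct form: `part` occupies bases start..end of `whole`."""
    part: str
    whole: str
    start: int
    end: int


def to_abstract_model(rel):
    """Expand the direct assertion into the abstract mapping chain."""
    position = {
        "type": "IntervalPosition",
        "hasStartValue": float(rel.start),
        "hasEndValue": float(rel.end),
    }
    location = {
        "type": "Location",
        "map": f"MapOf({rel.whole})",  # hypothetical label for the Map
        "position": position,
    }
    marker = {"type": "Marker", "associatedWith": rel.part}
    return {"type": "Mapping", "markers": [marker], "location": location}


expanded = to_abstract_model(Subsequence("Sequence1", "Sequence2", 10, 50))
```

One four-field assertion becomes four linked objects, which is the overhead the paragraph above questions.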

Figure 8: The Map Type Hierarchy
4.3 Markers

Markers are a completely abstract notion that is incorporated into our domain
model to represent an ‘experimental observation’ that is Mapped to some
Location. A Marker is some experimentally observable entity, phenotype or
protocol that allows the inheritance and mapping of genetic entities to be assayed.
The marker is not a gene, polymorphism etc. itself – but the notion of the
observable phenomenon – that may be linked to an underlying gene, sequence etc.
This level of abstraction provides for extra accuracy and reliability in recording
data, but often data sources record the placement of the underlying feature directly
on a map, so that the marker abstraction is lost.

For example, a given Microsatellite-Marker will have various experimental
parameters and evidence, and may be linked to a known sequence (e.g. an STS),
which may in turn be linked to a known gene sequence. However, the Marker is neither
the STS nor the Gene itself. Legacy data often records a gene name as the
marker – but this is an oversimplification and loses valuable detail and
information. Our model will be able to exploit 'better' data where the intermediate
links between the marker and a defined sequence or gene are explicit, but can still
represent the less specific, lower-quality legacy data.

The model does not directly Map an STS, Gene, Phenotype etc. because this
oversimplifies the reality that a Mapping is an indirect observation (until you reach
the level of direct DNA sequence placement on a DNA Sequence Map, as in the
human genome sequence). Furthermore, without the Marker abstraction we cannot
distinguish contradictory mappings. Consider a mapping of some STS-based
Marker to a Location in two separate experiments, which give different results. If
we directly associate the STS or Gene name with the Location it becomes harder
to resolve the discrepancy. If, for example, it later became apparent that
the STS mapping technique or reagents were deficient in one of the datasets,
then having represented this mapping directly we would have no way of
distinguishing the erroneous placement. An example of this is given in Figure 9.
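
The argument can be illustrated with a small sketch in which each placement is tied to a Marker that records where the observation came from. The field names, the experiment labels 'lab1' and 'lab2', the map name and the positions are invented for illustration and are not taken from any dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    name: str        # label of the assayed observation
    feature: str     # underlying feature, e.g. an STS or gene name
    experiment: str  # source of the observation (hypothetical field)


@dataclass(frozen=True)
class Mapping:
    marker: Marker
    map_name: str
    position: float


def placements(mappings, feature, distrusted=()):
    """Positions reported for `feature`, skipping distrusted experiments."""
    return sorted(
        m.position
        for m in mappings
        if m.marker.feature == feature
        and m.marker.experiment not in distrusted
    )


# Two contradictory placements of the same underlying STS.
mappings = [
    Mapping(Marker("x", "STS_B", "lab1"), "chr1_linkage", 12.5),
    Mapping(Marker("y", "STS_B", "lab2"), "chr1_linkage", 40.0),
]
```

Because the Marker sits between the feature and the Location, discarding one experiment removes only its placement. If the STS name were attached to the Location directly, the two positions could not be told apart.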

4.4 Locations, Mappings and Positions
Positions can be Coordinate or Ordinal (i.e. where position is just given as an
order). The Scale Units are a property of the Map. It may be appropriate to
represent Ordinal Positions as Coordinates with no Scale Units. Coordinate
Positions might be of Interval or Point type. The value of the Position is a
floating-point number, and might be a StartValue or an EndValue, or possibly a
MidPointValue.

Locations are simply the binary relations between a Position and a Map, which we
have chosen to represent as an OWL Class (therefore allowing parameterization
using other properties – such as hasEvidence).

Mappings are the relations between a Location and one or more Markers (for
cases where a group of markers map indistinguishably at the same Location.)
Again this OWL Class representation of a relationship can be parameterized by
evidence, authorship, provenance details etc.
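
The value of representing these relations as classes can be sketched with plain data classes: once a Location or Mapping is an object in its own right, extra properties such as evidence can be attached to it. This is a minimal illustration, not the OWL encoding; the field names and example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Position:
    start: float                 # corresponds to hasStartValue
    end: Optional[float] = None  # corresponds to hasEndValue; None for a point

    @property
    def is_interval(self):
        return self.end is not None


@dataclass
class Location:
    """Reified (Position, Map) pair, so that evidence can be attached."""
    position: Position
    map_name: str
    evidence: list = field(default_factory=list)  # corresponds to hasEvidence


@dataclass
class Mapping:
    """Reified (Markers, Location) pair; several markers may share a Location."""
    markers: list
    location: Location
    evidence: list = field(default_factory=list)


loc = Location(Position(10.0, 12.0), "chr1_physical", evidence=["citation:example"])
co_located = Mapping(["markerA", "markerB"], loc)
```

A plain binary property between a Position and a Map would have nowhere to hold the evidence list.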

Figure 9a: The complex ComparaGRID model for representing the mapping of
markers on maps. The marker is associated with Sequences and Genes: therefore
the two mappings of markers x and y are readily distinguishable. Location =
(position+map); Mapping = (marker+location); markers are associated with /derived
from sequences and genes.

Figure 9b: A simple representation of Mapping data, which fails to capture the
details of experimental observation, and the semantics of the link from marker to
gene. Representing a gene directly as a marker loses the distinction between two
possibly contradictory mappings for gene B.

5 BiologicalEntities in the ComparaGRID Domain Model

BiologicalEntities represent the biological objects that fall within our domain of
interest. Many of these might ultimately be identified in ComparaGRID data by
reference to a proxy BioinformaticsEntity (i.e. by reference to a unique identifier
for the object as stored by GenBank, SwissProt etc.). Currently these bioinformatics
identifiers can be represented by the class BioinformaticsEntity, however, it is
ultimately desirable to separate this representation from the main ontology module
(bio.owl) to a separate ontology module (Bioinfomatics.owl) that could be
separately maintained and updated.

• Anything that can have a map drawn of it: in our domain anything with DNA or
RNA comprising the genetic
• Direct and indirect products of gene transcription: i.e. transcribed and
processed RNA, and translated proteins.
• Various forms of variation and polymorphisms that might be assayed as Markers
in Mapping Studies
• Containers for concepts that are somehow related

Figure 10: Biological Entities

The Biological Entities have been classified into a number of separate trees as
shown in Figure 10: Heritable Entities (such as structures, genes and phenotypes,
see Figure 11); Gene Products (RNAs and Proteins, see Figure 12); Genetic
Variation (DNA and AminoAcid Polymorphisms etc., see Figure 13); and
Biological Groups (Gene and Protein Families etc., see Figure 14). In the ontology
we have represented a ‘Defined Class’ called ‘RepresentableByAMap’. In our
domain we have asserted that anything that is composed of a Biological
Macromolecule can be represented by a Map, where in our domain a Biological
Macromolecule is limited to include Nucleic Acids and Polypeptides (see §). In
practice this asserts that Genes, Chromosomes, DNA molecules, RNA molecules,
Proteins etc. can all have maps drawn of them (and in fact can themselves be used
as markers on larger scale maps).
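
The membership rule for the defined class can be sketched as a simple test over composition assertions. The COMPOSED_OF table below is a hypothetical stand-in for what the ontology asserts; in practice a reasoner would classify entities under RepresentableByAMap from the class definition.

```python
# Biological Macromolecules, as limited in this domain.
MACROMOLECULES = {"NucleicAcid", "Polypeptide"}

# Hypothetical composition assertions for a few entity types.
COMPOSED_OF = {
    "Gene": {"NucleicAcid"},
    "Chromosome": {"NucleicAcid"},
    "Protein": {"Polypeptide"},
    "Phenotype": set(),
}


def representable_by_a_map(entity):
    """True if the entity is composed of some Biological Macromolecule."""
    return bool(COMPOSED_OF.get(entity, set()) & MACROMOLECULES)
```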

Figure 11: Partially expanded hierarchy of Heritable Entities, a disparate group of
things that can be inherited (and will therefore be assayable as markers in Mapping
Studies). Genes may be subtyped into Candidate Genes, Predicted Genes,
Pseudogenes etc. Gene Features include Enhancers, Promoters etc. Various Types
of Nuclear Chromosomes include Sex Chromosomes, Autosomes etc.; Mobile
Genetic Elements include Plasmids, Viruses, Transposons; Functional
Chromosome Regions include Active and Inactive Regions and Cytogenetic
Regions include Telomeres, Centromeres etc.

Figure 12: Partially expanded hierarchy of Gene Products. Transcription RNAs
include tRNA and rRNA. Small RNAs include sn, sno, mi, si RNAs etc.

Figure 13: Partially expanded hierarchy of the molecular basis of Genetic Variation.
Again these features may be assayed to reveal inheritance patterns in Mapping
Studies. The various forms of DNA mutations and Chromosomal rearrangements
are classified here.

Figure 14: Ad hoc containers for concepts that are somehow related - but not
necessarily by simple ontological properties (which automatically form logical sets
by shared property/definition).
6 ExperimentalEntities in the ComparaGRID Domain Model

Physical, as opposed to Biological, Entities are classified as Experimental Entities
in the ComparaGRID ontology (see Figure 15). This includes the representation of
cloned and characterized DNA “Sequences”. Because DNAClones and
CharacterizedDNASequences are composed of DNA, they are also ontologically
classified as ‘RepresentableByAMap’ (see §). More specifically these DNA
Molecules can be represented by a DNASequenceMap, which represents the actual
base sequence of the molecule. Note that DNARegions are types of
CharacterizedDNASequence, but are also partOf a CharacterizedDNASequence.
The partOf relationship is transitive.
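
What transitivity buys can be shown with a small closure computation over asserted partOf pairs. The entity names are invented for illustration, and an OWL reasoner would derive the same pairs from the transitive property declaration.

```python
def transitive_closure(part_of):
    """All (part, whole) pairs implied by treating partOf as transitive."""
    closure = set(part_of)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical assertions: a region within a region within a sequence.
asserted = {("Exon1", "GeneRegion"), ("GeneRegion", "Contig7")}
inferred = transitive_closure(asserted)
```

Here the pair (Exon1, Contig7) is inferred although it was never asserted.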

Figure 15: Partially expanded hierarchy of Experimental Entities. These represent
the physical-chemical entities of our domain. Entities that are composed of DNA are
by definition ‘RepresentableByAMap’.

7 TaxonomicEntities in the ComparaGRID Domain Model

Any biological data refers to some taxonomic group that the experimental data is
derived from. Generally genomic data refers to some species, subspecies, strain or
population. Unfortunately there is no standardised, uniform ‘list’ of “true” species;
nor a single unambiguous taxonomic classification of the species into higher
groups: genus, family etc. Taxonomy, particularly in some clades, is somewhat
contentious and unresolved, and there is frequent disagreement about the relative
placement of species and subspecies. ComparaGRID may need to represent and
integrate mapping or sequence data from any clade, so it has to be able to cope
with these unpleasant taxonomic realities.

ComparaGRID needs to capture the taxonomic identification of the source of
genomic data so that it can integrate mapping information from ‘the same species’.
ComparaGRID would also benefit from ‘understanding’ taxonomic relations, i.e.
from ‘knowing’ that a subspecies is included in a species, so that sibling
subspecies are necessarily closely related, or that members of the same genus are
more related than members of separate genera. However, dealing with biological
data that is poorly marked up, and in an unknown taxonomic context, is
problematic: often all that is included with a dataset is an (incomplete) taxonomic
name (or even a common name).

ComparaGRID should be able to represent the taxonomic information as it is
recorded, but also try and resolve some simple taxonomic ambiguities. For many
of the common model organisms the ambiguities are limited, and much data might
be ‘assumed’ to be marked up according to NCBI’s taxonomic indexing system.
However, even with this assumption there are still discrepancies in the use of
alternative names, and in representing data at the subspecies or species level.

The prototype ontology (Figure 16) includes representations of Scientific Names
(Taxon Name), Ranks and Kingdoms. Where the taxonomic context is known,
this can be captured using the representation of TaxonConcepts, which have a
TaxonName (e.g. NCBI’s Taxon Concepts, Figure 16). The ontology provides
pre-canned TaxonNames and ‘ComparaGRID’ TaxonConcepts for model organisms
etc.; the taxonomic context for these model organisms might be asserted to be that
of NCBI, and this information materialized in the ontology. However, it will not
be possible to represent the taxonomic classification of ad hoc species that appear
in some data or result set, other than by name alone. The taxonomic classification
of such names would need to be explored by reference to an external service such
as NCBI Taxonomy. An automatic consequence of the rules applied for taxonomic
nomenclature is that the genus classification of a species and subspecies is
recorded as the first part of the binomial (and the species identity is recorded in a
subspecies name) so that this amount of taxonomic classification is available even
from a bare name, and could be of significant value for many comparative
mapping tasks (there will be unavoidable discrepancies due to differences in the
names applied).
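
The classification that is recoverable from a bare name can be sketched as follows. The function names are illustrative, and the parsing assumes a clean binomial or trinomial separated by whitespace; names carrying authorities, 'sp.' placeholders or hybrid markers would need more care.

```python
def parse_bare_name(name):
    """Split a bare scientific name into genus, species and subspecies parts."""
    parts = name.split()
    if not parts:
        raise ValueError("empty taxon name")
    genus = parts[0]
    species = " ".join(parts[:2]) if len(parts) >= 2 else None
    subspecies = " ".join(parts[:3]) if len(parts) >= 3 else None
    return {"genus": genus, "species": species, "subspecies": subspecies}


def same_genus(a, b):
    """Compare two bare names by the first part of the binomial."""
    return parse_bare_name(a)["genus"].lower() == parse_bare_name(b)["genus"].lower()
```

For example, same_genus("Bos taurus", "Bos indicus") holds without consulting any external taxonomy, whereas resolving synonyms would still require a service such as NCBI Taxonomy.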

Currently the ontological representation of TaxonomicEntities is part of the main
bio.owl ontology. This probably should be modularised to a separate Unit, for ease
of maintenance. Ideally the provision of external third-party Name and Taxonomy
resolution services will supersede the need to provide our own representation of
taxonomy. In an ideal world taxonomic resolution services will provide GUIDs for
names and concepts, and ComparaGRID or some external service will be able to
map names used in genomic datasets to these GUIDs.

Figure 16: The representation of Taxonomy in ComparaGRID. Taxon Names are
used to represent the bare names that identify species etc., (and have Rank and
Kingdom). The ontology can include names used for the common and model
species found in genomic data, and new classes can be created from the new
names present in the data sets. Taxonomic Classifications are represented for
Taxon Concepts (according to some published Taxonomy: NCBI, ITIS, etc.).

Taxonomic Hierarchies can be represented using ontological properties. A
TaxonConcept can have a parentTaxon, for example the NCBITaxonConcept
“Gallus_gallus_NCBI” has a TaxonomicIdentification as the TaxonName
“Gallus_gallus” (which hasName value “Gallus gallus”, rank “Species”, synonym
“Gallus domesticus”, kingdom “Animalia”) and hasParentTaxon “GallusNCBI”,
which has TaxonomicIdentification as the TaxonName “Gallus” (which hasName
value “Gallus”, rank “Genus”, kingdom “Animalia”). Figure 17 illustrates how we
can represent the ComparaGRID concepts for the model organism “Cattle” and its
relationships to the sub or sib species Domestic Cattle and Zebu, and to the NCBI
taxonomy which stores the relations that each of these is a child of the
ParentTaxon Genus: Bos. The taxonomic ‘problem’ that ComparaGRID has to
deal with is that some biologists wish to treat Zebu as a separate species, some as a
sibling subspecies and some wish to ignore the existence of Zebu and treat all data
as if it is from a discrete single species.

Figure 17: Representation of Model Organisms. The model organism “Cattle” is
divided into two sub concepts Zebu and DomesticCattle; however, not all data
sources distinguish these concepts, and some data may treat the concepts as
separate species while other data consider them as sibling subspecies. Furthermore
this may not be made explicit in the data.

Identifying Model Organisms

bio:Cattle hasTaxonomicIdentification bio:Bos_taurus (hasName “Bos taurus”).

bio:Aurochs hasTaxonomicIdentification bio:Bos_primigenius (hasName “Bos

bio:DomesticCattle hasTaxonomicIdentification bio:Bos_taurus (hasName “Bos
taurus”) hasTaxonomicIdentification bio:Bos_taurus_taurus (hasName “Bos taurus
taurus”). [Captures the fact that there are two names for this concept]

bio:Zebu hasTaxonomicIdentification bio:Bos_indicus (hasName “Bos indicus”)
hasTaxonomicIdentification bio:Bos_indicus_indicus (hasName “Bos indicus

Including NCBI Taxon Concepts in the ComparaGRID model organism concepts

bio:Bos_primigeniusNCBI isA bio:Aurochs
bio:Bos_taurusNCBI isA bio:DomesticCattle
bio:Bos_indicusNCBI isA bio:Zebu

Representing the NCBI taxonomy

bio:Bos_primigeniusNCBI hasTaxonomicParent bio:BosNCBI
bio:Bos_taurusNCBI hasTaxonomicParent bio:BosNCBI
bio:Bos_indicusNCBI hasTaxonomicParent bio:BosNCBI

The consequence of this is that we have a ComparaGRID model organism concept
for Cattle, which subsumes the concepts of DomesticCattle and Zebu. Each of
these has a name/identification(s) and we have asserted the relationships between
NCBI Taxonomy and the model organisms, so that we can explore the taxonomic
classification of the Concepts according to NCBI. Therefore we can choose to
represent Cattle datasets according to either of these contexts: ComparaGRID or
NCBI Taxa, or just by a bare name.
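
The two contexts can be explored with a small lookup over the assertions listed above. This is a sketch only: it stores each isA link as a single parent, which is a simplification of OWL subsumption, and it adds the DomesticCattle and Zebu links to Cattle from the statement that Cattle subsumes both.

```python
# isA assertions (child -> parent) from the model organism concepts.
IS_A = {
    "bio:DomesticCattle": "bio:Cattle",
    "bio:Zebu": "bio:Cattle",
    "bio:Bos_primigeniusNCBI": "bio:Aurochs",
    "bio:Bos_taurusNCBI": "bio:DomesticCattle",
    "bio:Bos_indicusNCBI": "bio:Zebu",
}

# hasTaxonomicParent assertions representing the NCBI taxonomy.
TAXONOMIC_PARENT = {
    "bio:Bos_primigeniusNCBI": "bio:BosNCBI",
    "bio:Bos_taurusNCBI": "bio:BosNCBI",
    "bio:Bos_indicusNCBI": "bio:BosNCBI",
}


def ancestors(concept):
    """Follow isA links upward and return every concept reached."""
    out = []
    while concept in IS_A:
        concept = IS_A[concept]
        out.append(concept)
    return out


def is_cattle(concept):
    """True if the concept is, or is subsumed by, the model organism Cattle."""
    return concept == "bio:Cattle" or "bio:Cattle" in ancestors(concept)
```

A dataset tagged with bio:Bos_indicusNCBI is thus recognised as Cattle in the ComparaGRID context, while TAXONOMIC_PARENT still places it under the genus Bos in the NCBI context.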

8 Capturing Relationships between Concepts in the ComparaGRID

The OWL representation of relationships as binary properties, or alternatively as
Classes of Relationship has been described above (§

Simple binary relationships between classes, where no additional parameters for
the relationship are required, can be represented by ‘object properties’. The domain
and range allowed for each of these properties can be specified (but is not
discussed here), and a simple type hierarchy of properties can be created. The
overall hierarchy of object and datatype (valueSlot) properties is shown in Figure
18. For example Compositional Relationships capture the various types of
‘Partonomy’ relationships, which can be divided into Transitive and Non-
Transitive, and into Mapping Relationships or Structural Relationships etc. (see
Figure 20). The Central Dogma Relationships of molecular biology establish the
relationships between gene and gene products (transcript and protein etc).
Taxonomic Relationships capture the Identification by Name or assignment by
TaxonConcept of any Object being described (see Figure 21). Other relationships
are described in earlier sections where relevant and some are presented in more
detail in Figure 21.

Datatype Properties relate an instance of a Class (an Object) to a data type value,
which might be a floating-point number, as in the case of hasValue, or
a Text String, as in the case of hasName. Some of these datatype properties have been
designed to allow transfer of information that might be present in data sets but is of
no semantic relevance for the functionality of ComparaGRID (e.g. the
representation of ad hoc parameters using parameterName and

Figure 18: The hierarchy of Object [O] and Datatype (valueSlot) [D] Properties in
the Ontology.

The hierarchy of datatype properties is shown in Figure 19. The properties (and
their children): parameterName, hasName, hasSequenceString, chromosomeArm,
taxonomicValueSlot, hasKeyword, hasIdentifier are all restricted to take String
values; hasValue takes a Floating Point Number; hasDate a Date, and other slots
take ‘Any’ type of value.
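
These range restrictions can be sketched as a simple value check in which child properties inherit the restriction of their parent. Only the three hasValue subproperties named in the Figure 7 caption are listed as children here; the rest of the hierarchy is omitted, and the function names are illustrative.

```python
from datetime import date

# Properties restricted to String values.
STRING_PROPS = {
    "parameterName", "hasName", "hasSequenceString", "chromosomeArm",
    "taxonomicValueSlot", "hasKeyword", "hasIdentifier",
}

# Child -> parent links; children inherit the parent's restriction.
PARENT = {
    "hasStartValue": "hasValue",
    "hasEndValue": "hasValue",
    "hasMidPointValue": "hasValue",
}


def root(prop):
    """Walk up to the top-most ancestor of a datatype property."""
    while prop in PARENT:
        prop = PARENT[prop]
    return prop


def value_ok(prop, value):
    """Check a value against the range restriction of its property."""
    top = root(prop)
    if top in STRING_PROPS:
        return isinstance(value, str)
    if top == "hasValue":
        return isinstance(value, float)
    if top == "hasDate":
        return isinstance(value, date)
    return True  # other slots accept 'Any' type of value
```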

Figure 19: The expanded datatype property hierarchy

Figure 20: The expanded hierarchy of compositional
relationships that capture increasingly specific ‘Partonomy’

Figure 21: More Object Properties expanded to show the specialisation of
Relationships represented.

A far richer representation of relationships is possible using OWL Classes, which
allows the representation of N-ary and Parameterized Relationships (see §). A
major advantage of this form of representation in the ontology is that we can attach
evidenceProperties to any Relationship that is represented as a Class.
ComparaGRID (or a ComparaGRID user) could then use such evidence (evidence
codes, citations, authorships etc.) to weight the reliability of different mapping

The ontology defines two types of relationships in the service module, promoting
reusability of the definitions across the modules. A (directional) Relationship
Class uses the properties relatesFrom:<owl:Thing> and relatesTo:<owl:Thing>,
whilst its subclass BidirectionalRelationship uses the property relates:<owl:Thing>
to relate any number of objects in a non-directional relationship. The
biological relationships defined in the bio.owl module inherit these properties from
the generic Relationship (or can be automatically classified according to their

Currently we have represented as types of Relationship Class those relationships
that are necessarily N-ary or need to be parameterized, and these are shown in
Figure 22. Most of the Relationship types have unrestricted ‘Domains’ and
‘Ranges’, in that any DomainConcept can be the Object of the relates Property.
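
The Relationship pattern can be sketched with a single data class whose fields mirror relatesFrom, relatesTo and relates, plus a slot for evidence. This is an illustration rather than the OWL definition: the is_bidirectional test only imitates how a reasoner might classify a relationship by the properties it uses, and the example individuals and the similarity score are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    relates_from: list = field(default_factory=list)  # relatesFrom
    relates_to: list = field(default_factory=list)    # relatesTo
    relates: list = field(default_factory=list)       # relates (non-directional)
    evidence: list = field(default_factory=list)      # attached evidence


def is_bidirectional(rel):
    """A relationship that uses only `relates` is treated as bidirectional."""
    return bool(rel.relates) and not (rel.relates_from or rel.relates_to)


# Directional: a DomainConcept identified by a TaxonName.
identification = Relationship(relates_from=["bio:Cattle"], relates_to=["bio:Bos_taurus"])

# Non-directional, N-ary and parameterized (hypothetical individuals and score).
similarity = Relationship(relates=["seqA", "seqB", "seqC"], evidence=["score=0.93"])
```

An object property could express the first example, but not the three-way relationship or the evidence attached to the second.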

It may be preferable to refactor more of the current Object Properties in this
fashion as Relationship Classes, particularly to allow the attachment of evidence.
Currently there is some redundancy in the ontology, where it is possible to capture
a given relationship using either an Object Property or a Class Relationship. Once
the ontology has been tested with real data it will be desirable to choose the single
most appropriate design pattern for each relationship. For example
TaxonomicIdentification is currently represented as a subclass of Relationship
(which relatesFrom:DomainConcept relatesTo:TaxonName) and by the Object
Property hasTaxonomicIdentification (with Domain:DomainConcept and
Range:TaxonName). Sequence relationships can also be represented by types of
sequenceRelationship sub-properties in addition to the DNASequenceSimilarity
Class (a subclass of SequenceSimilarity in Figure 22); however, the Object
Property representation is far less flexible than the Class Relationship which could
be parameterized with similarity scores, evidence etc.

Note that a temporary test Relationship type (NucleotideSequenceRelationship)
has been included which will allow direct sequence-to-sequence mappings to be
expressed without going through a map extraction (see §

Figure 22: The hierarchy of Relationships represented as OWL Classes, following
reclassification according to their asserted properties. The Classes shown in blue
can be inferred to be Bidirectional relationships because they use the property
relates in place of relatesFrom/relatesTo.

9 Glossary

Class OWL classes are interpreted as sets that contain
individuals. They are described using formal
(mathematical) descriptions that state precisely the
requirements for membership of the class. For example,
the class Cat would contain all the individuals that are
cats in our domain of interest. Classes may be organised
into a superclass-subclass hierarchy, which is also known
as a taxonomy. Subclasses specialise (‘are subsumed by’)
their superclasses. For example consider the classes
Animal and Cat – Cat might be a subclass of Animal (so
Animal is the superclass of Cat). This says that, ‘All cats
are animals’, ‘All members of the class Cat are members
of the class Animal’, ‘Being a Cat implies that you’re an
Animal’, and ‘Cat is subsumed by Animal’. One of the
key features of OWL-DL is that these superclass-subclass
relationships (subsumption relationships) can be
computed automatically by a ‘reasoner’. (From

Concept The word concept is sometimes used in place of class.
Classes are a concrete representation of concepts.

Description Logics Description Logics (DLs) are an important family of
knowledge representation formalisms that is rather
closely related to many Modal and Dynamic Logics. The
main effort of the research in knowledge representation is
providing theories and systems for expressing structured
knowledge and for accessing and reasoning with it in a
principled way. Description Logics are considered the
most important knowledge representation formalism
unifying and giving a logical basis to the well-known
traditions of Frame-based systems, Semantic Networks
and KL-ONE-like languages, Object-Oriented
representations, Semantic data models, and Type

GUID Globally Unique Identifier

ITIS Integrated Taxonomic Information System

NCBI National Center for Biotechnology Information

A Practical Guide To Building OWL Ontologies Using The Protégé-OWL Plugin and CO-ODE
Tools Edition 1.0 (Matthew Horridge, Holger Knublauch, Alan Rector, Robert Stevens, Chris

OWL The OWL Web Ontology Language is designed for use
by applications that need to process the content of
information instead of just presenting information to
humans. OWL facilitates greater machine interpretability
of Web content than that supported by XML, RDF, and
RDF Schema (RDF-S) by providing additional
vocabulary along with a formal semantics. OWL adds
vocabulary for describing properties and classes: among
others, relations between classes (e.g. disjointness),
cardinality (e.g. "exactly one"), equality, richer typing of
properties, characteristics of properties (e.g. symmetry),
and enumerated classes. OWL has three increasingly
expressive sublanguages: OWL Lite, OWL DL, and
OWL Full.

OWL DL OWL DL supports maximum expressiveness while
retaining computational completeness (all conclusions are
guaranteed to be computable) and decidability (all
computations will finish in finite time). OWL DL is so
named due to its correspondence with description logics,
a field of research that has studied the logics that form the
formal foundation of OWL.

OWL Full OWL Full provides the maximum expressiveness and
syntactic freedom of RDF with no computational
guarantees. It is unlikely that any reasoning software will
be able to support complete reasoning for every feature of
OWL Full.

OWL Lite OWL Lite primarily supports a classification hierarchy
with simple constraints. For example, for expressing
controlled vocabularies and classifications.

Property Properties are binary relations on individuals - i.e.
properties link two individuals together. For example, the
property hasSibling might link the individual Matthew to
the individual Gemma, or the property hasChild might
link the individual Peter to the individual Matthew.
Properties can have inverses. For example, the inverse of
hasOwner is isOwnedBy. Properties can be limited to
having a single value – i.e. to being functional. They can
also be either transitive or symmetric. (From

RDF The Resource Description Framework (RDF) is a data
model for objects ("resources") on the World Wide Web
and relations between them; it provides simple semantics
for this data model, and these data models can be
represented in an XML syntax. It is particularly intended
for representing metadata about Web resources, such as
the title, author, and modification date of a Web page
etc., or the availability schedule for some shared
resource. However, by generalizing the concept of a
"Web resource", RDF can also be used to represent
information about things that can be identified on the
Web, even when they cannot be directly retrieved on the

RDF Schema A vocabulary for describing properties and classes of
RDF resources, with a semantics for generalization-
hierarchies of such properties and classes.

Relation(ship)s An ontology describes the concepts in the domain and
also the relationships that hold between those concepts.
In different formalisms these relationships are
represented variously by ‘Properties’ (OWL), ‘Slots’
(Protégé), ‘Roles’ (Description Logics), ‘Relations’
(UML) and ‘Attributes’ (GRAIL).

Semantic Web The Semantic Web is a vision for the future of the Web,
in which information is given explicit meaning, making it
easier for machines to automatically process and integrate
information available on the Web. The Semantic Web
will build on XML's ability to define customized tagging
schemes and RDF's flexible approach to representing
data. The first level above RDF required for the Semantic
Web is an ontology language that can formally describe
the meaning of terminology used in Web documents. If
machines are expected to perform useful reasoning tasks
on these documents, the language must go beyond the
basic semantics of RDF Schema.

XML Extensible Mark-up Language (XML) is a simple, very
flexible text format derived from SGML (ISO 8879).
Originally designed to meet the challenges of large-scale
electronic publishing, XML is also playing an
increasingly important role in the exchange of a wide
variety of data on the Web and elsewhere. XML provides
a surface syntax for structured documents, but imposes
no semantic constraints on the meaning of these

XML Schema A language for restricting the structure of XML
documents, which also extends XML with datatypes.