D2.2_Taxonomy_Definition Final

CERIF for Datasets

D2.2 Taxonomy Definition

Filename: D2.2 Taxonomy Definition
Circulation: Restricted (PP)
Version: 2.0
Date: 21 March 2012
Stage: Draft
Authors: Albert Bokma and Sheila Garfield
Partners: University of Sunderland, University of Glasgow, University of St Andrews, Engineering and Physical Sciences Research Council, Natural Environment Research Council, EuroCRIS

C4D is a project funded under JISC's Managing Research Data Programme

C4D   D2.2 Taxonomy Definition   Version 1.1 of 14 Mar 2012   D2.2 Taxonomy Definition.doc

COPYRIGHT

© Copyright 2011 The C4D Consortium. All rights reserved.

This document may not be copied, reproduced, or modified in whole or in part for any purpose without written permission from the C4D Consortium. In addition to such written permission to copy, reproduce, or modify this document in whole or in part, an acknowledgement of the authors of the document and all applicable portions of the copyright notice must be clearly referenced.

This document may change without notice.


DOCUMENT HISTORY

Version | Issue Date | Stage | Content and changes
V1.0 | 14 Mar 2012 | Draft |

Table of Contents





















1 BACKGROUND AND CONTEXT

The goal of the CERIF for Datasets (C4D) project is to investigate candidate solutions for dealing effectively with research dataset metadata using CERIF. The issue is attracting increasing interest because the effective management of datasets is becoming a requirement on research institutions and fund holders, both through a variety of legislation and as a specific condition of funding from most research councils in the UK.


With rapidly growing numbers of datasets to which this applies, there is an urgent need to deal with the storage and management of the actual datasets on the one hand and with their accessibility and discoverability on the other. The C4D project deals more specifically with the generation and management of metadata about datasets for the purpose of recording and discovery, using CERIF as the fundamental data model.


CERIF is a standard for modelling and managing research information, with a current focus on recording information about people, projects, organisations, publications, patents, products, funding, events, facilities, and equipment. While other aspects of output and impact are already supported explicitly by CERIF, datasets are currently not fully supported in the same way. As explored in deliverable D2.1 Metadata Ontology, there is partial support through the cfResultProduct entity, but this needs to be extended to support the richer set of key elements required across a variety of disciplines. One aspect that needs further exploration is classification, which helps to organise datasets in repositories and make them more readily discoverable. This issue was touched upon in deliverable D2.1 Metadata Ontology and is explored further here.


The purpose of this document is to explore possible classification schemes and to recommend the most suitable scheme to adopt for the purposes of C4D. As it stands, several considerations will influence this choice:

• Suitability of classification schemes for the purpose of cataloguing scientific datasets
• Integration between metadata repositories and data repositories holding actual datasets
• Ability to facilitate meaningful discovery in terms of retrieval and browsing
• Currently adopted standards already in wide circulation, to minimise additional and perhaps unrealistic demands for reclassification or additional classification of metadata and dataset records


In the following sections we look in turn at a number of important aspects. In section 2 we consider common classification schemes and discuss their suitability, before discussing current UK practice in section 3. In section 4 we discuss schemes already in use in the UK research community. In section 5 we discuss the implications for the C4D pilot and ontology.

2 UNIVERSAL CLASSIFICATION SCHEMES

In this section we revisit our initial discussion of classification and its use in the ontology from D2.1 Metadata Ontology, and extend it in view of the need to find a suitable classification approach and scheme for C4D.


In any discipline, the organisation of relevant information is essential so that users are able to access and discover the information resources they need to carry out their tasks. Classification is important both for organising collections in a meaningful and manageable way and for enabling users to inspect those collections. To this end several approaches have been proposed and are in wide use:




• Subject classification schemes describe resources by their subject and are a means of organising knowledge in libraries and other information repositories and services.
• Classification schemes differ from subject indexing systems (subject headings, thesauri, etc.) in that they arrange related resources in a hierarchical structure. They aid information retrieval by providing a user-friendly, directory-type browsing structure for subject-based information, which makes finding and retrieving resources easier.


There are a number of classification schemes in use, with a marked focus on library and information science, which are designed to organise and classify the world of knowledge and its contents. The most widely used universal classification schemes are the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC), and the Library of Congress Classification (LCC). (See D2.1 Metadata Ontology, section 6, for further details.)


At the broadest level, the structure of universal classification schemes, i.e. DDC, UDC and LCC, is based on disciplines that are recognised fundamental fields of study, such as Philosophy, Social sciences, Science, Technology and the Arts. Disciplines have their sub-disciplines, e.g. sciences include Physics and Chemistry, and social sciences include Sociology and Economics. In all of them the arrangement of concepts is hierarchical.


For the purpose of C4D, it will be important to find a suitable classification scheme that will enhance the ability to classify datasets correctly and reliably to aid their discovery:

• The Dewey Decimal Classification, Universal Decimal Classification and Library of Congress Classification schemes mentioned above, while universal, have a library bias: they are also concerned with correctly shelving and finding items in a library. What is useful to learn from them, however, is the benefit of a taxonomic, hierarchical system that allows retrieval and browsing to be expanded or restricted to suit the users' needs.
• Other systems, such as UNESCO-SPINES or Frascati (see D2.1 Metadata Ontology, section 6), have the benefit of being oriented towards information and information retrieval rather than libraries, which is good in principle. In terms of scientific disciplines, sub-disciplines and fields of research, however, they do not provide the right level of granularity to classify large collections of datasets in a way that lets interested parties drill down, both to classify datasets for later discovery and to reliably find relevant datasets for a given task. A further problem is that they are not widely used among the UK research community and would thus require additional effort to get adopted: in principle all repositories and metadata generation tools would need to be adapted to use them, causing problems both for the management of legacy collections and for the classification of emerging datasets and the metadata about them.
• A typical problem is insufficient specificity with respect to sub-areas in different disciplines. At the same time no universal and detailed system exists that can be applied immediately, which may leave only the option of collecting subject-specific classification schemes for specific disciplines. There are more generic schemes with a universal focus that could be applied, such as PAIS, but these are again very coarse for each specific subject area. In fact, the fit with current research fields, and with their categorisation and place in the research landscape in taxonomic terms, is vital to a solution that is both appropriate and acceptable.


Despite their appeal in terms of universality, these existing classification systems are not well suited to a scientific focus and lack the variety needed to cover the vast array of disciplines, sub-disciplines and fields of research that have grown around them. That coverage will be essential for correctly identifying meaningful categories when generating metadata for new datasets, and for aiding their management in terms of storage, curation and discovery. As a consequence of these limitations it seems unlikely that these schemes will provide a suitable solution, and instead we will have to turn our attention to current developments in the various research councils.

3 UK CURRENT PRACTICE

As early as 2005, the Executive Group of Research Councils UK (RCUK) deliberated on access to research outputs. As an ongoing policy, the research councils remain committed to these principles, based on the premise that:

• Ideas and knowledge derived from publicly funded research must be made available and accessible for public use as widely and effectively as possible.
• The models and mechanisms for publication and access to research results must be both efficient and cost-effective in the use of public funds.
• The outputs from current and future research must be preserved and remain accessible for future generations.


Archival

Research councils consequently require that their funded researchers deposit the outputs from research-council-funded research in an acceptable repository, as designated by the individual research council. Each research council will provide guidance advising funded researchers on deposit and on generating metadata, both about publications and about relevant results such as datasets.


Open Access

As the public bodies charged with investing taxpayers' money in science and research, the Research Councils take on board their responsibilities in making the outputs from this research publicly available, not just to other researchers but also to potential users in business, Government and the public sector, and to the public. To this end each Research Council publishes comprehensive information about its own research outputs and achievements. They are committed to the guiding principles that publicly funded research must be made available to the public and remain accessible for future generations.


Following a wide consultation with stakeholders, RCUK published a position statement on access to research outputs in June 2006; individual Research Councils then published their own position statements.


In 2008, RCUK funded an independent study into open access, which was conducted by SQW Consulting and LISU, Loughborough University. Its purpose was to identify the effects and impacts of open access on publishing models and institutional repositories in light of national and international trends, including the impact of open access on the quality and efficiency of scholarly outputs, specifically journal articles. The report from the study was published in April 2009.


In response to this, the Chief Executives of the Research Councils have agreed that over time the UK Research Councils will support increased open access by:

• building on their mandates on grant holders to deposit research papers in suitable repositories within an agreed time period; and
• extending their support for publishing in open access journals, including through the pay-to-publish model.




RCUK and HEFCE joint commitment to open access


As a result, RCUK and HEFCE have a shared commitment to maintaining and improving the capacity of the UK research base to undertake research activity of world-leading quality, and to ensuring that significant outputs from this activity are made available as widely as possible both within and beyond the research community.


To achieve this, open access needs to be implemented with clear licensing agreements and sustainable business models, working with the grain of established research cultures and practices. HEFCE and the Research Councils will work together, and with other interested bodies, to support a managed transition to open access over the medium term, and welcome the work of the UK Open Access Implementation Group in support of this aim.


What C4D should take from these developments is the commitment to open access and the ongoing collaboration between the research councils over the research output they directly or indirectly steward, which includes repositories of datasets. While some research councils have less developed solutions in place, NERC and EPSRC are at the forefront in actually maintaining their own data centres for the curation of datasets. It is also worth noting that RCUK is currently adopting some practical approaches to classification, which we discuss in the following section.

4 SHARED CLASSIFICATION SCHEMES CURRENTLY IN USE

As far as current practice in the research councils is concerned, two important developments are in use not just for datasets but also for other areas, including publications, grants and grant applications:

• the RCUK Subject Classification Scheme (and the Research Outcomes System (ROS))
• the JACS subject coding system (HESA)


The UK Research Councils already use classification systems to manage their activities, and awards in particular. NERC, for example, currently uses the scheme at http://www.nerc.ac.uk/funding/application/topics.asp. Amongst the prevailing schemes are the RCUK Subject Classification Scheme and JACS from HESA (http://www.hesa.ac.uk/dox/jacs/JACS_complete.pdf).


With respect to the aim of building an infrastructure for the management of research information covering researchers, publications, projects and outcomes such as datasets, there are two current activities C4D should take note of:


4.1 RCUK Classification Scheme

A core part of the Research Councils' remit is making research data available to users. RCUK is committed to a transparent and coherent approach, and its common principles on data policy provide an overarching framework for the individual Research Council data policies. The information gathered is fundamental to the Research Councils in strengthening their evidence base for strategy development, and crucial in demonstrating the benefits of Research Council funded research to society and the economy.


The Research Outcomes System (ROS) allows users to provide research outcomes to four of the Research Councils: the Arts and Humanities Research Council (AHRC), the Biotechnology and Biological Sciences Research Council (BBSRC), the Economic and Social Research Council (ESRC), and the Engineering and Physical Sciences Research Council (EPSRC). The ROS can be used by grant holders of these Research Councils to input outcomes information about their research. It can also be used by Higher Education Institution (HEI) research offices to input research outcomes information on behalf of grant holders and/or to access the outcomes information of grant holders in their institution.


Grant holders, research office managers and associated staff input outcomes information, comprising both narrative and data, under nine different categories, or outcome types, as follows:

• Publications
• Other Research Outputs
• Collaboration/Partnership
• Dissemination/Communication
• IP and Exploitation
• Award/Recognition
• Staff Development
• Further Funding
• Impact


Each of these categories has various sub-sections, with explanatory guidance, to help users be as specific as possible when inputting information. In addition, grant holders are asked to provide a brief summary of the Key Findings arising from their research. This information can be made available to research users to stimulate engagement and collaboration.


The ROS categories provide a comprehensive, hierarchical set of elements for inputting information about outcomes from research. Within each of the top-level outcome types (categories) there are additional selection criteria or options available at a first-tier and a second-tier level. For example, the top-level "Other Research Outputs" outcome type has first-tier options for biological outputs, electronic outputs and physical outputs; for electronic outputs, at the second tier, there is the option of selecting 'dataset', which is described as "a structured record of the value of variables that were measured as part of the research".
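The three-tier structure described above can be pictured as a nested mapping. The following is only a minimal sketch under stated assumptions, not the actual ROS data model: only the "Other Research Outputs" branch named in the text is filled in, the remaining tiers are left empty rather than guessed, and the names `ROS_OUTCOME_TYPES` and `find_paths` are hypothetical.

```python
# Illustrative fragment of the ROS outcome-type hierarchy as described in the
# text: outcome type -> first-tier option -> list of second-tier options.
# Tiers that the text does not name are left empty on purpose.
ROS_OUTCOME_TYPES = {
    "Publications": {},
    "Other Research Outputs": {
        "Biological outputs": [],
        "Electronic outputs": ["Dataset"],
        "Physical outputs": [],
    },
    "Collaboration/Partnership": {},
    "Dissemination/Communication": {},
    "IP and Exploitation": {},
    "Award/Recognition": {},
    "Staff Development": {},
    "Further Funding": {},
    "Impact": {},
}


def find_paths(tree, option):
    """Return every (outcome type, first tier, second tier) path ending in `option`."""
    paths = []
    for outcome_type, first_tier in tree.items():
        for first, second_tier in first_tier.items():
            for second in second_tier:
                if second.lower() == option.lower():
                    paths.append((outcome_type, first, second))
    return paths


# Locate where a dataset would be recorded in this sketch of the hierarchy.
print(find_paths(ROS_OUTCOME_TYPES, "dataset"))
# [('Other Research Outputs', 'Electronic outputs', 'Dataset')]
```

A lookup of this kind shows why the hierarchy matters for C4D: a dataset is reached only by descending through the outcome type and the first-tier option.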


The RCUK classification scheme has 78 first-level categories, with subcategories down to level 3. The main categories and a selection of subcategories are shown in the list below. Please note that the resolution at level 3 can vary from a single subcategory to 30 or more:





• Agri-environmental science
  o Agricultural systems
  o Crop protection
  o Crop science
  o Earth and environmental
  o Soil science
• Animal science
  o 3Rs [Reduction, Refinement, Replacement]
  o Animal and human physiology
  o Animal behaviour
  o Animal diseases
  o Animal organisms
  o Animal reproduction
  o Animal welfare
  o Endocrinology
  o Immunology
  o Livestock production
  o Musculoskeletal system


  o Parasitology
  o Psychology
  o Systems neuroscience
• Archaeology
  o Archaeological Theory
  o Archaeology Of Human Origins
  o Archaeology Of Literate Societies
  o Industrial Archaeology
  o Landscape & Environmental Archaeology
  o Maritime Archaeology
  o Palaeobiology
  o Prehistoric Archaeology
  o Quaternary Science
  o Science-Based Archaeology
• Astronomy - observation
  o Astronomy & Space Science Technologies
  o Data Handling & Storage
  o Extra-Galactic Astronomy & Cosmology
  o Galactic & Interstellar Astronomy
  o Instrumentation for Particle Physics Or Astronomy
  o Stellar Astronomy
• Astronomy - theory
  o Computational Methods and Tools
  o Extra-Galactic Astronomy & Cosmology
  o Galactic & Interstellar Astronomy
  o Stellar Astronomy
• Atmospheric physics and chemistry
  o Atmospheric Kinetics
  o Boundary Layer Meteorology
  o Land - Atmosphere Interactions
  o Large Scale Dynamics/Transport
  o Ocean - Atmosphere Interactions
  o Radiative Processes & Effects
  o Stratospheric Processes
  o Tropospheric Processes
  o Upper Atmosphere Processes & Geospace
  o Water In The Atmosphere
• Atomic and molecular physics
  o Atoms and Ions
  o Cold Atomic Species
  o Fusion
  o Light-Matter Interactions
  o Scattering & Spectroscopy
• Bioengineering
  o Biochemical engineering
  o Bioenergy
  o Biomaterials
  o Bionanoscience
  o Bioreactors
  o Environmental biotechnology
  o Macro-molecular delivery
  o Metabolic engineering
  o Novel industrial products
  o Protein engineering
  o Tissue engineering
• Biomolecules and biochemistry
  o Ageing: chemistry/biochemistry
  o Biochemistry and physiology
  o Bioenergetics
  o Biological membranes
  o Biophysics
  o Carbohydrate Chemistry


  o Catalysis and enzymology
  o Chemical Biology
  o Multiprotein complexes
  o Protein chemistry
  o Protein expression
  o Protein folding / misfolding
  o Structural biology
• Catalysis and surfaces
  o Catalysis and Applied Catalysis
  o Complex fluids and soft solids
  o Electrochemical Science and Engineering
  o Surfaces and Interfaces
• Cell biology
  o Ageing: chemistry/biochemistry
  o Cell cycle
  o Cells
    ▪ Adipocytes
    ▪ B cells
    ▪ Basophils
    ▪ Blastocyst
    ▪ Cells
    ▪ Chondrocyte
    ▪ Collenchyma
    ▪ Dendritic cells
    ▪ Endothelial cells
    ▪ Eosinophils
    ▪ Epidermal cells
    ▪ Epithelial cells
    ▪ Erythrocytes
    ▪ Granulocytes
    ▪ Guard cells
    ▪ Haematopoietic stem cells
    ▪ Hepatocytes
    ▪ Keratinocytes
    ▪ Leukocytes
    ▪ Lymphocytes
    ▪ Macrophages
    ▪ Mast cells
    ▪ Melanocytes
    ▪ Myocytes
    ▪ Natural killer cells
    ▪ Neutrophils
    ▪ Oocytes
    ▪ Osteoblasts
    ▪ Osteoclasts
    ▪ Osteocytes
    ▪ Pancreatic alpha cells
    ▪ Pancreatic beta cell
    ▪ Parenchyma
    ▪ Photoreceptor cells
    ▪ Platelets
    ▪ Regulatory T cells
    ▪ Sclerenchyma
    ▪ Stem cells
    ▪ Stromal cells
    ▪ T Cells
    ▪ Trophoblast
  o Communication and signalling
  o Organelles and components
  o Receptors
  o Stem cell biology






• Chemical reaction dynamics and mechanisms
• Chemical measurement
• Chemical synthesis
• Civil engineering and built environment
• Classics
• Climate and climate change
• Complexity science
• Cultural and museum studies
• Dance
• Demography and human geography
• Design
• Development studies
• Drama and theatre studies
• Ecology, biodiversity and systematics
• Economics
• Education
• Electrical engineering
• Energy
• Environmental engineering
• Environmental planning
• Facility Development
• Food science and nutrition
• Genetics and development
• Geosciences
• History
• Information and communication technologies
• Instrumentation, sensors and detectors
• Languages and Literature
• Law and legal studies
• Library and information studies
• Linguistics
• Management and business studies
• Manufacturing
• Marine environments
• Materials processing
• Materials sciences
• Mathematical sciences
• Mechanical engineering
• Media
• Medical and health interface
• Microbial sciences
• Music
• Nuclear physics
• Omic sciences and technologies
• Optics, photonics and lasers
• Particle astrophysics
• Particle physics - experiment
• Particle physics - theory
• Philosophy
• Planetary science
• Plant and crop science
• Plasma physics
• Political science and international studies
• Pollution, waste and resources
• Process engineering






• Psychology
• Social anthropology
• Social policy
• Social work
• Sociology
• Solar and terrestrial physics
• Superconductivity, magnetism and quantum fluids
• Systems engineering
• Terrestrial and freshwater environments
• Theology, divinity and religion
• Tools, technologies and methods
• Visual arts

Table 1: The RCUK Subject Classification Scheme


4.2 JACS Classification Scheme

The Joint Academic Classification of Subjects (JACS) system is used by the Higher Education Statistics Agency (HESA) and the Universities and Colleges Admissions Service (UCAS) in the United Kingdom to classify academic subjects, especially for undergraduate degrees.


A JACS code for a single subject consists of a letter and three numbers. The letter represents the broad subject classification and the subsequent numbers represent further details, similar to the Dewey Decimal System. For example, F represents the Physical Sciences, F300 Physics, F330 Environmental Physics and F331 Atmospheric Physics.
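The way a JACS code encodes its own position in the hierarchy can be sketched in a few lines. This is an illustrative sketch only: it assumes that a zero digit means "no further subdivision at this level", a reading inferred from the F300/F330/F331 example above rather than taken from the JACS specification, and the function names `jacs_ancestors` and `is_within` are hypothetical.

```python
import re

# One capital letter followed by exactly three digits, e.g. "F331".
JACS_CODE = re.compile(r"[A-Z][0-9]{3}")


def jacs_ancestors(code):
    """Return the chain from the subject-group letter down to `code`.

    Assumption: a zero digit adds no detail at that level, as in the
    F -> F300 -> F330 -> F331 example from the text.
    """
    if not JACS_CODE.fullmatch(code):
        raise ValueError(f"not a JACS-style code: {code!r}")
    letter, digits = code[0], code[1:]
    chain = [letter]
    for depth in (1, 2, 3):
        if digits[depth - 1] == "0":
            continue  # this digit does not introduce a new level
        chain.append(letter + digits[:depth].ljust(3, "0"))
    return chain


def is_within(code, broader):
    """True if `broader` is `code` itself or one of its ancestors."""
    return broader in jacs_ancestors(code)


print(jacs_ancestors("F331"))     # ['F', 'F300', 'F330', 'F331']
print(is_within("F331", "F300"))  # True
```

Because the ancestry is recoverable from the code alone, a repository need only store the most specific code for each record in order to support browsing at any broader level.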


JACS Codes

Letters are assigned to the subject groups as follows. Note that in the proposed version 3 of JACS, Computer Sciences are split from Mathematics and assigned code letter I.


Letter | Subject Group | Major subgroups
A | Medicine and Dentistry | A100 Pre-clinical Medicine, A200 Pre-clinical Dentistry, A300 Clinical Medicine, A400 Clinical Dentistry
B | Subjects allied to Medicine | B100 Anatomy, Physiology and Pathology, B200 Pharmacology, Toxicology and Pharmacy, B300 Complementary Medicine, B400 Nutrition, B500 Ophthalmics, B600 Aural and Oral Sciences, B700 Nursing, B800 Medical Technology
C | Biological Sciences | C100 Biology, C200 Botany, C300 Zoology, C400 Genetics, C500 Microbiology, C600 Sports Science, C800 Psychology
D | Veterinary Sciences, Agriculture and related subjects | D100 Pre-clinical Veterinary Medicine, D400 Agriculture, D500 Forestry, D600 Food and Beverage Studies
F | Physical Sciences | F100 Chemistry, F200 Materials Science, F300 Physics, F400 Forensics and Archaeology, F600 Geology, F700 Ocean Sciences, F800 Physical and Terrestrial Geographical and Environmental Sciences, F840 Physical Geography
G | Mathematical and Computer Sciences | G100 Mathematics, G300 Statistics, G400 Computer Science, G600 Software Engineering, G700 Artificial Intelligence
H | Engineering | H200 Civil, H300 Mechanical, H400 Aerospace, H500 Naval Architecture, H700 Production and Manufacturing, H800 Chemical
J | Technologies | J200 Metallurgy, J300 Ceramics and Glasses, J400 Polymers and Textiles, J500 Materials Technology
K | Architecture, Building and Planning | K100 Architecture, K200 Building, K400 Planning
L | Social studies | L100 Economics, L200 Politics, L300 Sociology, L400 Social Policy, L500 Social Work, L600 Anthropology, L700 Human and Social Geography
M | Law | M100 Law by geographical area, M200 Law by topic
N | Business and Administrative studies | N100 Business Studies, N200 Management, N300 Finance, N400 Accounting, N500 Marketing
P | Mass Communications and Documentation | P300 Media Studies, P500 Journalism
Q | Linguistics, Classics and related subjects | Q100 Linguistics, Q500 Celtic Studies
R | European Languages, Literature and related subjects | R100 French Studies, R200 German Studies, R300 Italian Studies, R400 Hispanic Studies, R600 Scandinavian Studies, R700 Russian Studies
T | Eastern, Asiatic, African, American and Australasian Languages, Literature and related subjects | T100 Chinese Studies, T500 African Studies
V | Historical and Philosophical studies | V100 History by period, V200 History by area, V350 History of Art, V400 Archaeology, V500 Philosophy, V600 Theology and Religious Studies
W | Creative Arts and Design | W100 Fine Art, W200 Design, W300 Music, W400 Drama, W500 Dance, W600 Cinematics and Photography, W700 Crafts, W800 Creative Writing
X | Education | X100 Training Teachers
Y | Not used | Not used
Z | Not used | Not used

Table 2: The JACS Classification Scheme


To compare the relative resolution, and thus specificity, of the JACS scheme with that of the RCUK scheme, the same topic area of microbiology, highlighted earlier in the RCUK scheme, has been selected. The corresponding JACS codes are shown below:


Code | Subject
C500 | Microbiology
C510 | Applied microbiology
C520 | Medical & veterinary microbiology
C521 | Medical microbiology
C522 | Veterinary microbiology
C530 | Bacteriology
C540 | Virology
C550 | Immunology
C570 | Serology
C590 | Microbiology not elsewhere classified

Table 3: The JACS C500 Microbiology Subcategories


Subject organisation in classification schemes is language independent; the subject is symbolised by a class number, which allows for cross-language and cross-collection information discovery. Material on a certain subject will be co-located under the same class number irrespective of the language in which it is written or the language used by the cataloguing centre assigning the classification number. Thus a classification scheme is an indexing and retrieval language. It groups related items into classes, and arranges such groups in a hierarchy so that users can trace topics in their context and scan subject fields from general to specific or vice versa. It organises and presents those resources in such a way that the user can retrieve all the relevant resources quickly and easily.


The use of a classification scheme offers one solution to providing improved access to resources. Their hierarchical nature means that classification schemes can also be used to provide an overview of resources covering broader or narrower topics as the user moves up or down the hierarchy. As a result, users are given the opportunity to view related resources which may be relevant to their information needs. There is, however, no requirement to use all layers of the chosen classification scheme hierarchy. Some current schemes organise material based on only the first three levels of a decimal scheme like DDC. The key point about the large established library classification schemes is that they are universal schemes; they are built to classify an entire world with all its content, and a user can, therefore, find most things using them.


Classification can be used to make searches more powerful and to limit the number of irrelevant hits for the user. Searching with a classification scheme can be offered in different ways in the user interface. For example, sections of the classification scheme can be offered as a filter (or option) in the search, limiting the results of the query to a certain topical part of the data store.


The general advantages of using a classification scheme are:

• It helps bring together collections of similar resources
• The use of a systematic, well-supported hierarchical structure supports the browsing of these collections
• It can help to overcome problems of unfamiliar terminology, allowing non-specialists to find information through subject browsing
• It gives a context for search terms and allows filtering and high precision searches


Browsing, especially of a hierarchical, directory-type structure, is particularly helpful for
inexperienced users or for users not familiar with a subject and its structure and terminology. In
addition, the structure of the classification scheme can be displayed in different ways as an aid to
navigation. Users are typically able to choose categories from a subject hierarchy and to use these to
make their way through the structure, moving down the individual branches of a subject tree.
Hierarchical classification schemes can therefore be used for broadening and narrowing searches:
broadening can be used to improve recall, while narrowing can be used to reduce the number of
false hits and so improve precision. Querying can be used for filtering, to narrow or limit a search to
individual parts of a collection, which reduces the number of false hits. Also, the use of a
classification scheme gives context to the search terms used. Words which have the same form and
spelling but a different meaning (homonyms) can cause a problem, but this can be partly overcome
because the context of the broader subject area or discipline will, in most cases, unambiguously
indicate their meaning.
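The broadening and narrowing of searches described above can be sketched as follows. This is only a
minimal illustration and not part of the C4D design: the dotted category codes, the record layout and
the function names are hypothetical, and it is assumed that a code spells out a category's position in
the hierarchy.

```python
# Hypothetical records: (identifier, classification code). The dotted codes
# are invented for illustration and do not come from any real scheme.
RECORDS = [
    ("ds-01", "bio.micro.virology"),
    ("ds-02", "bio.micro.bacteriology"),
    ("ds-03", "bio.genetics"),
    ("ds-04", "chem.organic"),
]


def in_category(code: str, category: str) -> bool:
    """True if `code` is `category` itself or lies below it in the hierarchy."""
    return code == category or code.startswith(category + ".")


def search(records, category):
    """Filter the records to one topical part of the collection."""
    return [rid for rid, code in records if in_category(code, category)]


def broaden(category: str) -> str:
    """Move one level up the hierarchy, trading precision for recall."""
    parent, _, _ = category.rpartition(".")
    return parent or category  # a top-level category has no parent


# Narrow search: only the chosen leaf category.
narrow_hits = search(RECORDS, "bio.micro.virology")
# Broadened search: everything under the parent category.
broad_hits = search(RECORDS, broaden("bio.micro.virology"))
```

Matching on the code followed by a separator, rather than on the bare prefix, avoids accidental hits
where one category code merely begins with the characters of another.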


Usually an established classification scheme does not become obsolete. The larger schemes do
undergo revision, which is formally published in numbered editions. As such, some classifications may
have to be changed when a new edition of a scheme is published; however, it is unlikely that every
single resource will have to be reclassified. Therefore, if the metadata created is classified using an
established classification scheme, the C4D subject demonstrator should be able to generate a
browsing structure from this information.
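As a rough sketch of how a browsing structure might be generated from classified metadata, the
following example groups records under every level of their category path. The record layout, the
subject names and the function names are assumptions made for illustration; they are not taken from the
C4D demonstrator or from any particular classification scheme.

```python
from collections import defaultdict

# Hypothetical metadata records: (identifier, category path). The subject
# names are illustrative and are not taken from any particular scheme.
RECORDS = [
    ("ds-01", ("Biology", "Microbiology")),
    ("ds-02", ("Biology", "Genetics")),
    ("ds-03", ("Chemistry",)),
]


def build_browse_tree(records):
    """Group record identifiers under every level of their category path."""
    tree = defaultdict(list)
    for rid, path in records:
        # A record filed under ("Biology", "Microbiology") is also listed at
        # ("Biology",), so that it can be reached by browsing downwards.
        for depth in range(1, len(path) + 1):
            tree[path[:depth]].append(rid)
    return dict(tree)


def children(tree, parent=()):
    """Return the categories directly below `parent`, in sorted order."""
    depth = len(parent)
    return sorted(p for p in tree if len(p) == depth + 1 and p[:depth] == parent)


tree = build_browse_tree(RECORDS)
top_level = children(tree)                    # entry points for browsing
under_biology = children(tree, ("Biology",))  # one step down the hierarchy
```

Because the tree is derived entirely from the classification held in the metadata, it can be
regenerated whenever records are added or reclassified.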




Conclusions

Overall, the RCUK categories present a good level of detail with regard to research outputs and could
provide the fine detail that will be needed for the C4D subject demonstrator. Ultimately, however,
these categories do not include all of the metadata elements that are covered by standards such as
MEDIN, DIF and CSMD, nor do they provide coverage for any of the gaps that have been identified
in other standards.


There are clear lessons to be learnt from them in terms of the strategy used. Of particular note is
the RCUK classification system, which will be incorporated into the C4D demonstrator.
The basis for suggesting the use of the Research Councils’ classification scheme is:

• It is cross-disciplinary, meaning the Arts are as well represented as the Sciences, etc.
• It is hierarchical; this potentially allows more intuitive associations to be made between
related research
• The use of an agreed classification scheme could enable improved browsing and subject
searching across datasets


As can be seen from the tables, there are significant differences between the RCUK and JACS systems.
JACS covers the academic subject areas, and its fit with respect to courses and academic subject
areas in terms of organisational units seems very good. The RCUK system, on the other hand, is more
focused on fields of research, disciplines and sub-disciplines, and has a significantly finer level of
resolution. The microbiology subject area we compared between JACS and RCUK shows that RCUK is
significantly more refined in terms of topics than JACS. While a coarser-grained system may be quite
suitable in situations where there are limited numbers of datasets to be administered, once the
number grows to the order of tens of thousands of datasets in a particular discipline such as biology,
further refinement will be useful in order to distinguish between them and to manage discovery and
retrieval more efficiently. Both systems benefit from a hierarchical approach, which will be very
useful for semantic-based retrieval: the retrieved results can be expanded by moving to summary
results of the parent category, or restricted by suggesting a number of subcategories.
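The expanding and restricting of results via parent categories and subcategories could, for
instance, be supported by simple counts over the category paths of the retrieved datasets. The sketch
below is an assumption about how this might be done rather than a description of the demonstrator; the
category names are invented and are not RCUK or JACS entries.

```python
from collections import Counter

# Hypothetical category paths of the datasets retrieved by a query.
HITS = [
    ("Biology", "Microbiology", "Virology"),
    ("Biology", "Microbiology", "Virology"),
    ("Biology", "Microbiology", "Bacteriology"),
    ("Biology", "Genetics"),
]


def suggest_subcategories(hits, category):
    """Count the hits under each immediate subcategory of `category`."""
    depth = len(category)
    counts = Counter()
    for path in hits:
        if path[:depth] == category and len(path) > depth:
            counts[path[:depth + 1]] += 1
    return counts


def parent_summary(hits, category):
    """Count the hits under the parent of `category` (the expanded result)."""
    parent = category[:-1]
    return sum(1 for path in hits if path[:len(parent)] == parent)


current = ("Biology", "Microbiology")
subcategories = suggest_subcategories(HITS, current)  # options for restricting
expanded_total = parent_summary(HITS, current)        # size of the expanded result
```

Showing the count next to each suggested subcategory lets the user judge in advance how far a given
choice would restrict the result set.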


Given the aim of C4D to manage metadata of datasets for a variety of disciplines, and in particular to
support their discovery through search-based retrieval or explorative browsing, the RCUK category
set seems to be the better fit, also with a view to generating a scalable solution.
Furthermore, because RCUK is already in wider circulation in research councils for use with grant
applications and research outputs, it would also provide a compatible solution that would potentially
allow data from different locations and of different types to be integrated at the point of
browsing.

5 C4D PILOT AND ONTOLOGY IMPLICATIONS

The C4D project focuses on the metadata of research datasets, and integrates this metadata with
that held on research projects and research outputs from different locations. The project is expected
to extend the use of CERIF into the research data management area. In order to demonstrate the
facility of the approach, a pilot demonstrator will be built and the approach verified at the three
partner HEIs, each within their own research administration infrastructure.


As can be seen from the brief stakeholder analysis presented in the deliverable D2.1 Metadata
Ontology, datasets are an important kind of resource. Enabling better research and accelerating the
rate of discoveries and assessing and demonstrating impact are therefore key considerations.



The aim is to develop a semantic interface to CERIF data stores which will provide the key features of
adding new records and the associated metadata markup, as well as supporting the discovery,
exploration and linking of the available resources.


The taxonomy will play a crucial role in structuring the available resources so as to make them
manageable from a storage perspective, as well as allowing them to be discovered in a meaningful
and intuitive way. Using a uniform classification system across the different entities (datasets and
their associated researchers, institutions, projects and publications) allows records to be correlated
and collated, and is thus also crucial when attempting to generate a mechanism by which records
from different repositories are to be made accessible in a transparent and seamless fashion.


By modelling the key concepts in the CERIF data store, an interface can be built that has the desired
functionality of enabling the non-invasive retrieval of records, also across data stores, and the
representation and navigation of links between entities so as to, for example, visualise related
records surrounding a particular dataset, such as other datasets, researchers, publications, projects,
etc.
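The navigation of links between entities can be illustrated with a small graph sketch. Everything in
it is assumed for the purpose of illustration: the identifiers are hypothetical, the links between
them are invented, and no claim is made about how CERIF link entities would actually be queried.

```python
from collections import defaultdict

# Hypothetical links between records (datasets, persons, projects,
# publications). Both the identifiers and the links are invented.
LINKS = [
    ("DS1", "PersA"),
    ("DS1", "Proj1"),
    ("DS1", "Pub1"),
    ("DS2", "PersA"),
    ("DS2", "Pub2"),
    ("Proj1", "PersB"),
]


def build_graph(links):
    """Store each link in both directions so it can be navigated either way."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related(graph, start, max_hops=1):
    """Return the records reachable from `start` within `max_hops` links."""
    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        # Expand one hop outwards, ignoring records already visited.
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return sorted(seen - {start})


graph = build_graph(LINKS)
direct = related(graph, "DS1")               # records linked directly to DS1
wider = related(graph, "DS1", max_hops=2)    # also records one further link away
```

Limiting the number of hops keeps the set of records surrounding a dataset small enough to be
displayed, while still reaching other datasets through shared researchers or projects.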

The envisaged ontology will need to cover several key aspects, including:

• Datasets
• Researchers
• Projects
• Publications



















Figure 1: Datasets and Related Entities Browsing


The use of a common classification scheme in combination with these key concepts will allow any of
these categories and existing records to be managed for the purpose of discovery and integration.
Thus the ontology will also need to contain the classification scheme as a topic hierarchy, to be used
to classify individual researchers, projects, datasets and publications, so that the activities in
particular subject areas become extractable and so that the proposed vision of being able to explore
activity surrounding particular datasets, as depicted in the graphical form proposed in deliverable
D2.1 Metadata Ontology and reproduced above, is achieved. The fact that the classification scheme is
also hierarchical is useful, as it allows a richer set of functionality to be generated than would be
possible from a flat dictionary.

[Figure 1 node labels: DS1, DS2, Pers A, Pers B, Proj 1, Proj 2, Pub1, Pub2, Pub3]