How Well Do You Know Your Data?


JATS-CON 2012

October 16, 2012


Faye Krawitz

Jennifer McAndrews

Richard O’Keeffe

Content Technology Group, AIP



How Well Do You Know Your Data?
Converting an Archive of Proprietary Markup Schemes to JATS: A Case Study




AIP at a glance

Founded in 1931

Umbrella organization for 10 physical science societies. Combined membership totals 165,500 scientists, engineers and educators (with some overlap)

One of the world's largest non-profit publishers of scientific information in physics

Home of the Physics Resources Center

Publish 24+ AIP, member, partner journals/magazines, three of which are co-published with other organizations, and one conference proceedings series

Mission: To inspire every Physical and Applied Scientist in the world to turn to AIP for the information and help that they need



The AIP Content Ecosystem



The AIP Content Collection


800,000 SGML/XML records encoded in:

AIP ISO 12083 “header” SGML DTD (1995-present)

AIP ISO 12083 “full-text” SGML DTD (1995-2005)

AIP “ISO-12083-informed” full-text XML DTD (2005-present)

How was it used?

XML the source for print/online PDFs

The source for HTML rendered on the AIP online platform

And it worked well… but the times, they were a-changing




What’s the problem…Why change?


AIP-centric!

XML overly specialized for specific AIP products


Required proprietary systems and support


Too many intermediary data transformations


Limited the adoption of new technology and standards


Too costly to maintain


Not the XML format of choice for data recipients




Redefining AIP’s future content strategy:
If you could have anything you want…

Recognition that the intellectual property is the premium asset

Mark up the data to maximize its value and enrichment potential

Keep current with industry standards

Better meet client expectations!

Plan for success

Streamlined production workflow

Reorganize units to execute a unified content strategy

It is not enough to realize the need to change; you must follow through and execute




C’mon…everybody does it!


Standardization 1: adopt industry-standard XML

Eliminate multiple formats and associated transformations

Enhanced data portability

Standardization 2: adopt XML technologies such as XSLT and Schematron

Minimize dependence on specialized applications and skill sets

Speak the same language as the STM community



(Not so) Big Surprise!

Journal Archiving and Interchange Tag Set



Build for Success: Communication



Make the plan known


Keep everyone informed and updated

Get “buy-in”

Ensure the whole organization understands the change in approach

Ensure the whole organization understands the end goal


Ensure the staff understands the important
role they play in the success




Build for Success: Ownership


Organize to succeed


Rethink and deploy an organization that most effectively achieves the goal

For AIP this meant…

Create a unified team following the overall strategy

Foster a definitive sense of ownership for the content as the “intellectual asset”


Develop a clear chain of content responsibility


Designate formal content “gatekeepers”




Build for Success: Infrastructure



Invest in an up-to-date content management system

Efficiently manage content, not have the product(s) manage the systems

Avoid unneeded workflow duplication

Avoid unwanted “end-around” content manipulation


Extensibility to adapt to future needs


Excellent versioning capabilities


Effective reporting tools




Now What?


Transform Decisions


Use XSLT


Create “mapping specification” for the following:


Transform AIP ISO 12083 “header” SGML DTD

Transform AIP “ISO-12083-informed” full-text XML DTD

On hold: AIP ISO 12083 “full-text” SGML DTD


Test and adapt based on results


Quality Control including Schematron


Document


Train staff and production partners



The Process


Document Analysis


Helpful aids


Existing documentation


Institutional memory


Devise tagging principles


Correct known ambiguities





Document Analysis


Identify:


Consistencies


Inconsistencies


Surprises


Evaluate tagging requirements


Create


Document Map (or “specification”)


Sample XML files as needed






Devised Tagging Principles


Strictly delineated element v. attribute


Defined AIP-specific usage of JATS

Treated <article-meta> as database-like

Avoided customized content models; reserved for later use

Reserved <x> markup for future use; use at transform as debugging tool

Reserved <named-content> for semantic enrichment markup




Creating the Document Map

Tagging Principles x (Existing documentation + Institutional Memory) = JATS
















Resulting Map (“spec”)

ELEMENT:

metanote
metanote/edcode

AIP TAGGING:

<metanote>Contributed by the Bioengineering Division of ASME for publication in the J<emph type="smallcap">OURNAL OF</emph> B<emph type="smallcap">IOMECHANICAL</emph> E<emph type="smallcap">NGINEERING</emph>. Manuscript received July 20, 2009; final manuscript received February 18, 2010; accepted manuscript posted March 1, 2010; published online June 18, 2010. Assoc. Editor: <techeditor status="associate">Ellen M. Arruda</techeditor>.</metanote>

JATS:

</article-meta>
<notes notes-type="metadata-note">
<p>Contributed by the Bioengineering Division of ASME for publication in the J<sc>OURNAL OF</sc> B<sc>IOMECHANICAL</sc> E<sc>NGINEERING</sc>. Manuscript received July 20, 2009; final manuscript received February 18, 2010; accepted manuscript posted March 1, 2010; published online June 18, 2010. Assoc. Editor: J. Shah.</p></notes>

Action:

1. Convert as <notes> with @notes-type="metadata-note"

2. <notes> tag is placed after </article-meta>

3. Suppress tag, keep contents of: metanote/edcode, metanote/symposium, metanote/contribgrp

4. UPDATE 02/21: wrap contents in <p> - this will not be in the source.

Info: Okay tags below are suppressed: meta-received | meta-accepted | meta-revised | meta-presented | meta-submit | meta-published | meta-posted.

***N/A Future JATS***
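
The metanote actions can be sketched in a few lines of Python. This is only an illustration, not AIP's XSLT: it assumes well-formed XML input, it assumes every child tag other than a small-caps emph is suppressed with its contents kept, and the function names and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def _append_text(dst, text):
    """Append text to dst, respecting ElementTree's text/tail model."""
    if not text:
        return
    if len(dst):  # dst already has children: text belongs on the last child's tail
        dst[-1].tail = (dst[-1].tail or "") + text
    else:
        dst.text = (dst.text or "") + text


def _copy_mixed(src, dst):
    """Copy mixed content from src into dst, flattening suppressed tags."""
    _append_text(dst, src.text)
    for child in src:
        if child.tag == "emph" and child.get("type") == "smallcap":
            # <emph type="smallcap"> becomes <sc>
            _copy_mixed(child, ET.SubElement(dst, "sc"))
        else:
            # Assumption: all other child tags are suppressed, contents kept.
            _copy_mixed(child, dst)
        _append_text(dst, child.tail)


def metanote_to_notes(metanote):
    """Actions 1 and 4: emit <notes notes-type="metadata-note"><p>...</p></notes>."""
    notes = ET.Element("notes", {"notes-type": "metadata-note"})
    _copy_mixed(metanote, ET.SubElement(notes, "p"))
    return notes


# Hypothetical, shortened version of the sample on the slide.
source = ET.fromstring(
    '<metanote>Contributed for publication in the J'
    '<emph type="smallcap">OURNAL</emph>. Assoc. Editor: '
    '<techeditor status="associate">Ellen M. Arruda</techeditor>.</metanote>'
)
print(ET.tostring(metanote_to_notes(source), encoding="unicode"))
```

Placing the resulting <notes> after </article-meta> (action 2) is left out, since it depends on how the surrounding front matter is built.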



Corrected Known Ambiguities


Before        After

<extra1>      <suffix>
<extra2>      <role>
<extra3>      <degree>
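
A rename like this is a one-to-one mapping, so it can be expressed as a lookup table. The sketch below uses Python's standard library rather than XSLT; the <author> and <surname> wrapper elements in the sample are hypothetical and only there to give the renamed tags a context.

```python
import xml.etree.ElementTree as ET

# Mapping taken from the Before/After columns above.
RENAMES = {"extra1": "suffix", "extra2": "role", "extra3": "degree"}


def rename_ambiguous(root):
    """Rename every ambiguous tag in place; return how many were changed."""
    changed = 0
    for element in root.iter():
        if element.tag in RENAMES:
            element.tag = RENAMES[element.tag]
            changed += 1
    return changed


# Hypothetical sample record.
author = ET.fromstring(
    "<author><surname>Smith</surname>"
    "<extra1>Jr.</extra1><extra3>Ph.D.</extra3></author>"
)
count = rename_ambiguous(author)
print(count, ET.tostring(author, encoding="unicode"))
```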




Expected Trouble Spots



Generated text


Style variation issues

Multi-purpose tags


Multimedia


Time



Generated Text

The ability to take a tag like <ack> and output the title “ACKNOWLEDGMENTS” is the closest thing we have to magic.
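
If the converted XML is meant to carry such titles explicitly (an assumption; the slides do not say which way AIP went), the generated text has to be written into the file during conversion. A minimal Python sketch, with a hypothetical one-entry lookup table and function name:

```python
import xml.etree.ElementTree as ET

# Hypothetical table of tags whose titles were previously generated at display time.
GENERATED_TITLES = {"ack": "ACKNOWLEDGMENTS"}


def add_generated_titles(root):
    """Insert an explicit <title> into each listed element that lacks one."""
    # Take a snapshot first so the tree is not modified while being iterated.
    for element in list(root.iter()):
        title_text = GENERATED_TITLES.get(element.tag)
        if title_text and element.find("title") is None:
            title = ET.Element("title")
            title.text = title_text
            element.insert(0, title)
    return root


back = ET.fromstring("<back><ack><p>We thank the reviewers.</p></ack></back>")
print(ET.tostring(add_generated_titles(back), encoding="unicode"))
```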



Style Variation Issues


INTRODUCTION

INTRODUCTION


I. INTRODUCTION

1. Introduction

Introduction




Multipurpose tags

Three distinct rules for handling one SGML element, all within References:

1. When <othinfo> is a sibling of <refitem>:

a. <othinfo>: remove tag, retain PCDATA

b. Retain content/punctuation and trailing space

c. MOVE retained PCDATA to before </mixed-citation> of preceding <mixed-citation>

2. When back/citation/ref/othinfo: strip <othinfo>, retain PCDATA

3. NOTE: nesting of <othinfo> requires:

<citation id="r#"><ref><biother><othinfo>…<othinfo><dformula>

<ref><label>#. </label><note><p>….<disp-formula>…
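
Rule 1 can be sketched as follows. This is an illustration in Python rather than the XSLT used in the project, and it makes two assumptions: <refitem> has already been mapped to <mixed-citation>, and any markup nested inside <othinfo> can be flattened to plain text. The function names and sample reference are hypothetical.

```python
import xml.etree.ElementTree as ET


def _append_text(dst, text):
    """Append text at the very end of dst's content (before its closing tag)."""
    if not text:
        return
    if len(dst):
        dst[-1].tail = (dst[-1].tail or "") + text
    else:
        dst.text = (dst.text or "") + text


def fold_othinfo(parent):
    """Move the text of each <othinfo> into the preceding <mixed-citation>."""
    for child in list(parent):
        fold_othinfo(child)
    previous = None
    for child in list(parent):
        if (child.tag == "othinfo" and previous is not None
                and previous.tag == "mixed-citation"):
            # Rules 1a/1b: drop the tag, keep text, punctuation and spacing.
            _append_text(previous, "".join(child.itertext()))
            # Keep whatever followed </othinfo> in the source.
            previous.tail = (previous.tail or "") + (child.tail or "")
            parent.remove(child)  # rule 1c: text now sits before </mixed-citation>
        else:
            previous = child


ref = ET.fromstring(
    "<ref><mixed-citation>A. Author, <source>J. Phys.</source> 1, 2 (2000)."
    "</mixed-citation><othinfo> See also erratum.</othinfo></ref>"
)
fold_othinfo(ref)
print(ET.tostring(ref, encoding="unicode"))
```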




Multimedia

1. <epaps>See supplementary material at <url href="http://dx.doi.org/10.1063/1.3475476">http://dx.doi.org/10.1063/1.3475476</url> <epapsid display="no" type="multimedia">E-JAPIAU-108-032016</epapsid> for essential multimedia.</epaps>

2. <media id="v1" status="essential">
<media-object doi="10.1063/1.3674301.1" file-name="006029jcpv1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>

3. <media id="v1" status="essential">
<media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original"><mediaref rids="v1" show-link="yes"></mediaref>

4. <media id="v1" status="essential">
<media-object doi="10.1063/1.3674301.1" file-name="v1.mpg" id="mm1" mime-type="video/mpeg" mm-type="video" version="original">
<mediaref rids="v1" show-link="yes"></mediaref></media-object></media>
<media id="v2" status="essential">
<media-object doi="10.1063/1.3674301.2" file-name="v2.mpg" id="mm2" mime-type="video/mpeg" mm-type="video" version="original">
<mediaref rids="v2" show-link="yes"></mediaref></media-object></media>
<media id="v3" status="essential">
<media-object doi="10.1063/1.3674301.3" file-name="v3.mpg" id="mm3" mime-type="video/mpeg" mm-type="video" version="original">
<mediaref rids="v3" show-link="yes"></mediaref></media-object></media>.




Time



Unexpected Trouble Spots:

Language









Language

Deceptively simple example:



Before: pacs

After: front/spin/docanal/pacs




Unexpected Trouble Spots:

Nasty Surprises



Nasty Surprises

Expected tagging:

<p content-type="leadpara">Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>

Displays online as:

Lead Paragraph

Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system

Actual tagging:

<p>Weak signal detection possesses the potential application in many fields. By utilizing the sensitivity of the nonlinear system ...</p>

No online display





QUALITY CONTROL AND TESTING



Prerequisite training


Content and tagging checks

Incorporating Schematron


Online displays





QUALITY CONTROL AND TESTING

Prerequisite



Staff Training


NLM/JATS DTD


XPATH


XSLT


Schematron




QUALITY CONTROL AND TESTING

Content and tagging checks


Step 1


Preliminary Testing:



Performed while XSLT was in progress

Analyst checked completed blocks of XSLT code and confirmed programmers' understanding of instructions

Daily meetings held to discuss new findings or clarifications of instructions

Trouble spot detected: specification document needed to be rewritten using XPATH terminology.








Step 2

Batch Processing

Performed when XSLT was complete.

Converted and parsed approximately 200 files

Investigated hidden problems and determined if an XSLT modification or manual fix was the best course of action to take



Step 3



Group Testing



Performed when converted files were valid


Ran approximately 200 files from various journals with assorted article types

Entire group checked same sample of files

Checked for dropped text


Ran Schematron
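
One simple way to automate a dropped-text check is to compare the words in the source file with the words in the converted file. The sketch below is an assumption about how such a check could look, not a description of AIP's tooling; the function names and sample records are hypothetical, and text that is deliberately added or removed by the transform would need to be whitelisted.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def words(xml_text):
    """Count whitespace-separated words in the text content, ignoring tags."""
    root = ET.fromstring(xml_text)
    return Counter("".join(root.itertext()).split())


def dropped_words(source_xml, converted_xml):
    """Return words that occur more often in the source than in the output."""
    missing = words(source_xml) - words(converted_xml)
    return sorted(missing.elements())


# Hypothetical source record and two candidate conversions.
source = ("<metanote>Received <meta-received>July 20, 2009</meta-received>; "
          "accepted.</metanote>")
good = "<notes><p>Received July 20, 2009; accepted.</p></notes>"
bad = "<notes><p>Received ; accepted.</p></notes>"

print(dropped_words(source, good))
print(dropped_words(source, bad))
```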





Step 4

Bulk Processing

Performed when all files were approved from the group testing

Entire corpus of content run, with remaining errors resulting from bad source outliers

The XSLT transformed the files at an accuracy rate of over 99%, but with 800,000 records even a failure rate under 1% (up to about 8,000 files) left a large number to be inspected

Where applicable, source or XSLT was fixed and files rerun




Step 5


Final Cleanup


Analyze flagged data.

Investigated tags mapped in the XSLT to <x> or <strike> because the
source tags had known problems.



QUALITY CONTROL AND TESTING

Incorporating Schematron




Central piece in our QC process, derived from our pre-existing proprietary QC programs

List of checks or assertions written in XPATH language

Tracks ERRORS and WARNINGS specific to our data


Done in parallel while XSLT was being written



JATS MARKUP with SCHEMATRON ERROR DETECTED

<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>

JATS MARKUP CORRECTED

<kwd-group kwd-group-type="pacs-codes">
  <compound-kwd>
    <compound-kwd-part content-type="code">8440-x</compound-kwd-part>
    <compound-kwd-part content-type="value">Radiowave …</compound-kwd-part>
  </compound-kwd>
  <compound-kwd>
    <compound-kwd-part content-type="code">8440Ba</compound-kwd-part>
    <compound-kwd-part content-type="value">Antennas:. …</compound-kwd-part>
  </compound-kwd>
</kwd-group>




SCHEMATRON RULE

<rule id="ERROR_COMPOUND_KEYWORD" context="compound-kwd">
  <assert role="ERROR_COMPOUND_KEYWORD" test="count(compound-kwd-part) = 2">
    [ERROR] A compound-kwd must have two compound-kwd-part tags
  </assert>
</rule>

<rule id="ERROR_COMPOUND_KEYWORD_PART" context="compound-kwd-part">
  <assert role="ERROR_COMPOUND_KEYWORD_PART" test="@content-type='code' or @content-type='value'">
    [ERROR] Invalid @content-type used for compound-kwd-part - allowable values are: code and value
  </assert>
</rule>
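
For readers without a Schematron processor at hand, the logic of these two rules can be mirrored with Python's standard library. This is only an equivalent sketch for illustration; the function name is hypothetical and the sample markup is shortened from the keyword example in this deck.

```python
import xml.etree.ElementTree as ET

ALLOWED_CONTENT_TYPES = {"code", "value"}


def check_compound_keywords(root):
    """Return error messages mirroring the two Schematron assertions."""
    errors = []
    # Rule 1: count(compound-kwd-part) = 2 for every compound-kwd.
    for compound in root.iter("compound-kwd"):
        if len(compound.findall("compound-kwd-part")) != 2:
            errors.append(
                "[ERROR] A compound-kwd must have two compound-kwd-part tags")
    # Rule 2: @content-type must be 'code' or 'value'.
    for part in root.iter("compound-kwd-part"):
        if part.get("content-type") not in ALLOWED_CONTENT_TYPES:
            errors.append(
                "[ERROR] Invalid @content-type used for compound-kwd-part")
    return errors


PART = '<compound-kwd-part content-type="{0}">{1}</compound-kwd-part>'
pair_1 = PART.format("code", "8440-x") + PART.format("value", "Radiowave")
pair_2 = PART.format("code", "8440Ba") + PART.format("value", "Antennas")

# Four parts in one compound-kwd (the error case) versus two compound-kwd pairs.
flagged = ET.fromstring(
    "<kwd-group><compound-kwd>" + pair_1 + pair_2 + "</compound-kwd></kwd-group>")
corrected = ET.fromstring(
    "<kwd-group><compound-kwd>" + pair_1 + "</compound-kwd>"
    "<compound-kwd>" + pair_2 + "</compound-kwd></kwd-group>")

print(check_compound_keywords(flagged))
print(check_compound_keywords(corrected))
```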




QUALITY CONTROL AND TESTING

Online Displays


Assumptions at this point are: files are valid and Schematron runs clean

Testing was expanded to the online publishing group and random testers throughout the organization

Errors were found at this point that are more apparent in viewing

Great way to confirm that business rules are being followed




LESSONS LEARNED & GENERAL CONCLUSIONS

Don’t go it alone: follow industry best practices and standards

Set yourself up for success

It is impossible to overstate the importance of document analysis

Use analysis as an opportunity to correct known ambiguities

Recognize the difference between bad and incorrect data

Create a detailed document map

XPATH training is valuable

Use Schematron as a central piece of the QC process


Work as a team




We chose to use pre-existing JATS DTD elements and avoid any JATS module customization. The stock NISO JATS was more than sufficient to accommodate AIP’s tagging needs. We were able to apply our tagging principles and remain true to our business rules.

We have achieved the XML quality we were aiming for.





Questions?



