The PIMMS project and Natural Language Processing for Climate Science




Extending the Chemical Tagger natural language processing tool with
climate science controlled vocabularies


Charlotte Pascoe, Hannah Barjat, Peter Murray-Rust and Gerry Devine

June 9th 2012, Open Repositories 2012

Portable Infrastructure for the Metafor Metadata System

http://proj.badc.rl.ac.uk/pimms/

Software, Activity, Data, Grids, Quality, Shared, ISO

Some concepts are shared.

We can record the quality of things.

We reuse various ISO classes.

We can talk about DataObjects collected together in any number of ways, stored in a particular medium.

We can talk about hierarchical ModelComponents with ModelProperties, some of which can be coupled together.

We can talk about Simulations run in support of Experiments. Experiments consist of Requirements; Simulations conform to Requirements.

A particular Activity uses a particular SoftwareComponent.

We can define a GridSpec or some other geometry.

Common Information Model
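The relationships listed on this slide can be sketched as plain data classes. This is only an illustration of the concepts named above (ModelComponents with ModelProperties, Experiments with Requirements, Simulations that conform to Requirements); the class layout, field names and example values are assumptions and do not reproduce the actual CIM schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class ModelProperty:
    name: str
    value: Optional[str] = None  # None means "not recorded"


@dataclass
class ModelComponent:
    name: str
    properties: List[ModelProperty] = field(default_factory=list)
    children: List["ModelComponent"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "ModelComponent"]]:
        """Yield (depth, component) pairs for this component and its sub-components."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


@dataclass
class Requirement:
    name: str


@dataclass
class Experiment:
    name: str
    requirements: List[Requirement] = field(default_factory=list)


@dataclass
class Simulation:
    name: str
    supports: Experiment
    conforms_to: List[Requirement] = field(default_factory=list)

    def unmet_requirements(self) -> List[Requirement]:
        """Requirements of the supported experiment that this run does not claim to meet."""
        return [r for r in self.supports.requirements if r not in self.conforms_to]


# Hypothetical example values, for illustration only.
longwave = ModelComponent("Longwave", [ModelProperty("SchemeType", "K-correlated")])
radiation = ModelComponent("Radiation", children=[longwave])
forcing = Requirement("hypothetical forcing requirement")
duration = Requirement("hypothetical duration requirement")
experiment = Experiment("hypothetical experiment", [forcing, duration])
run = Simulation("hypothetical run", supports=experiment, conforms_to=[forcing])

print([(depth, c.name) for depth, c in radiation.walk()])
print([r.name for r in run.unmet_requirements()])
```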


Mind maps are used to capture information requirements from domain experts and build a controlled vocabulary.

Mind Maps

<component name="Radiation">
  <definition status="missing">Definition of component type Radiation required</definition>
  <parameter name="RadiativeTimeStep" choice="keyboard">
    <definition status="missing">Definition of property name RadiativeTimeStep required</definition>
    <value format="numerical" name="time step" units="time units"/>
  </parameter>
  <parametergroup name="Longwave">
    <parameter name="SchemeType" choice="XOR">
      <definition status="missing">Definition of property name SchemeType required</definition>
      <value name="Wide-band model"/>
      <value name="Wide-band (Morcrette)"/>
      <value name="K-correlated"/>
      <value name="K-correlated (RRTM)"/>
      <value name="other"/>
    </parameter>
    <parameter name="Method" choice="XOR">
      <definition status="missing">Definition of property name Method required</definition>
      <value name="Two stream"/>
      <value name="Layer interaction"/>
      <value name="other"/>
    </parameter>
    <parameter name="NumberOfSpectralIntervals" choice="keyboard">
      <definition status="missing">Definition of property name NumberOfSpectralIntervals required</definition>
      <value format="numerical" name=""/>
    </parameter>
  </parametergroup>

The Python parser processes the XML files generated by the mind maps.

Python Parser
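As a rough idea of what such a parsing step might look like, the sketch below reads a shortened copy of the Radiation mind-map XML with the standard library and lists each parameter with its allowed values. It is not the PIMMS parser: the function names and the output layout are hypothetical, and a closing </component> tag has been added so that the excerpt is well-formed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the mind-map export shown earlier; the closing
# </component> tag is added here so the excerpt parses on its own.
MIND_MAP_XML = """
<component name="Radiation">
  <parameter name="RadiativeTimeStep" choice="keyboard">
    <value format="numerical" name="time step" units="time units"/>
  </parameter>
  <parametergroup name="Longwave">
    <parameter name="SchemeType" choice="XOR">
      <value name="Wide-band model"/>
      <value name="K-correlated"/>
      <value name="other"/>
    </parameter>
  </parametergroup>
</component>
"""


def _read_parameter(param, group):
    """Collect one <parameter> element into a plain dictionary."""
    return {
        "group": group,
        "name": param.get("name"),
        "choice": param.get("choice"),
        "values": [v.get("name") for v in param.findall("value")],
    }


def extract_vocabulary(xml_text):
    """Return the component name and its parameters (hypothetical output layout)."""
    root = ET.fromstring(xml_text.strip())
    entries = [_read_parameter(p, None) for p in root.findall("parameter")]
    for group in root.findall("parametergroup"):
        for p in group.findall("parameter"):
            entries.append(_read_parameter(p, group.get("name")))
    return {"component": root.get("name"), "parameters": entries}


vocab = extract_vocabulary(MIND_MAP_XML)
for entry in vocab["parameters"]:
    print(entry["group"], entry["name"], entry["choice"], entry["values"])
```

Parameters with choice="XOR" carry a closed list of values, while choice="keyboard" parameters expect free numerical input, so a web form generator could presumably branch on that attribute.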

http://q.cmip5.ceda.ac.uk/

Web Forms

Web forms generate content in CIM XML format.

http://zonda5.badc.rl.ac.uk/site/public/tools/viewer/integrated/1.5/en/73c59aba-dc6d-11df-a442-00163e9152a5/1

CIM Viewer

http://chemicaltagger.ch.cam.ac.uk/

ChemicalTagger is an open-source tool that uses OSCAR4 and NLP techniques for tagging and parsing experimental sections in the chemistry literature.

Chemical Tagger


Java project developed by the Peter Murray-Rust group, Cambridge. Online demo: http://chemicaltagger.ch.cam.ac.uk/

Adapted for use with ACP Abstracts (Lezan Hawizy and Hannah Barjat).

Modification by use of dictionaries and changes to grammar.

First use case outside of laboratory chemistry.

Still with a significant chemistry component.

Wider physical science.




Chemical Tagger


https://bitbucket.org/wwmm/chemicaltagger & https://bitbucket.org/wwmm/acpgeo





Open Source NLP tool for processing chemical text

Combines Chemical Entity Recognition (OSCAR) with NLP techniques

Extendible and Reconfigurable Taggers and Parsers generated using ANTLR (ANother Tool for Language Recognition)


To extend Chemical Tagger to be more suited to climate modelling.

Specifically:

Palaeoclimate modelling and how the process of text mining might differ from development of a controlled vocabulary.

Highlighting of text for comparison with CIM documents.

Initially only using XML Abstracts, e.g. from EGU's Geoscientific Model Development and Climate of the Past.

Brief look at PDF to Text.

Chemical Tagger & PIMMS


Time periods and climatic events

Includes named Ages, Epochs, Eras etc. (including all those in a mind map produced for the PIMMS project at Bristol).

Context of proper nouns, e.g. with words such as 'period', 'era', 'epoch'.

Numbers with appropriate units, e.g. Mya, yr BP.

Likely date numbers, e.g. 1750 AD.

Acronyms: known, e.g. 'LGM' [ACRONYMS in context have not been investigated].

Related adjectives, e.g. seasonal, decadal, glacial, interglacial, stadial, interstadial, maximum, minimum where used as proper nouns.
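The kinds of time expression listed above could be approximated with a few regular expressions. The sketch below is not the ChemicalTagger/ACPgeo grammar (which is generated with ANTLR); the pattern names, the three-entry period list and the sample sentence are assumptions chosen only to illustrate the idea.

```python
import re

# Tiny illustrative subset of period names; a real list would be far longer.
PERIOD_NAMES = ["Holocene", "Pleistocene", "Last Glacial Maximum"]

# Units taken from the examples in the slides (Mya, yr BP, kyr BP).
UNIT_PATTERN = r"(?:Mya|kyr BP|yr BP)"

PATTERNS = [
    # Named period, optionally followed by a context word such as "epoch".
    ("named_period", re.compile(
        r"\b(?:" + "|".join(map(re.escape, PERIOD_NAMES)) + r")\b"
        r"(?:\s+(?:period|era|epoch))?")),
    # Number followed by an age unit.
    ("age_with_unit", re.compile(r"\b\d+(?:\.\d+)?\s*" + UNIT_PATTERN + r"\b")),
    # Likely calendar date: 3-4 digits followed by AD or BC.
    ("calendar_year", re.compile(r"\b\d{3,4}\s*(?:AD|BC)\b")),
]


def tag_time_phrases(text):
    """Return (label, phrase) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    return [(label, phrase) for _, label, phrase in sorted(hits)]


sample = ("Pollen records from the Holocene epoch show a cooling at 8 kyr BP, "
          "well before 1750 AD (Otto et al. 2009b).")
print(tag_time_phrases(sample))
```

Requiring a unit or an AD/BC suffix is one simple way to keep the publication year inside a citation such as "(Otto et al. 2009b)" from being tagged as a date relevant to the study.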



Palaeoclimate Models

Can guess model names from context

e.g. proper noun or acronym followed by model

e.g. reconstruction / simulation with XXX

Can develop/use glossary of model names.
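The two context rules and the glossary fallback can be sketched as follows. The regular expressions, the stop-word list and the one-entry glossary are assumptions made for illustration; they are not taken from the ACPgeo grammar.

```python
import re

# A capitalised word or acronym, optionally hyphenated (e.g. ECBILT-CLIO).
CANDIDATE = r"\b([A-Z][A-Za-z0-9]+(?:-[A-Z0-9][A-Za-z0-9]*)*)"

RULES = [
    # Rule 1: proper noun or acronym followed by "model".
    re.compile(CANDIDATE + r"\s+model\b"),
    # Rule 2: "reconstruction / simulation with XXX".
    re.compile(r"\b(?:reconstructions?|simulations?)\s+with\s+(?:the\s+)?" + CANDIDATE),
]

# Capitalised sentence openers that should not count as names (illustrative).
STOPWORDS = {"The", "This", "Each", "Our"}

# Hypothetical glossary; a real one would be curated by domain experts.
KNOWN_MODELS = {"MIROC4"}


def guess_model_names(text, glossary=KNOWN_MODELS):
    """Return candidate model names: context rules first, then glossary hits."""
    found = []
    for rule in RULES:
        for match in rule.finditer(text):
            name = match.group(1)
            if name not in STOPWORDS and name not in found:
                found.append(name)
    for name in sorted(glossary):
        if re.search(r"\b" + re.escape(name) + r"\b", text) and name not in found:
            found.append(name)
    return found


sample = ("Vegetation was reconstructed using the REVEALS model. "
          "The model was compared to a simulation with ECBILT-CLIO "
          "and to MIROC4 output.")
print(guess_model_names(sample))
```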



Palaeoclimate Acronyms

Time periods and models.

Theories, techniques, physical and chemical parameters?

Can develop/use glossary of acronyms

Problem area: often not unique even within subject.

Paleoclimate Language



Quick compilation of proper nouns used for time periods (primarily from Wikipedia) contains 185 words.

Use of these words together with adjectives / dates / details of events would produce a very large number of phrases.

Controlled Vocabulary from Bristol contains around 24 of these.

Use of these words together with other proper nouns / adjectives / dates gives only 44 phrases within the Bristol CV.

Map natural language to CV?

Straightforward for most dates?

Understanding of context important

Does context refer to main emphasis of paper?

Is an event/time period described unambiguously? e.g. “Last Glacial



Natural Language vs CV
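One way to picture the mapping question is a lookup that distinguishes exact, synonym, partial, ambiguous and unmapped phrases. The three CV terms and the single synonym below are illustrative stand-ins, not the Bristol controlled vocabulary, and the matching strategy is only an assumption.

```python
import re

# Illustrative stand-in for a controlled vocabulary (not the Bristol CV).
CV_TERMS = ["Holocene", "Last Glacial Maximum", "Last Glacial Period"]

# Hypothetical synonym table keyed by normalised text.
SYNONYMS = {"lgm": "Last Glacial Maximum"}


def normalise(phrase):
    """Lower-case and collapse whitespace so spelling variants compare equal."""
    return re.sub(r"\s+", " ", phrase.strip().lower())


def map_to_cv(phrase):
    """Return (cv_term_or_None, kind_of_match) for a natural-language phrase."""
    key = normalise(phrase)
    exact = {normalise(term): term for term in CV_TERMS}
    if key in exact:
        return exact[key], "exact"
    if key in SYNONYMS:
        return SYNONYMS[key], "synonym"
    candidates = [term for term in CV_TERMS if normalise(term).startswith(key)]
    if len(candidates) == 1:
        return candidates[0], "partial"
    if candidates:
        return None, "ambiguous"
    return None, "unmapped"


for phrase in ["holocene", "LGM", "Last Glacial", "Eocene"]:
    print(phrase, "->", map_to_cv(phrase))
```

A truncated phrase such as "Last Glacial" is reported as ambiguous rather than silently assigned to one term, which mirrors the concern about whether an event or time period is described unambiguously.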

| Tag / Tags | Example | Comment |
|---|---|---|
| <timePhrase> <PALAEOTIME> | (i) Holocene, (ii) 8 kyr BP (iii) | |
| <referencePhrase> | (i) (Otto et al. 2009b) (ii) Giraudeau et al. 2000 | Important to distinguish year pattern from dates relevant to the study. |
| <locationPhrase> | (i) around Lake Kotokel, (ii) over Tibetan Plateau | False positives: e.g. “from Sphagnum” |
| <LOCATION> | (i) 52°47´ N, 108°07´ E, 458 m a.s.l (ii) London. | Cannot currently do degrees from pdf-text. |
| <TempPhrase> | | ‘warm’ and ‘cool’: verbs in synthetic chem unlike env. chem. |

Preliminary Results


Preliminary Results (from 68 files)

| Tag / Tags | Example | Numbers found |
|---|---|---|
| <CAMPAIGN> | (i) PMIP, (ii) PANASH | Less relevant here than to ACP in general |
| <MODEL> | (i) REVEALS model, (ii) ECBILT-CLIO intermediate complexity climate model | |
| <acronymPhrase> | (i) Modern Analogues Technique ( MAT ) (ii) REVEALS ( Regional Estimates of VEgetation Abundance from Large Sites ) | May pick up campaigns / models where phrases above have failed. |
| <QUANTITY> | (i) 10 ppm (ii) 0.53 mm/day | units dictionary could be more extensive |
| <MOLECULE> | (i) CO2, (ii) calcium carbonate | Many false positives as what chemical tagger was designed for. |
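The first <acronymPhrase> example in the table, a long form followed by its acronym in parentheses, can be approximated by checking that the acronym's letters match the initials of the preceding words. This is a simplified sketch with hypothetical names, not the ChemicalTagger rule; it does not handle the reversed form (acronym first, as in the REVEALS example) or acronyms built from letters inside words.

```python
import re

# Up to six capitalised words followed by a parenthesised acronym.
LONG_THEN_ACRONYM = re.compile(
    r"\b((?:[A-Z][\w-]*\s+){1,6})\(\s*([A-Z][A-Za-z0-9-]{1,9})\s*\)")


def find_acronyms(text):
    """Return {acronym: long form} where the initials of the long form match."""
    pairs = {}
    for match in LONG_THEN_ACRONYM.finditer(text):
        words = match.group(1).split()
        acronym = match.group(2)
        tail = words[-len(acronym):]  # as many words as the acronym has letters
        initials = "".join(word[0] for word in tail).upper()
        if len(tail) == len(acronym) and initials == acronym.upper():
            pairs[acronym] = " ".join(tail)
    return pairs


sample = ("We applied the Modern Analogues Technique ( MAT ) to pollen data "
          "from the Last Glacial Maximum (LGM).")
print(find_acronyms(sample))
```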



XML rendered with CSS

Chemical Tagger Rendering of PALEOTIME

http://www.clim-past.net/2/205/2006/cp-2-205-2006.html

http://www.geosci-model-dev.net/4/1035/2011/gmd-4-1035-2011.html

GMD Journal Article


The acronym / name MIROC4 is not explained, so reproduce the sentence.

The description is just the first few sentences after the appearance of <MODEL>.

CIM Document Viewer

Makes use of existing chemical tagging.

CIM Document Viewer

http://zonda5.badc.rl.ac.uk/site/public/repository


Number of spectral intervals was not found! No place for “not found”.

CIM Document Viewer

http://zonda5.badc.rl.ac.uk/site/public/repository


Unless the paper is specifically about the model, we are unlikely to find much Metafor-type CV in the abstract.

Look at experimental / methods sections

model name

model resolution

model schemes

Problem with PDF -> text.

Only certain elements are easy to extract (e.g. resolution)

Climate Models


General Constraints
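Resolution is a plausible candidate for simple pattern matching because it is usually stated with numbers next to a small set of expected words. The patterns, their names and the sample sentence below are assumptions for illustration and are not part of ACPgeo.

```python
import re

# Hypothetical patterns built around expected resolution vocabulary.
RESOLUTION_PATTERNS = {
    # e.g. "128 x 64 grid"
    "grid_points": re.compile(r"\b\d+\s*x\s*\d+\s+grid\b"),
    # e.g. "19 vertical levels" or "40 layers"
    "vertical_levels": re.compile(r"\b\d+\s+(?:vertical\s+)?(?:levels|layers)\b"),
    # e.g. spectral truncation "T42"
    "spectral_truncation": re.compile(r"\bT\d{2,3}\b"),
}


def find_resolution(text):
    """Return {pattern name: [matched phrases]} for every pattern that matched."""
    results = {}
    for label, pattern in RESOLUTION_PATTERNS.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            results[label] = matches
    return results


sample = ("The atmosphere runs at T42 resolution on a 128 x 64 grid "
          "with 19 vertical levels.")
print(find_resolution(sample))
```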



Add a few more phrases, e.g. specific phrases to look for model resolution, using expected vocabulary (e.g. grid, levels, resolution, directions etc).

Refine output of ACPgeo to look for specific CV terms.

Try to put CV terms in context:

Look for proximity of CV terms to other phrases:

Within phrase; within sentence or within a number of sentences

Refine ACPgeo Output
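The proximity idea can be sketched as a sentence-distance measure between two terms: 0 means the same sentence, 1 means adjacent sentences, and so on. The sentence splitter here is deliberately naive, and the function names and sample text are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_distance(text, term_a, term_b):
    """Smallest number of sentences separating two terms, or None if either is absent."""
    sentences = [s.lower() for s in split_sentences(text)]
    a_positions = [i for i, s in enumerate(sentences) if term_a.lower() in s]
    b_positions = [i for i, s in enumerate(sentences) if term_b.lower() in s]
    if not a_positions or not b_positions:
        return None
    return min(abs(i - j) for i in a_positions for j in b_positions)


sample = ("The model uses a two stream method. "
          "The longwave scheme is K-correlated. "
          "Results are shown for the Holocene.")
print(sentence_distance(sample, "longwave", "K-correlated"))
print(sentence_distance(sample, "two stream", "K-correlated"))
```

A threshold on this distance could then decide whether a CV term is treated as describing a nearby phrase.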


Chemical Tagger was designed to be used primarily with chemistry.

Unsurprising that there is a tendency to assign acronyms, hyphenated words, and words with common chemical endings as molecules.

It is possible to filter some of these wrongly assigned words by probability.

There are still conflicts, e.g. C3 and C4 could refer to hydrocarbons or plants.

Extensive testing and modifying / machine learning might reduce these.

Better to get right first time if important!

<MOLECULE>
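Filtering by probability might look like the sketch below, which works on (text, confidence) pairs. It does not call ChemicalTagger or OSCAR: the confidence scores, the threshold, and the exclusion and ambiguity lists are all invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    text: str
    confidence: float  # hypothetical score in [0, 1] from an upstream tagger


# Illustrative lists; real ones would come from domain glossaries.
KNOWN_NON_CHEMICAL = {"PMIP", "LGM"}  # campaigns, time periods
AMBIGUOUS = {"C3", "C4"}              # hydrocarbons or plant types


def filter_molecules(candidates: List[Candidate],
                     threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split candidates into (accepted molecules, terms needing manual review)."""
    accepted, review = [], []
    for candidate in candidates:
        if candidate.text in KNOWN_NON_CHEMICAL or candidate.confidence < threshold:
            continue  # drop likely false positives
        if candidate.text in AMBIGUOUS:
            review.append(candidate.text)
        else:
            accepted.append(candidate.text)
    return accepted, review


# Scores below are made up for the example.
tagged = [
    Candidate("CO2", 0.98),
    Candidate("calcium carbonate", 0.91),
    Candidate("PMIP", 0.62),
    Candidate("ECBILT-CLIO", 0.21),
    Candidate("C4", 0.75),
]
print(filter_molecules(tagged))
```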

http://proj.badc.rl.ac.uk/pimms/blog/

CIM was designed to be populated by modellers with the (probably oversimplistic) assumption
that if something isn't in the CIM document then it either isn't in the model or isn't relevant. But
CIM documents created by harvesting information from papers will naturally not cover
everything about a model, so missing info doesn't mean that those things weren't
included or aren't relevant.


PIMMS will need to describe different protocols for interpreting CIM documents depending on
how they were created, but we will also want to ensure that CIM accounts for missing data
more intelligently in future releases.
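One way to express such an interpretation protocol is to make the meaning of a missing value depend on how the document was created. The enum names and the interpret function below are hypothetical and are not part of CIM.

```python
from enum import Enum
from typing import Optional


class Provenance(Enum):
    MODELLER = "entered by a modeller"
    HARVESTED = "harvested from a paper"


class Status(Enum):
    PRESENT = "value recorded"
    ABSENT = "assumed not in the model or not relevant"
    NOT_FOUND = "not mentioned in the source text"


def interpret(value: Optional[str], provenance: Provenance) -> Status:
    """Interpret a property value according to how the document was created."""
    if value is not None:
        return Status.PRESENT
    if provenance is Provenance.MODELLER:
        # Original CIM assumption: silence means absent or irrelevant.
        return Status.ABSENT
    # Harvested documents are incomplete by nature: silence means unknown.
    return Status.NOT_FOUND


print(interpret(None, Provenance.MODELLER))
print(interpret(None, Provenance.HARVESTED))
print(interpret("K-correlated", Provenance.HARVESTED))
```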


In essence, the difference between journal article descriptions and metadata documentation is
Narrative. Journal articles need to tell a story, so the information they include is only that which
is relevant to the narrative, whereas metadata documentation is an attempt to include as much
as possible across the board. The general nature of metadata documentation is probably why it
has historically been perceived as such a boring task to complete.


PIMMS will make metadata documentation more fun by bringing back the Narrative. Once
PIMMS is established at an institution, users will be able to create generalised metadata having
only described those things that are relevant to the story of their experiment.

Harvested Metadata vs Documented Metadata