Lichens, Bryophytes and Climate Change - iDigBio

Nov 6, 2013

Edward Gilbert

Corinna Gries

Thomas H. Nash III

Robert Anglin


16 digitization centers


> 60 non-governmental US herbaria (95%)


Mexico, US, Canada


~ 2.3 million specimens


90% of all specimens


900,000 lichens


1.4 million bryophytes


http://lbcc.limnology.wisc.edu/


Lichen Consortium


http://lichenportal.org


Started in 2009


24 Collections


~ 797,916 Records


Bryophyte Consortium


http://bryophyteportal/


Started in 2010


16 Collections


1,059,063 Records

Imaging Stage

Capture Image

barcode in file name

Create Skeleton File

barcode, species name, exsiccati, etc.

Upload to FTP server

Image processing

extract barcode, create web versions, map to portal DBs

Duplicate Harvesting

Existing Herbarium Database

Automated Processing

OCR / NLP / Georeferencing

augmented with raw OCR, parsed fields, coordinates, etc.

Existing Record

simply link image

Upload to FTP server

Image URLs

Manage Specimen Data in Portal

Manage / Review Records in Portal

Symbiota Editor

review, edit, keystroke, and finalize

Create New Record

barcode, image, skeletal data
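
The imaging stage depends on the barcode being embedded in each image file name so that image processing can extract it later. A minimal sketch of that extraction, assuming a hypothetical naming scheme of an uppercase collection prefix followed by digits (for example `ASU0012345.jpg`); the slides do not state the actual scheme, and `barcode_from_filename` is an illustrative name:

```python
import re
from pathlib import PurePath
from typing import Optional

# Assumed (hypothetical) barcode shape: 2-6 capital letters, then 5-10 digits.
BARCODE_RE = re.compile(r"([A-Z]{2,6}\d{5,10})")


def barcode_from_filename(filename: str) -> Optional[str]:
    """Return the barcode embedded in an image file name, or None if absent."""
    stem = PurePath(filename).stem  # drop directories and the extension
    match = BARCODE_RE.search(stem)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in ("uploads/ASU0012345_2.tif", "IMG_0001.jpg"):
        print(name, "->", barcode_from_filename(name))
```

Files that yield no barcode would need manual handling before they can be mapped to a portal record.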


Image all specimens / specimen labels


Collect and load skeletal data


Barcode, scientific name, country, state


Upload to portal


Record exists => link image to existing record


Record absent => create empty “unprocessed” record


Automated OCR label


Block of raw text => database


Automated NLP (field parsing)


Review data


Keystroke full record


Collector name & number => look for dups


Reparse full record => learnable parsers
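
The upload rule in the list above (link the image when a record exists, otherwise create an empty “unprocessed” record) can be sketched as follows. This is an illustration only: the `Record` class, the in-memory dictionary standing in for the portal database, and the `attach_image` function are all hypothetical and not part of Symbiota.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Hypothetical stand-in for a portal specimen record."""
    barcode: str
    status: str = "unprocessed"
    fields: Dict[str, str] = field(default_factory=dict)
    image_urls: List[str] = field(default_factory=list)


def attach_image(db: Dict[str, Record], barcode: str, image_url: str) -> Record:
    """Link an image to the record for `barcode`, creating the record if needed."""
    record = db.get(barcode)
    if record is None:
        # Record absent: create an empty "unprocessed" record.
        record = Record(barcode=barcode)
        db[barcode] = record
    # Record exists (or was just created): link the image.
    record.image_urls.append(image_url)
    return record


if __name__ == "__main__":
    db = {"ASU0000001": Record("ASU0000001", status="reviewed")}
    attach_image(db, "ASU0000001", "http://example.org/a.jpg")
    attach_image(db, "ASU0000002", "http://example.org/b.jpg")
    for rec in db.values():
        print(rec.barcode, rec.status, rec.image_urls)
```

Keying on the barcode means skeletal data and images can arrive in either order without creating two records for the same specimen.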


Tesseract V3


Dual cycle


Automatic


Manual review


Expected hurdles


Handwritten labels


Old fonts


Faded labels


Form labels


Adjustable image variables



[Example of raw OCR output from a specimen label: mostly unreadable characters, with only fragments such as "State University", "Station", "ELEV.", "DATE", and "8/27/73" recognizable]

PLANTS OF NEW r~1ExIco
Herbarium of Arizona State University
Parmelia ulophyllodes (Vain.) Sav.
COUNTY
Joranada Experimental Station - New Mexico State University
on Juniperus
ELEV. ‘ 4400
EEILLEETUR DATE
DU T. H. Nash #7914 8/27/73
T. H. N.


1. Iterate through new “unprocessed” images
   1. 81439 bryophyte images
   2. 147122 lichen images
2. OCR via Tesseract (version 3)
   a) Untreated image
   b) Treated image (contrast, brightness, etc.)
3. Store raw text linked to skeletal record
4. Progress to next step
   1. Low OCR return => hand processing
   2. “Unprocessed-OCR” => NLP
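
The OCR cycle above (try the untreated and the treated image, store the text, then route the record) might look roughly like the sketch below. The OCR engine is passed in as a plain callable so the example does not assume any particular Tesseract binding. The scoring heuristic (share of characters that sit inside words of three or more letters) and the 0.5 threshold are assumptions chosen for illustration; the slides do not say how a "low OCR return" is measured.

```python
import re
from typing import Callable, Iterable, Tuple

WORD_RE = re.compile(r"[A-Za-z]{3,}")


def ocr_score(text: str) -> float:
    """Fraction of non-space characters that belong to words of 3+ letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_words = sum(len(word) for word in WORD_RE.findall(text))
    return in_words / len(chars)


def best_ocr(
    image: object,
    treatments: Iterable[Callable[[object], object]],
    run_ocr: Callable[[object], str],
) -> Tuple[str, float]:
    """OCR every treated variant of the image and keep the best-scoring text."""
    candidates = [run_ocr(treat(image)) for treat in treatments]
    if not candidates:
        return "", 0.0
    best_text = max(candidates, key=ocr_score)
    return best_text, ocr_score(best_text)


def next_status(score: float, threshold: float = 0.5) -> str:
    """Route a record: good OCR goes on to NLP, poor OCR to hand processing."""
    return "unprocessed-OCR" if score >= threshold else "hand processing"


if __name__ == "__main__":
    # Fake "OCR" for demonstration: each treatment returns canned text.
    noisy = "~,(.J ;T~; 1_1_"
    clean = "Herbarium of Arizona State University"
    text, score = best_ocr("label.jpg", [lambda i: noisy, lambda i: clean], lambda t: t)
    print(repr(text), round(score, 2), next_status(score))
```

In a real pipeline `run_ocr` would wrap the Tesseract call and each treatment would adjust contrast, brightness, or similar image variables before recognition.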



1. Iterate through raw OCR text blocks
   a) 147122 lichen OCR blocks
   b) 81439 bryophyte OCR blocks
2. Collector, number, and date
   a) Attempt duplicate harvesting
3. Field-by-field parsing
4. Full-parsing
5. Parsing based on NLP profiles
   1. E.g. targeted label formats
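
As an illustration of the collector, number, and date step, the sketch below pulls those three values out of a raw OCR block with a single regular expression. The pattern is an assumption modeled on the sample label line `T. H. Nash #7914 8/27/73` (initials, last name, `#` plus number, slash-separated date); other label formats would need their own patterns, and `parse_collector_line` is a hypothetical name.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<initials> <Lastname> #<number> <m/d/yy>"
COLLECTOR_RE = re.compile(
    r"(?P<collector>(?:[A-Z]\.\s*)+[A-Z][a-z]+)\s*"  # initials + last name
    r"#\s*(?P<number>\d+)\s+"                        # collection number
    r"(?P<date>\d{1,2}/\d{1,2}/\d{2,4})"             # date as printed
)


def parse_collector_line(raw_ocr: str) -> Optional[Dict[str, str]]:
    """Extract collector, last name, number, and date from raw OCR text."""
    match = COLLECTOR_RE.search(raw_ocr)
    if match is None:
        return None
    fields = match.groupdict()
    # Collapse any whitespace (including line breaks) inside the name.
    fields["collector"] = re.sub(r"\s+", " ", fields["collector"]).strip()
    fields["last_name"] = fields["collector"].split()[-1]
    return fields


if __name__ == "__main__":
    print(parse_collector_line("DU T. H. Nash #7914 8/27/73"))
```

The last name, number, and date returned here are the values a duplicate search would then use as its key.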





1. Extract collector data
   a) Last name, number, date
2. Harvest duplicates from consortium DB
   a) Exact duplicates
   b) Duplicate events
3. Compare returned records field-by-field
4. Compare fields with raw OCR
5. Populate fields that have high similarity indexes
6. Processing status: “pending review”
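
Steps 4 and 5 above (compare a duplicate's fields with the raw OCR, then populate only the fields with a high similarity index) could be approximated as in this sketch. It uses `difflib.SequenceMatcher` from the Python standard library as the similarity measure and a 0.8 cut-off; both are assumptions, since the slides do not specify the index or threshold, and the function names are illustrative.

```python
from difflib import SequenceMatcher
from typing import Dict


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not matter."""
    return " ".join(text.lower().split())


def similarity_to_ocr(value: str, raw_ocr: str) -> float:
    """Best match ratio between `value` and any same-length window of the OCR."""
    needle, haystack = _norm(value), _norm(raw_ocr)
    if not needle or not haystack:
        return 0.0
    if len(haystack) <= len(needle):
        return SequenceMatcher(None, needle, haystack).ratio()
    width = len(needle)
    return max(
        SequenceMatcher(None, needle, haystack[i:i + width]).ratio()
        for i in range(len(haystack) - width + 1)
    )


def populate_from_duplicate(
    duplicate: Dict[str, str], raw_ocr: str, threshold: float = 0.8
) -> Dict[str, str]:
    """Keep only the duplicate's fields that the raw OCR text supports."""
    return {
        name: value
        for name, value in duplicate.items()
        if value and similarity_to_ocr(value, raw_ocr) >= threshold
    }


if __name__ == "__main__":
    raw = (
        "Herbarium of Arizona State University\n"
        "Parmelia ulophyllodes (Vain.) Sav.\n"
        "Joranada Experimental Station - New Mexico State University\n"
        "on Juniperus"
    )
    dup = {"sciname": "Parmelia ulophyllodes", "locality": "Jornada Experimental Station"}
    print(populate_from_duplicate(dup, raw))
```

Because the comparison is fuzzy, a correctly spelled value from the duplicate record can still be accepted when the OCR text contains a small misreading of it; the record would then be set to “pending review” for a human check.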







1. Premise: Target similar label formats
2. Use raw OCR to locate “Nash” labels
3. Need to exclude:
   a) Determined by Nash
   b) Author of scientific name
   c) Associated collector
4. Test for similarity to target label format
5. Targeted parsing algorithms
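
Steps 2 and 3 of the list above can be sketched as a filter over the raw OCR text: find every “Nash” mention, then discard the ones where the preceding text suggests a determiner, a taxon author, or an associated collector. The cue patterns below are rough assumptions for illustration (for instance, any capitalized word followed by a lowercase word of three or more letters is treated as a binomial), so a real profile would need tuning against actual labels.

```python
import re
from typing import List

NASH = re.compile(r"(?:T\.\s*H\.\s*)?Nash\b")

# Hypothetical cues, tested against the text just before each mention.
DETERMINER = re.compile(r"(?:\bdet\.?|\bdetermined by)\s*:?\s*$", re.I)
ASSOCIATE = re.compile(r"(?:\bwith|\band|&)\s*$", re.I)
# Crude binomial check: "Genus epithet" optionally followed by "(Author)".
AUTHOR = re.compile(r"\b[A-Z][a-z]+ [a-z]{3,}\s*(?:\([^)]*\)\s*)?$")


def collector_mentions(raw_ocr: str) -> List[str]:
    """Return the 'Nash' mentions that no exclusion cue rules out."""
    kept = []
    for line in raw_ocr.splitlines():
        for match in NASH.finditer(line):
            before = line[:match.start()]
            if any(rx.search(before) for rx in (DETERMINER, ASSOCIATE, AUTHOR)):
                continue  # determiner, associated collector, or taxon author
            kept.append(match.group(0))
    return kept


def is_nash_label(raw_ocr: str) -> bool:
    """True when at least one mention looks like Nash as primary collector."""
    return bool(collector_mentions(raw_ocr))


if __name__ == "__main__":
    print(is_nash_label("DU T. H. Nash #7914 8/27/73"))
    print(is_nash_label("Det. T. H. Nash 1975"))
```

Labels that pass this filter would then go on to the format-similarity test and the targeted parser.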