Text Mining: Opportunities and Barriers


15 Nov 2013



John McNaught

Deputy Director

National Centre for Text Mining

John.McNaught@manchester.ac.uk


Topics


What is text mining? (briefly)


What can it offer? (selectively)


What are the obstacles? (mostly)


NaCTeM


First publicly-funded (JISC) national text mining centre in the world


Remit: provide services to research community


Initial focus on biology, then social sciences, medicine, chemistry, …


Processing on a large scale, e.g. for UKPMC (Wellcome Trust + 17 other funders)


www.nactem.ac.uk


What is text mining?


Goal: Discover new knowledge from old


How:


Process very large amounts of text


Millions of documents, the more the better


Identify and extract information


(Link extracted information to already curated knowledge)


Mine to discover implicit significant associations


Flag (unknown) associations for researcher to investigate further


Spin-off on the way: render information explicit


From text to new knowledge
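
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not NaCTeM's method: the entity names (`DiseaseX`, `GENE1`, ...), the dictionary lookup used for extraction, the `CURATED` set and the `min_support` threshold are all hypothetical placeholders. It shows extraction of disease and gene mentions, counting of co-occurrences across a corpus, and flagging of frequent pairs that are absent from curated knowledge.

```python
from collections import Counter
from itertools import product

# Toy vocabularies (illustrative placeholders, not real biomedical terms).
DISEASES = {"diseasex", "diseasey"}
GENES = {"gene1", "gene2", "gene3"}

# Already curated knowledge: associations a database would already hold.
CURATED = {("diseasex", "gene1")}


def extract_pairs(sentence):
    """Identify disease and gene mentions and pair those that co-occur."""
    tokens = {t.strip(".,;()").lower() for t in sentence.split()}
    return set(product(tokens & DISEASES, tokens & GENES))


def mine(sentences, min_support=2):
    """Count co-occurring pairs and flag frequent ones missing from CURATED."""
    counts = Counter()
    for s in sentences:
        counts.update(extract_pairs(s))
    return sorted(
        pair for pair, n in counts.items()
        if n >= min_support and pair not in CURATED
    )


corpus = [
    "GENE1 is overexpressed in DiseaseX.",
    "DiseaseX patients show GENE1 mutations.",
    "GENE2 variants were observed in DiseaseX.",
    "We report GENE2 expression changes in DiseaseX tissue.",
    "GENE3 was measured in DiseaseY.",
]
flagged = mine(corpus)  # candidate associations for a researcher to check
```

Real systems replace the dictionary lookup with trained entity and relation recognisers and replace raw counts with statistical association measures, but the shape of the pipeline is the same: extract, link to curated knowledge, mine, flag.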


What does it offer?


Finds unsuspected knowledge


E.g. disease-gene associations


Enables discoveries human effort could not achieve (information overload/overlook)


Enables better search/navigation of literature


Semantic search via extracted semantic metadata


Reduces time spent searching


15-48% of researcher time spent on classic search, 20-50% of classic searches unsatisfied


E.g. Systematic reviews: months to weeks
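
As a rough sketch of what "semantic search via extracted semantic metadata" can mean in practice: documents are indexed by typed facts produced by an extraction step, and queries ask for semantic types and values rather than raw keyword strings. The annotations, document ids and function names below are hypothetical and purely illustrative.

```python
from collections import defaultdict

# Hypothetical output of an extraction step:
# document id -> list of (semantic type, normalised value) annotations.
ANNOTATIONS = {
    "doc1": [("gene", "GENE1"), ("disease", "DiseaseX")],
    "doc2": [("gene", "GENE2"), ("disease", "DiseaseX")],
    "doc3": [("gene", "GENE1"), ("disease", "DiseaseY")],
}


def build_index(annotations):
    """Build an inverted index from each typed fact to the documents containing it."""
    index = defaultdict(set)
    for doc_id, facts in annotations.items():
        for fact in facts:
            index[fact].add(doc_id)
    return index


def semantic_search(index, *facts):
    """Return the documents annotated with all of the requested facts."""
    hits = [index.get(f, set()) for f in facts]
    return sorted(set.intersection(*hits)) if hits else []


index = build_index(ANNOTATIONS)
results = semantic_search(index, ("gene", "GENE1"), ("disease", "DiseaseX"))
```

Because the query names a semantic type, a search for the gene `GENE1` would not match a document that merely contains the same string in another sense, which is one way such search can cut down on unsatisfied queries.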


What does it offer?


Text mining boosts research


Makes research possible that would otherwise be impossible or unfeasible


Research drives growth and innovation


Research produces more information


More information is available for text mining


Text mining boosts research …



Barriers


Access to the literature


Format issues (tied to next point…)


“PDF is evil” (Lynch)


Main blocks: copyright and licensing issues


<8% of scientific claims found in full article appear in its abstract (Blake)


Abstracts deficient on argumentation, discussion, methods, background, …


Full texts needed to realise full benefits of TM



Barriers


Need to copy documents to analyse them


Licences typically not favourable to TM


Licences established on per-institution basis


Prevents community-oriented services


Results only for internal use by institutional users


Hinders mining over collections of content from different providers


Inconsistency: a human can search and manually analyse, but cannot use a machine to do the same job on the same data already subscribed to

Barriers


Problem even with liberal OA licences


Author attribution required


Author attribution in a data mining environment is impossible/unfeasible


Association finding: cannot track positive, negative, neutral individual author contributions


Derived works in a TM environment


Every author of every text processed to produce new derived knowledge may have a claim…



Rights clearance thus an effective barrier



Barriers


Laudable effort 1: NESLi2 model licence (JISC Collections) allows TM


Publisher <> single institution


But how many publishers retain TM provisions?


But cannot display annotations produced by TM on document itself


Laudable effort 2: NPG licence for self-archived content allows TM


But “content must be destroyed when experiment complete” is vague. So services for community?


Conclusion


Copyright and licensing restrictions block full realisation of TM benefits


Economic savings and potential for growth are stifled


Japan has introduced an information analysis exception to copyright law


National Diet Library (Japan's equivalent of the British Library) has recently changed its motto to:




“Through knowledge we prosper”


Can we say the same in the UK?

Extras


Info=degree of surprise
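
The slide does not say which measure is meant. One standard way to formalise "information as degree of surprise" is Shannon self-information (surprisal), where an event with probability p carries -log2(p) bits: the rarer the event, the more informative it is. A minimal sketch under that assumption:

```python
import math


def surprisal(p):
    """Self-information in bits of an event with probability p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)


certain = surprisal(1.0)    # an expected event carries no information
coin = surprisal(0.5)       # a fair coin flip carries one bit
rare = surprisal(0.125)     # a 1-in-8 event carries three bits
```

On this reading, an association that co-occurs far more often than chance would predict is the "surprising" one worth flagging.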

Finding unknown associations: reproducing a discovery reported 5 days ago in Nature Medicine

UKPMC EvidenceFinder by NaCTeM:

Questions generated by deep analysis, with known answers

Click on a question to see relevant extracted evidence

(from OA subset of the archive)