Text Mining: Opportunities and Barriers
John McNaught
Deputy Director
National Centre for Text Mining
John.McNaught@manchester.ac.uk
Topics
• What is text mining? (briefly)
• What can it offer? (selectively)
• What are the obstacles? (mostly)
NaCTeM
• First publicly-funded (JISC) national text mining centre in the world
• Remit: provide services to research community
• Initial focus on biology, then social sciences, medicine, chemistry, …
• Processing on a large scale, e.g. for UKPMC (Wellcome Trust + 17 other funders)
• www.nactem.ac.uk
What is text mining?
• Goal: Discover new knowledge from old
• How:
  – Process very large amounts of text
    • Millions of documents, the more the better
  – Identify and extract information
  – (Link extracted information to already curated knowledge)
  – Mine to discover implicit significant associations
  – Flag (unknown) associations for researcher to investigate further
  – Spin-off on the way: render information explicit
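The steps listed above can be illustrated with a toy sketch. It is not NaCTeM's method: the lexicons, the documents, the `CURATED` set and the function names are all hypothetical, dictionary lookup stands in for real entity recognition, and simple co-occurrence counting stands in for proper statistical association mining.

```python
from collections import Counter
from itertools import product

# Hypothetical toy lexicons; a real system would use trained entity recognisers.
DISEASES = {"diseasex", "diseasey"}
GENES = {"genea", "geneb"}

# Hypothetical "already curated knowledge": associations that are already known.
CURATED = {("diseasey", "geneb")}


def extract_entities(text):
    """Identify and extract disease and gene mentions by dictionary lookup."""
    tokens = {tok.strip(".,;()").lower() for tok in text.split()}
    return tokens & DISEASES, tokens & GENES


def mine_associations(documents, min_support=2):
    """Count disease-gene co-occurrences per document; keep the frequent pairs."""
    counts = Counter()
    for doc in documents:
        diseases, genes = extract_entities(doc)
        counts.update(product(sorted(diseases), sorted(genes)))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def flag_unknown(associations, curated):
    """Flag associations absent from curated knowledge for a researcher to check."""
    return sorted(pair for pair in associations if pair not in curated)


# Invented example sentences standing in for millions of documents.
documents = [
    "DiseaseX patients showed altered expression of GeneA.",
    "GeneA variants were frequent in the DiseaseX cohort.",
    "GeneB is a known marker of DiseaseY.",
    "DiseaseY severity tracked GeneB levels.",
    "One DiseaseY sample also carried GeneA.",
]

associations = mine_associations(documents)
candidates = flag_unknown(associations, CURATED)
print(candidates)
```

With more documents the counts become more reliable, which is one reading of "the more the better": a pair seen once, such as DiseaseY with GeneA here, falls below the support threshold and is not flagged.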
From text to new knowledge
What does it offer?
• Finds unsuspected knowledge
  – E.g. Disease-gene associations
• Enables discoveries human effort could not achieve (information overload/overlook)
• Enables better search/navigation of literature
  – Semantic search via extracted semantic metadata
• Reduces time spent searching
  – 15-48% of researcher time spent on classic search, 20-50% of classic searches unsatisfied
• E.g. Systematic reviews: months to weeks
What does it offer?
• Text mining boosts research
  – Makes research possible that would otherwise be impossible or unfeasible
• Research drives growth and innovation
• Research produces more information
• More information is available for text mining
• Text mining boosts research …
Barriers
• Access to the literature
• Format issues (tied to next point…)
  – “PDF is evil” (Lynch)
• Main blocks: copyright and licensing issues
  – <8% of scientific claims found in full article appear in its abstract (Blake)
  – Abstracts deficient on argumentation, discussion, methods, background, …
  – Full texts needed to realise full benefits of TM
Barriers
• Need to copy documents to analyse them
• Licences typically not favourable to TM
• Licences established on a per-institution basis
  – Prevents community-oriented services
• Results only for internal use by institutional users
  – Hinders mining over collections of content from different providers
• Inconsistency: a human can search and manually analyse, but cannot use a machine to do the same job on the same data already subscribed to
Barriers
• Problem even with liberal OA licences
  – Author attribution required
• Author attribution in a data mining environment is impossible/unfeasible
  – Association finding: cannot track positive, negative, neutral individual author contributions
• Derived works in a TM environment
  – Every author of every text processed to produce new derived knowledge may have a claim…
  – Rights clearance thus an effective barrier
Barriers
• Laudable effort 1: NESLi2 model licence (JISC Collections) allows TM
  – Publisher <> single institution
  – But how many publishers retain TM provisions?
  – But cannot display annotations produced by TM on document itself
• Laudable effort 2: NPG licence for self-archived content allows TM
  – But “content must be destroyed when experiment complete” is vague. So services for community?
Conclusion
• Copyright and licensing restrictions block full realisation of TM benefits
  – Economic savings and potential for growth are stifled
• Japan has introduced an information analysis exception to copyright law
  – National Diet Library (= British Library) has recently changed its motto to: “Through knowledge we prosper”
  – Can we say the same in the UK?
Extras
Info=degree of surprise
Finding unknown associations:
reproducing a discovery reported 5 days ago in Nature Medicine
UKPMC EvidenceFinder by NaCTeM:
Questions generated by deep analysis, with known answers
Click on a question to see relevant extracted evidence (from OA subset of the archive)