Research Proposal - University of South Australia


BACHELOR OF COMPUTER SCIENCE (HONOURS) - UNIVERSITY OF SOUTH AUSTRALIA

Curating Biomedical Literature using Text Mining

Research Proposal

Samuel O’Malley

110015053


OYMSJ001

31st of May 2012


Supervisor:

Professor Jiuyong Li

Associate Supervisor:

Dr Jixue Liu





Abstract

Biomedical literature is growing exponentially, and manual curation processes cannot record
the reported facts fast enough. Advances in natural language processing and text mining
enable computers to assist in the curation process by categorising data into meaningful groups,
so that curators only see the literature they are looking for. These tools can also be powerful
enough to curate the data automatically, without any human input. A few solutions currently
exist for automatically discovering protein-protein interactions from biomedical literature;
however, there is a clear lack of tools for microRNA literature. MicroRNA research is
increasing as the technology for deep sequencing becomes cheaper and interest in microRNA
grows. MicroRNA recognition is challenging because of the large number of synonyms and the
large number of species referred to in the literature. The research proposed here will provide
a solution to microRNA recognition and attempt to automatically extract information from
biomedical literature abstracts and generate a structured database of facts.


Contents

1. Introduction
1.1 Background and Motivation
1.2 Research Question
1.2.1 microRNA Entity Recognition
1.2.2 microRNA Relationship Detection
1.3 Justification
2. Literature Review
2.1 Text Mining
2.1.1 Definition of Text Mining
2.1.2 Entity Recognition
2.1.3 Information Retrieval
2.1.4 Information Extraction
2.2 Mining Biomedical Literature
2.2.1 DRENDA: Disease Related Enzyme information Database
2.2.2 Gene Name Normalisation
2.2.3 BioPPIExtractor: Protein-Protein Interaction Extractor
2.2.4 Biolexicon
2.2.5 miRCancer
3. Methodology
3.1 Data Acquisition
3.2 Pre-processing
3.3 Entity Recognition
3.4 Relationship Analysis
3.5 Results Analysis
3.6 Expected Results
4. Project Schedule
5. Summary
6. References





List of Figures

Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)
Figure 2: Process flow diagram

List of Tables

Table 1: Structured Database Output - randomly chosen examples







1. Introduction

1.1 Background and Motivation

MicroRNA are tiny single-stranded lengths of non-coding RNA which inhibit protein production
in our cells. They occur naturally in the body and have the potential to cure a disease or
condition. Current microRNA research is aimed at discovering the links between different
microRNA and protein production. Researchers also aim to artificially introduce microRNA
into cells to reduce problem proteins, potentially curing cancers or other diseases
(Xie 2010; Liu et al. 2012; Selth et al. 2012; Zhang et al. 2012).

MicroRNA research, measured by the number of published articles and journals, is increasing
considerably as the technology becomes cheaper and it becomes relatively easier to discover
new microRNA. Although their existence was discovered in 1993 by the American molecular
biologist Victor Ambros, the technology used to discover and sequence new microRNA has only
been widely available for a few short years (Roads 2010).

Due to this volume of new research, the data needs to be represented in a structured format
in order to be useful. Currently the databases used to store this information are curated
manually by teams of domain experts; however, these databases do not adequately reflect the
current state of research, and no single researcher can keep up with all of the literature in
their field (Jensen, Saric & Bork 2006). Literature mining tools are becoming essential for
researchers, enabling them to narrow the information down to only the relevant publications
and potentially to discover new information.

Automatic curation methods using text mining have already been developed for other fields in
biology, such as protein-protein interactions; however, these methods cannot be directly
applied to microRNA due to limitations discussed later in this proposal.

1.2 Research Question

The overall aim of this research is to determine an effective technique for extracting
information about microRNA interactions from biomedical literature. This research can be
split into two problems:

1. Recognising microRNA occurrences and removing ambiguity.
2. Determining the relationship between co-mentioned microRNA and other biological entities.



1.2.1 microRNA Entity Recognition

This research will endeavour to accurately detect occurrences of microRNA in biomedical
literature. This is challenging because each microRNA has many synonyms and its name can be
ambiguous.

1.2.2 microRNA Relationship Detection

MicroRNA can occur in the same sentence as many different types of biological terms.
Relationship detection will take the microRNA and the other biological term and analyse the
relationship between them in order to classify the information as meaningful or not. An
example relationship would be “MicroRNA (A) inhibits Gene (B) production”, where A and B are
a microRNA and a gene name respectively.
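
The example relationship above can be sketched in code. The following is a minimal illustration only, assuming that both mentions have already been located and that a small, hypothetical list of joining verbs (INTERACTION_VERBS) is available; neither the verb list nor the function name comes from this proposal, and the entity names in the usage note are placeholders.

```python
import re

# Hypothetical joining words; a real list would come from a curated lexicon.
INTERACTION_VERBS = {"inhibits", "represses", "targets", "regulates", "suppresses"}


def classify_relationship(sentence: str, mirna: str, entity: str) -> str:
    """Label a co-mention 'meaningful' if a joining verb lies between the two mentions."""
    text = sentence.lower()
    start = text.find(mirna.lower())
    end = text.find(entity.lower())
    if start == -1 or end == -1:
        # One of the two mentions is missing, so there is no co-mention to classify.
        return "not meaningful"
    lo, hi = sorted((start, end))
    # Words appearing between the start of the first mention and the second mention.
    between = set(re.findall(r"[a-z]+", text[lo:hi]))
    return "meaningful" if INTERACTION_VERBS & between else "not meaningful"
```

For instance, classify_relationship("miR-X inhibits GeneY production.", "miR-X", "GeneY") is labelled meaningful, whereas a sentence that merely lists both names without a joining verb is not.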

1.3 Justification

This research has similarities to current research in other fields of biomedical text mining,
such as protein-protein interaction detection and gene name normalisation (Crim, McDonald &
Pereira 2005; Sun et al. 2009; Gerold, Simon & Fabio 2011; Xia et al. 2011). However, because
microRNA is a relatively new field of research, there is a clear lack of tools for assisting
in curating microRNA information from biomedical literature. This research will adapt and
extend existing tools for similar biology fields and apply them to microRNA, as discussed in
the literature review in Section 2.2.



2. Literature Review

2.1 Text Mining

This section provides an overview of Text Mining research and current applications.

2.1.1 Definition of Text Mining

Data mining is the endeavour of discovering previously unknown information from data. Text
mining is a subset of data mining with the ultimate aim of discovering new information from
free text literature. The three parts of text mining are Entity Recognition (ER), Information
Retrieval (IR) and Information Extraction (IE).

2.1.2 Entity Recognition

Entity Recognition (ER) is a subset of text mining aimed at recognising important entities in
free text. For our research this includes recognising microRNA and gene names in biomedical
literature. Some challenges presented in ER research include disambiguating entity names and
normalisation.

2.1.3 Information Retrieval

Information Retrieval (IR) encompasses advanced queries which go beyond simple keyword
searches. IR includes entity recognition and clustering algorithms to provide better results
to a user’s query.

2.1.4 Information Extraction

Information Extraction (IE) goes one step beyond IR in that instead of providing results to a
query, it extracts facts from the literature and returns these instead of the full text.

2.2 Mining Biomedical Literature

This section will provide an overview of the more specific field of text mining biomedical
literature.

2.2.1 DRENDA: Disease Related Enzyme information Database

DRENDA is a system developed by Sohngen, Chang and Schomburg (2011) for detecting and
classifying disease-related enzyme information.




Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)

From the DRENDA workflow diagram in Figure 1 we see that the system uses the BRENDA
database (BRaunschweig ENzyme DAtabase) and the MeSH database (Medical Subject Headings) as
dictionaries for entity recognition. Literature is obtained by crawling PubMed and extracting
abstracts, and initial pre-processing such as sentence splitting is applied. A training
corpus is used to train the SVM (Support Vector Machine) algorithm, which generates a
classification model. Sentences with co-occurring disease and enzyme mentions are extracted
and this SVM classification model is applied. The result is a set of classified sentences,
which is evaluated using a test corpus. Correctly evaluated sentences are added to the DRENDA
database as facts.
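
The co-occurrence step of this workflow can be illustrated with a short sketch. It is a simplification under stated assumptions: the two term sets are hypothetical stand-ins for dictionaries built from BRENDA and MeSH, the sentence splitter is a naive regular expression, and the SVM classification stage is omitted entirely. It is not the DRENDA implementation.

```python
import re


def cooccurring_sentences(abstract, enzymes, diseases):
    """Return sentences that mention at least one enzyme term and one disease term.

    `enzymes` and `diseases` are sets of lowercase terms (hypothetical dictionaries).
    """
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    hits = []
    for sentence in sentences:
        low = sentence.lower()
        if any(e in low for e in enzymes) and any(d in low for d in diseases):
            hits.append(sentence)
    return hits
```

Only the sentences returned here would be passed on to a classifier; sentences mentioning a single entity type are discarded early, which keeps the classification task small.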

This system cannot be directly applied to microRNA literature; however, the workflow can be
followed closely. Before this system can be extended for microRNA literature, an appropriate
microRNA dictionary resource must be identified. The evaluation methods used by Sohngen et al.
are very thorough and evaluate multiple pre-processing methods in order to determine the best
ones.

2.2.2 Gene Name Normalisation

A problem with biomedical literature is that each entity has many different names and there
are complex naming conventions which might not be faithfully followed. Naming conventions
include capitalisation to represent different species; this convention might not be followed
if the context of the literature makes it clear which species is being discussed. Sun et al.
(2009) present a multi-level disambiguation framework for gene name normalisation. The authors
show that human genes have on average 5.5 synonyms for each identifier. While a human reader
would understand these using contextual clues, a machine has a much harder time.

Sun et al. introduce a context-awareness algorithm to disambiguate species amongst the
different synonyms used in the literature. For example, if the majority of genes mentioned
in a document are human genes, then we can reasonably assume that any ambiguous gene names in
the document are also human genes.
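
The majority heuristic in this example can be written down in a few lines. This is a deliberately simplified majority vote, not the multi-level framework of Sun et al.; the function name and the "unknown" default are assumptions made for illustration.

```python
from collections import Counter


def resolve_species(unambiguous_species, default="unknown"):
    """Pick the most frequent species among the unambiguous gene mentions in a document.

    Ambiguous gene names in the same document would then be assigned this species.
    """
    if not unambiguous_species:
        return default
    return Counter(unambiguous_species).most_common(1)[0][0]
```

Given the species of each unambiguous mention, for example ["human", "human", "mouse"], the function returns "human", which would then be applied to the ambiguous names.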

The authors use a maximum entropy model with binary classes of meaningful and not
meaningful to disambiguate gene names. This algorithm is similar to that of Crim, McDonald
and Pereira (2005), except that it uses more contextual cues to disambiguate gene names.

2.2.3 BioPPIExtractor: Protein-Protein Interaction Extractor

This system extracts protein-protein interactions from biomedical literature, using syntactic
grammar parsers to further understand the relationship between two proteins (Yang, Lin &
Wu 2009). The system was manually evaluated for precision and recall, and was found to
perform better than two other leading systems, BioRAT (Corney et al. 2004) and IntEx
(Silberztein 2000).

2.2.4 Biolexicon

The Biolexicon is a large-scale lexical resource of biological terms (Thompson et al. 2011).
It combines multiple data sources into one large resource which can be used at multiple stages
of the text mining process. This system uses its vast knowledge of biological terms to
discover new textual variants which do not occur in the database resources.

Although this system is very useful, it has no knowledge of microRNA entities. It can
nonetheless assist our efforts in microRNA detection because it has knowledge of
biology-specific verbs such as “retro-regulate”, which do not occur in a standard dictionary
(BOOTStrep Bio-Lexicon 2012).

2.2.5 miRCancer

MiRCancer is a comprehensive database for microRNA expression profiles in human cancers
based on experimental results (Xie 2010). Essentially this framework is specifically designed
to uncover relationships between microRNA and cancers in biomedical literature.

A limitation of this system is that the relationship between the microRNA and the cancer is
not detected or analysed. This could result in false positives or unimportant data in the
miRCancer database.



3. Methodology

The following diagram (Figure 2) represents the process flow that our program will take. The
order reflects the text mining process and will closely match the structure of the software.

Figure 2: Process flow diagram

3.1 Data Acquisition

Data will be acquired from the PubMed open access database and will only include titles and
abstracts. There are two reasons for only extracting titles and abstracts. Firstly, titles
and abstracts are freely available and do not require any complex PDF processing, which
reduces the complexity and processing time of our algorithm. Secondly, the work by Wei and
Collier (2011) suggests that most of the important terms are mentioned in the abstract and
title, and repeated with more detail in the Introduction, Results and Conclusion sections.
This suggests that if there are no occurrences of microRNA in the title or abstract, then the
full paper is unlikely to be worth reading. To future-proof our research, all abstracts will
be stored in a MySQL database and paired with the permanent URL in order to allow full-text
downloads at a later date.
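
The storage step can be sketched as follows. This is an illustration under assumptions, not the project's schema: Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained, the table and column names are hypothetical, and the permanent URL is assumed to follow the pattern https://pubmed.ncbi.nlm.nih.gov/<PMID>/. Crawling PubMed itself is not shown.

```python
import sqlite3


def make_store(path=":memory:"):
    """Open the database (SQLite here, as a stand-in for MySQL) and create the table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abstracts ("
        "pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT, url TEXT)"
    )
    return conn


def save_abstract(conn, pmid, title, abstract):
    """Store one title and abstract, paired with an assumed permanent PubMed URL."""
    url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"  # assumed URL pattern
    conn.execute(
        "INSERT OR REPLACE INTO abstracts VALUES (?, ?, ?, ?)",
        (pmid, title, abstract, url),
    )
    conn.commit()
    return url
```

Keying the table on the PubMed identifier means a repeated crawl overwrites a record rather than duplicating it, and the stored URL is enough to fetch the full text later.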

[Figure 2 stages, in process order: Data Acquisition (crawl PubMed database, extract
abstracts); Preprocessing (tokenization, stop word removal); Entity Recognition (miRBase
dictionary, disambiguation); Relationship Analysis (classify relationship based on joining
words); Results Analysis (precision, recall)]


3.2 Pre-processing

Common text mining pre-processing tasks will be applied to our data. Firstly, tokenisation
will be applied to separate each sentence into tokens (words without any punctuation). Then
commonly occurring English language words, called stop words, will be removed. MicroRNA and
other medical entities will then be removed from the sentence in order to avoid confusing the
classification algorithm. Completely removing medical entities has been shown to perform
better in classification tasks than replacing them with a generic word (Sohngen, Chang &
Schomburg 2011).
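
A minimal sketch of these pre-processing steps is given below. The stop word list is a tiny illustrative sample rather than a real list, the function name is hypothetical, and for simplicity entity mentions are stripped before tokenisation rather than after it, which is an assumption of this sketch and not a design decision of the proposal.

```python
import re

# Tiny illustrative stop word list; a full list would be used in practice.
STOP_WORDS = {"the", "of", "in", "and", "a", "is", "was", "by", "to"}


def preprocess(sentence, entities):
    """Remove known entity mentions, tokenise, and drop stop words."""
    text = sentence.lower()
    # Remove longer names first so multi-word entities are dropped whole.
    for name in sorted(entities, key=len, reverse=True):
        text = text.replace(name.lower(), " ")
    # Tokens are runs of letters and digits, so punctuation is discarded.
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

With the placeholder entities "miR-X" and "GeneY", the sentence "The expression of miR-X was inhibited by GeneY." reduces to the tokens "expression" and "inhibited", which are the words left for the classifier.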

3.3 Entity Recognition

The miRBase database will be used to facilitate microRNA entity recognition (Kozomara &
Griffiths-Jones 2011). This database contains manually curated microRNA information,
including sequence data, which is the unique sequence of nucleotides that makes up a
microRNA. The most useful information contained in this database is the set of synonyms used
to refer to individual microRNA, together with a unique identifier which can be used to refer
to a microRNA without any ambiguity. Various microRNA databases have been evaluated for
biomedical applications, and miRBase was shown to be an extensive resource valuable for
annotation tasks (Tan Gana, Victoriano & Okamoto 2012).
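
Dictionary-based recognition of this kind can be sketched as a synonym lookup. The entries and identifiers below are invented placeholders, not miRBase records, and the matching rule (whole-term, case-insensitive) is an assumption; a real dictionary would be loaded from miRBase.

```python
import re

# Hypothetical synonym -> identifier entries; real ones would be loaded from miRBase.
SYNONYMS = {
    "mir-x": "ID-0001",
    "microrna-x": "ID-0001",
    "mirna-x": "ID-0001",
    "mir-y": "ID-0002",
}


def recognise(sentence):
    """Return (mention, identifier) pairs for dictionary terms found in the sentence."""
    found = []
    low = sentence.lower()
    for term, ident in SYNONYMS.items():
        # Require that the term is not embedded in a longer word or hyphenated name.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        for match in re.finditer(pattern, low):
            found.append((sentence[match.start():match.end()], ident))
    return found
```

Because several synonyms map to the same identifier, different spellings of one microRNA are normalised to a single unambiguous key, which is the property the database output relies on.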

3.4 Relationship Analysis


3.5 Results Analysis

Precision and recall are the standard measures for evaluating text mining algorithms. However,
there is no gold standard available for microRNA literature, so a manual analysis will need to
be performed. A small test dataset will be compiled manually and used to evaluate our
algorithm. Precision and recall can be used to compare different algorithms even across
different fields, which means that our algorithm can be compared to existing algorithms which
do not relate to microRNA. This is useful because there is currently very little research into
automatic microRNA curation.
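
For reference, precision is the fraction of extracted facts that are correct, TP / (TP + FP), and recall is the fraction of correct facts that were extracted, TP / (TP + FN). A minimal sketch of the calculation against a manually compiled test set follows; representing each fact as a (microRNA, entity) pair is an assumption made for illustration.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted facts against a manually curated set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

If the algorithm extracts three pairs of which two appear in a gold set of four, precision is 2/3 and recall is 2/4.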



3.6 Expected Results

Table 1: Structured Database Output - randomly chosen examples

MicroRNA        Entity                                  Class
hsa-mir-150     alpha-1-B glycoprotein                  Meaningful
hsa-mir-7a-1    apoptosis-associated tyrosine kinase    Meaningful

If the research is successful, the outcome will be a structured database in which each record
contains a microRNA, another biological entity (initially gene names, later expanding to
include diseases and other entities) and a classification (see Table 1). At this stage the
classification is binary, either meaningful or not meaningful; however, after analysis of the
returned data we might need to introduce further classes.




4. Project Schedule

This section outlines the proposed high-level schedule of the research project.

Date         Task
February     Literature Review
March
April
May          Research Proposal
June
July         Data Acquisition (Section 3.1)
             Pre-processing (Section 3.2)
August
September    Entity Recognition (Section 3.3)
             Relationship Analysis (Section 3.4)
October      Testing and Evaluation (Section 3.5)
             Preparation of Thesis
November






5. Summary

This research project is motivated by the aim of combining computing power with biomedical
domain knowledge to assist in the process of curating microRNA literature. Even though no
algorithm will be infallible or able to replace the manual curation process completely, the
added speed of computer processing will greatly benefit the curator’s task. A challenge
addressed in this research is recognising microRNA entities and their variations in biomedical
literature.



6. References

BOOTStrep Bio-Lexicon 2012, The National Centre for Text Mining - University of Manchester,
<http://www.nactem.ac.uk/biolexicon/>.

Corney, DPA, Buxton, BF, Langdon, WB & Jones, DT 2004, 'BioRAT: extracting biological
information from full-length papers', Bioinformatics, vol. 20, no. 17, November 22, 2004,
pp. 3206-3213.

Crim, J, McDonald, R & Pereira, F 2005, 'Automatically annotating documents with normalized
gene lists', BMC Bioinformatics, vol. 6, no. Suppl 1, p. S13.

Gerold, S, Simon, C & Fabio, R 2011, 'Detection of interaction articles and experimental
methods in biomedical literature', BMC Bioinformatics, vol. 12, no. Suppl 8, p. S13.

Jensen, LJ, Saric, J & Bork, P 2006, 'Literature mining for the biologist: from information
retrieval to biological discovery', Nat Rev Genet, vol. 7, no. 2, pp. 119-129.

Kozomara, A & Griffiths-Jones, S 2011, 'miRBase: integrating microRNA annotation and
deep-sequencing data', Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157.

Liu, J, Gao, J, Du, Y, Li, Z, Ren, Y, Gu, J, Wang, X, Gong, Y, Wang, W & Kong, X 2012,
'Combination of plasma microRNAs with serum CA19-9 for early detection of pancreatic cancer',
Int J Cancer, vol. 131, no. 3, Aug 1, pp. 683-691.

Roads, RE 2010, Progress in Molecular and Subcellular Biology, Springer, Shreveport LA.

Selth, LA, Townley, S, Gillis, JL, Ochnik, AM, Murti, K, Macfarlane, RJ, Chi, KN, Marshall,
VR, Tilley, WD & Butler, LM 2012, 'Discovery of circulating microRNAs associated with human
prostate cancer using a mouse model of disease', Int J Cancer, vol. 131, no. 3, Aug 1,
pp. 652-661.

Silberztein, M 2000, 'INTEX: an FST toolbox', Theoretical Computer Science, vol. 231, no. 1,
pp. 33-46.

Sohngen, C, Chang, A & Schomburg, D 2011, 'Development of a classification scheme for
disease-related enzyme information', BMC Bioinformatics, vol. 12, no. 1, p. 329.

Sun, C-J, Wang, X-L, Lin, L & Liu, Y-C 2009, 'A Multi-level Disambiguation Framework for
Gene Name Normalization', Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197.

Tan Gana, NH, Victoriano, AFB & Okamoto, T 2012, 'Evaluation of online miRNA resources for
biomedical applications', Genes to cells : devoted to molecular & cellular mechanisms,
vol. 17, no. 1, pp. 11-27.

Thompson, P, McNaught, J, Montemagni, S, Calzolari, N, del Gratta, R, Lee, V, Marchi, S,
Monachini, M, Pezik, P, Quochi, V, Rupp, C, Sasaki, Y, Venturi, G, Rebholz-Schuhmann, D &
Ananiadou, S 2011, 'The BioLexicon: a large-scale terminological resource for biomedical
text mining', BMC Bioinformatics, vol. 12, no. 1, p. 397.

Wei, Q & Collier, N 2011, 'Towards classifying species in systems biology papers using text
mining', BMC Research Notes, vol. 4, no. 1, p. 32.

Xia, N, Lin, H, Yang, Z & Li, Y 2011, 'Combining multiple disambiguation methods for gene
mention normalization', Expert Systems With Applications, vol. 38, no. 7, pp. 7994-7999.

Xie, B 2010, 'miRCancer: a microRNA-Cancer Association Database and Toolkit Based on Text
Mining'.

Yang, Z, Lin, H & Wu, B 2009, 'BioPPIExtractor: A protein-protein interaction extraction
system for biomedical literature', Expert Systems With Applications, vol. 36, no. 2,
pp. 2228-2233.

Zhang, J, Zhao, H, Gao, Y & Zhang, W 2012, 'Secretory miRNAs as novel cancer biomarkers',
Biochim Biophys Acta, vol. 1826, no. 1, Aug, pp. 32-43.