
VISUALISATION MECHANISMS

Supporting Genotype Analysis





Evangelos Pafilis










A dissertation submitted in part fulfillment of the requirements of the
Degree of

M.Res. in Bioinformatics at the University of Glasgow


September 2003


Abstract

Genome visualisation is an important tool which supports the analysis of
genes involved in normal and abnormal activities of living organisms.
Responding to the emerging demands for genotype analysis data display
and genome variation rendering and retrieval, we propose a prototype
web-based application that aims to address both of these issues. We
experiment with using an existing visualisation tool in a novel context,
implement software that performs the visualisation logic and develop the
required data storage facilities. The prototype we produced demonstrates
how an implementation with a wider data coverage could satisfy some of
the bioinformatics requirements arising from genetics research.


Acknowledgements

I would like to thank the following people for their help with the project:

Dr. Ela Hunt, for her support and guidance throughout the project

Dr. Fadi Charchar, for helping me understand the biological background
of the project

Dr. Bailey, for sharing his thoughts on bioinformatics issues

Mr. Andy Jones, for sharing his ideas and opinions

Dr. David Leader, Dr. Neil Hanlon, Ms. Eilidh Grant and Ms. Suzan
Fairley, for their assistance

Mr. Micha Bayer and Mr. Chris Wu, for their help with technical issues

Mr. Jim Tourtouras and Ms. Areti Galani, for their continuous support


Contents

Abstract
Acknowledgements
Abbreviations
1. Introduction
1.1 Need for visualisation
1.2 Project stimulus
1.3 Genotype Analysis Data
1.4 Single Nucleotide Polymorphisms (SNPs)
1.5 Project Approach
2. Design
2.1 Preparation
2.1.1 Literature and Web Resource Survey
2.1.2 Requirements Capture
2.2 Genotype Analysis Data Visualisation Design
2.3 Y Chromosome SNP Visualisation
2.4 Y Chromosome SNP Database Data Source Design
2.5 Overall Application Design
2.6 Y Chromosome SNP Database design
3. Implementation
3.1 YDB Implementation
3.1.2 Implementing the Relational Schema
3.1.3 Inserting Data
3.2 Presentation Tier
3.3 Application Tier
3.3.1 Genotype Analysis Data Visualisation Middleware
3.3.2 Y Chromosome Variation Visualisation Middleware
3.4 Implementation Summary
4. Testing and Evaluation
4.1 Testing
4.2 Evaluation
5. Discussion
5.1 Design Decisions
5.1.1 Only dbSNP as the Data Source for YDB
5.1.2 Not only SNPs
5.2 Implementation Decisions
5.2.1 MySQL as the RDBMS
5.2.2 Parsers in the Java Language
5.2.3 Both CGI and Servlets as Middleware Technology
5.2.4 DerBrowser as the Visualisation Tool
5.3 Application Criticism: Achievements and Drawbacks
5.4 Further Improvement
6. Conclusion
References
A. Literature
B. Internet Resources
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F

Abbreviations

ASN1: Abstract Syntax Notation One

BPG: Blood Pressure Group

BHF: British Heart Foundation


CGI: Common Gateway Interface

DIP: Deletion Insertion Polymorphism

HTML: Hyper Text Mark up Language

HTTP: Hyper Text Transfer Protocol

JDBC: Java DataBase Connectivity

RDBMS: Relational DataBase Management System

STR: Short Tandem Repeat

dbSNP: Single Nucleotide Polymorphism Database

SNP: Single Nucleotide Polymorphism

SQL: Structured Query Language

TSC: The SNP Consortium

YDB: Y Chromosome Variation DataBase


1. Introduction

1.1 Need for visualisation

The development of high-throughput techniques and large-scale studies in
the biological sciences has given rise to an explosive growth in both the
volume and the types of data available to researchers. The genome
sequences of many organisms are now known and are being annotated
constantly. Handling this ever-increasing amount of information is no
longer straightforward, and therefore computational methods have to be
applied.

Accessing annotated sequence data in textual format can be extremely
laborious and time-consuming (Bryce, 2003). A graph that uses different
colours and shapes to render every piece of genomic information
associated with a particular region would ease the process of data
interpretation and support the drawing of conclusions. Thus, there is a
need for a visualisation tool that displays the exact location and range
of each genetic element in a genomic sequence, provides the user with
facilities such as the retrieval of further information, and is easy and
intuitive to use.

1.2 Project stimulus

The driving force of this project was to provide a group of researchers at
the British Heart Foundation (BHF) Blood Pressure Group (BPG) of the
Division of Cardiovascular & Medical Sciences of the University of
Glasgow [a] with visualisation tools that would support their ongoing
experiments.

The bioinformatics problem to be solved was the visualisation of both
genotype analysis results and genome variations, focusing on Single
Nucleotide Polymorphisms (SNPs).


1.3 Genotype Analysis Data

Certain congenic rat strains have been genotyped for a series of markers
and the results were held in a Microsoft Excel® flat file (Figure 1).

Figure 1: Flat file containing genotype analysis data

All the markers shown belong to rat chromosome 2. Each rectangle
corresponds to a genotype according to the colour code shown on the
right. The coloured rectangles give a visual overview of the marker
genotypes for each rat strain. However, a researcher who wishes to place
the markers at their relative locations on the chromosome, for example to
check their distribution, is forced to read through the base pair
position of each one.

This form of data presentation does not show the real physical distances
on the chromosome of interest, which makes the interpretation of the data
incomplete.

We postulate that there is a need for an application that accepts this
flat file as input and returns a visualisation that maintains the same
colour code while rendering the markers at their chromosomal positions.
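
The idea of rendering markers at their chromosomal position can be
illustrated with a minimal sketch. It is not taken from the project code:
the class name MarkerScale, the marker names, the chromosome length and the
axis height are all assumptions chosen for illustration. It only shows how a
base pair position might be scaled linearly onto a drawing axis so that
physical distances are preserved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps marker base pair positions onto a vertical pixel axis. */
class MarkerScale {

    /** Linearly scales a base pair position to a pixel offset on an axis of the given height. */
    static int toPixel(long positionBp, long chromosomeLengthBp, int axisHeightPx) {
        if (chromosomeLengthBp <= 0 || positionBp < 0 || positionBp > chromosomeLengthBp) {
            throw new IllegalArgumentException("position outside chromosome");
        }
        return (int) Math.round((double) positionBp / chromosomeLengthBp * axisHeightPx);
    }

    public static void main(String[] args) {
        // Illustrative marker names and positions, not taken from the dissertation's data.
        Map<String, Long> markers = new LinkedHashMap<>();
        markers.put("markerA", 10_000_000L);
        markers.put("markerB", 12_500_000L);
        markers.put("markerC", 200_000_000L);
        long chromosomeLengthBp = 250_000_000L; // assumed length, for illustration only

        for (Map.Entry<String, Long> m : markers.entrySet()) {
            int y = toPixel(m.getValue(), chromosomeLengthBp, 500);
            System.out.println(m.getKey() + " -> y=" + y);
        }
    }
}
```

With such a scale, two markers that are close on the chromosome are drawn
close together, and a distant marker is drawn far away, which is the
information the spreadsheet layout cannot convey.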


1.4 Single Nucleotide Polymorphisms (SNPs)

An SNP is a substitution of one base pair at a given location on the
genome. At position 11,294,479 on human chromosome 7, for example,
some people have an A, while others have a G. On average, SNPs are
spaced every 300 bases throughout the human genome and are estimated
at nearly 10 million (Orgen, 2003). Each is a genomic landmark, a
surveyor's marker that researchers can use to chart the location of
disease genes and heritable traits, for instance.

Most SNPs reside outside coding regions, exerting potential influence on
gene regulation and expression. Many researchers value these SNPs for
use in association studies and whole-genome linkage-disequilibrium
mapping. In this type of analysis, maps of common, genome-wide
polymorphisms are used to unearth variations that are associated with, but
not causative of, medical conditions.

Some polymorphisms occur in protein-coding regions (cSNPs) and may
directly contribute to disease, disease susceptibility, and drug metabolism
by altering gene function (Orgen, 2003).

BPG researchers have already documented that the Y chromosome
harbors a locus or loci that contribute to blood pressure variation in
hypertensive and normotensive men (Charchar et al., 2003). Their next
step is to perform haplotype analysis on a general sample of Polish males
and to try to associate high blood pressure with certain combinations of
SNP allelic states, constituting a particular haplotype.

Additionally, because of its sex-determining role, the Y chromosome is
male specific and constitutively haploid. It passes from father to son and,
unlike other chromosomes, largely escapes meiotic recombination [1], [2].
The importance of escaping recombination is that haplotypes usually pass
intact from generation to generation. They change only by mutation, rather
than by the more complex reshuffling that other chromosomes experience,
and so preserve a simpler record of their history. Using binary
polymorphisms [3] with low mutation rates, such as SNPs, a unique
phylogeny can therefore easily be constructed (Jobling et al., 2003).

A storage and visualisation mechanism designed specifically for Y
chromosome SNPs, containing the publicly available knowledge, would
allow researchers to obtain information on each SNP, such as the
experimental conditions required, and would thus assist them in performing
their haplotype analysis and phylogeny construction experiments.
Additionally, the same application could accommodate the storage of their
own generated SNP data.

[1] Recombination is the formation of new combinations of alleles through
meiotic crossover. Some authors include intrachromosomal gene conversion
under this heading. As this has been shown on the Y chromosome (Rozen
et al., 2003), they prefer not to refer to it as ‘non-recombining’.

[2] Two segments (the pseudoautosomal regions) do recombine with the X,
but these amount to less than 3 Mb of its ~60-Mb length.

[3] For more information on why SNPs tend to exist in binary form, i.e.
having two alleles, see Brown, 2002.

1.5 Project Approach

This project demonstrates how a web-based application can provide
solutions to the bioinformatics issues mentioned in the previous two
sections. An existing visualisation tool was linked with data originating
from two different sources: a flat file that was read and a database that
was queried.

The tasks performed comprised surveying the relevant literature and web
resources, interviewing scientists and capturing the requirements,
developing a database schema, processing data from publicly available
databases, implementing the middleware software, and installing and
administering a web server.


2. Design

2.1 Preparation

2.1.1 Literature and Web Resource Survey

Before designing the web application, the literature and the internet were
surveyed to gather information regarding the human Y chromosome and
SNP web databases.

The reason for studying the Y chromosome literature was to gain a deeper
understanding of the BPG's ongoing experiments and to be able to evaluate
biologically the findings emerging through the construction of the
application, e.g. the number of loci found to be associated with SNPs
stored in the database.

A milestone in this background reading was the publication of an article
that analysed and interpreted, from an evolutionary perspective, the
near-complete sequence of the Y euchromatin, including thorough gene
identification (Skaletsky et al., 2003).

Querying web search engines [b], [c] for ‘SNP database’ returned, among
others, the Single Nucleotide Polymorphism Database (dbSNP) (Sherry
et al., 2001) [d], established by the National Center for Biotechnology
Information [e], and The SNP Consortium (TSC) website (The International
SNP Map Working Group, 2001) [f].

In addition, SNP data were available in the Ensembl database (Hubbard,
T. et al., 2002) [g], queried through EnsMart [h].

The previously mentioned web resources were used to prepare screenshots
of SNP mapping visualisation and information retrieval facilities
(Appendix A) that were used in the requirements capture interviews.


2.1.2 Requirements Capture [1]

The requirements capture procedure involved interviewing molecular
biologists from the BPG and other research groups. The interviewees had
either molecular genetics experience or an interest in bioinformatics. The
former described the specific requirements that the application should
meet, while the latter offered a more generic view of how an SNP storage
and visualisation mechanism should be implemented, focusing on
visualisation topics.

The interviewees were shown SNP records extracted from web databases
and were asked to point out which pieces of information they regarded as
essential for their research and which should therefore be included in the
database to be created.

In addition, the interviewees were asked to describe the desired
appearance and functionality of the application, based on their
observations and remarks on the SNP visualisation and information screen
dumps mentioned in the previous section (Appendix A).

The conclusion was the need for an application with an interactive
interface that would provide the user with an overview of the SNPs with
respect to their chromosomal positions, and with tools for displaying the
information available on each of them. This interface should be scrollable
and zoomable and should accommodate an SNP search facility. Finally, it
would be of the utmost importance to render SNPs alongside other genetic
elements of the same sequence, such as microsatellites and genes or, even
better, gene substructures.

[1] The interviews handled only the Y chromosome SNP visualisation topic.
The need for genotype analysis data visualisation emerged after the
interviews were conducted. Since the applications that would serve these
two issues independently share certain common elements, they were merged
into one.


2.2 Genotype Analysis Data Visualisation Design

In our project we decided to create an application to support genotype
analysis data visualisation. This is accommodated by the following
three-tier web application [1] (Figure 2).

Figure 2: Client-web server architecture for the genotype analysis data
visualisation

The applet on the client side will be the visualisation tool. The
middleware in the application tier will be responsible for converting the
genotype analysis data into an applet-compatible format, and the data tier
will cache client data on the web server, using files.

[1] The structure of a generic three-tier web application is explained in
Appendix B.


2.3 Y Chromosome SNP Visualisation

The other application our project gave rise to was a Y chromosome SNP
visualisation, which is accommodated by the following three-tier web
application (Figure 3).

Figure 3: Client-web server architecture for the Y chromosome SNP
visualisation

The applet on the client side will be the same visualisation tool used in
the genotype visualisation. The middleware in the application tier will be
responsible for querying the database, transforming the data format, and
storing the transformed data in an applet-compatible format.

The data tier uses a combination of a database and files.


2.4 Y Chromosome SNP Database Data Source Design

A relational database will hold SNP information. The database schema will
be similar to that of other SNP databases available on the web, but will be
modified to suit the specific needs of the application.

The Y chromosome SNP data will originate from flat file dumps of existing
web databases. A parser will be needed in order to read these flat files.
The parser will select the desired pieces of information and prepare
scripts to load the database (Figure 4). Database loading should be
performed by a database administrator.

Figure 4: Filling database with data (diagram elements: Web Database Flat
File 1, Web Database Flat File 2, Parser, Script, Database)
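
The parser-to-script step described above can be sketched as follows. This
is an illustrative sketch only, not the project's parser: the class name
InsertScriptWriter, the table name and the column names are placeholders,
and the records are made up. It shows how selected fields might be turned
into SQL INSERT statements that a database administrator could later run as
a load script.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: turns parsed fields into SQL INSERT statements for a load script. */
class InsertScriptWriter {

    /** Quotes a string literal, doubling embedded single quotes as standard SQL requires. */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Builds one INSERT statement; the table and column names are illustrative assumptions. */
    static String insertVariation(int id, String referenceId, String alleles) {
        return "INSERT INTO variation (id, reference_id, alleles) VALUES ("
                + id + ", " + quote(referenceId) + ", " + quote(alleles) + ");";
    }

    public static void main(String[] args) {
        List<String> script = new ArrayList<>();
        script.add(insertVariation(1, "rs0000001", "A/G")); // made-up record
        script.add(insertVariation(2, "rs0000002", "C/T")); // made-up record
        for (String statement : script) {
            System.out.println(statement);
        }
    }
}
```

Writing the statements to a script, rather than loading the database
directly, keeps the loading step in the hands of the database
administrator, as the design requires.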


2.5 Overall Application Design

Since the web applications for the genotype analysis data and the Y
chromosome SNP visualisation share the same presentation tier, and both
create an applet-specific flat file in the application and data tiers,
they can be merged into one application. Taking the database creation into
consideration, the whole web application can be summarised as shown in
Figure 5.

Figure 5: Overall web application design


2.6 Y Chromosome SNP Database design

The conceptual schema conceived for the SNP database is illustrated in
the Appendix C entity relationship diagram, and constitutes a modified
version of the dbSNP schema [i].

In dbSNP the term SNP refers to a broad collection of simple genetic
polymorphisms that includes, among others, SNPs, small deletions or
insertions (a.k.a. Deletion Insertion Polymorphisms, DIPs), retroposable
element insertions and microsatellite repeat variations (a.k.a. Short
Tandem Repeats, STRs) (Kitts et al., 2003).

In the local database the term ‘variation’ has been used instead. This,
together with the fact that the database is Y chromosome specific, is the
reason for naming it the Y Chromosome Variation Database (YDB).

According to the YDB conceptual schema, a variation has features such as
an internal id that is used as a unique identifier (varID), a unique dbSNP
id (varRefID), allelic states, a neighbouring sequence and, not always, a
heterozygosity estimate. In addition, it belongs to a certain class, e.g.
SNP, DIP or STR, and has a specific type, e.g. not withdrawn or artifact.

Each variation might have a mapweight, a code representing how many
times the variation occurs in the organism's genome, e.g. 1: once, 2:
twice, 3: from 3 to 9 times [1].

For every variation there is a certain validation status, e.g. validated
by multiple submissions, validated by frequency data, or not validated at
all. Each validation status is associated with a certain colour, a feature
that can be exploited by the visualisation tool [2].

[1] For more information on variation classes and types, see Kitts et al.,
2003 and the documentation available on the dbSNP home page [d].

[2] For more information on the possible validation statuses, see Kitts
et al., 2003.


A variation might be associated with one or more loci. Each association
has a sequential number and perhaps a type. Each type belongs to a
certain functional class, e.g. locus, when the variation lies in the locus
region, close to a gene feature but not in the transcript. For every
functional class YDB keeps an appropriate description. For each locus an
id and a symbol are stored. If the variation associated with a certain
locus is determined to be in a coding region, then the allelic state, the
reading frame, the translated amino acid residue and the position of the
amino acid in the peptide sequence may be defined.

Finally, a variation may have a set of contig hits, each reporting a
genomic position predicted by in silico sequence analysis. For every hit
the strand and type [1] are stored. For each contig the id and the
accession number are stored. The version of the contig that the variation
maps to is stored in the contig hit entity.

[1] Whether it is an exact base pair position, a range of base pairs, or a
position between two base pairs.


3. Implementation [1]

3.1 YDB Implementation

3.1.2 Implementing the Relational Schema

YDB was implemented using the MySQL [2] [j] Relational DataBase
Management System (RDBMS). An overview of the tables created is
available in Appendix D. The script file containing the Structured Query
Language (SQL) commands to generate these tables is
CreateAllTablesAndIndices.sql [3].

3.1.3 Inserting Data

dbSNP [4] has been used as the data source for YDB. Human Y
chromosome specific flat files were downloaded through the ftp site [k].
Pieces of the information to be hosted in YDB existed in all three flat
file types: the chromosome report, the Abstract Syntax Notation One
(ASN1) flat file, and the FastA flat file [5].

NCBIParser [6] is a Java [l] command line application that parses the
aforementioned flat files and creates InsertNonFixedData.sql [3], an SQL
script containing the appropriate commands to load variation data into
YDB.

‘NonFixedData’ refers to the fact that the data parsed and loaded are
those updated on every build of dbSNP, e.g. variations that have been
added or removed.

InsertFixedData.sql [3] contains SQL commands to load data that remain
constant in YDB, such as the different variation classes and mapweights
together with their descriptions. This file has been created manually and
complies with the relational schema of YDB and the context of dbSNP.

[1] Arguments for or against the implementation choices are presented in
the Discussion section of this report.

[2] Version 3.23.54

[3] Contained in the ‘YDB’ folder of the CD

[4] Build 115

[5] Contained in the ‘YDB\dbSNPChrYData’ folder of the CD

[6] Contained in the ‘NCBIParser’ folder of the CD

The YDB data loading procedure is summarised in Figure 6. Figure 7 is
the NCBIParser class diagram. A characteristic of NCBIParser is that it
implements pattern matching using regular expressions, a central feature
of the Perl programming language (Leader, 2003) that became available in
the Java 2 Standard Edition Software Development Kit Version 1.4
(Hitchens, 2002).
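
As an illustration of regular-expression matching with the java.util.regex
package, the following minimal sketch extracts two fields from a line of
text. It does not reproduce NCBIParser: the class name FlatFileLineMatcher
and the line layout (an rs identifier, a tab, then a numeric position) are
assumptions made for the example and do not describe the real dbSNP flat
file formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of regular-expression based field extraction. */
class FlatFileLineMatcher {

    // Assumed layout for illustration: "rs" plus digits, a tab, then a numeric position.
    private static final Pattern LINE = Pattern.compile("^(rs\\d+)\\t(\\d+)$");

    /** Returns {identifier, position} for a matching line, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // the caller decides how to report invalid lines
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] fields = parse("rs123\t4567"); // made-up record
        System.out.println(fields[0] + " at position " + fields[1]);
        System.out.println(parse("not a record") == null); // non-matching lines yield null
    }
}
```

Returning null for non-matching lines is one simple way to let the caller
flag missing or invalid values while parsing; throwing an exception would be
an equally reasonable choice.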


Figure 6: YDB implementation summary

Figure 7: NCBIParser Class Diagram. The NCBIParser class holds the
methods for parsing the dbSNP flat files and creating the SQL script file.
The rest of the classes are used to create objects that store the
information on all the variations.


3.2 Presentation Tier

The front end of the application is a Hyper Text Mark up Language
(HTML) web page available at http://balabio.dcs.gla.ac.uk/pafilisv/,
accessible by any web browser (Figure 8). An experimental version of
DerBrowser (Grigoriev, 1998), a Java applet, is the visualisation tool for
both marker genotypes and Y chromosome variations.

If the user chooses to visualise a genotype analysis data file, he/she is
prompted to upload the Excel file containing the genotyping experiment
results (Figure 9).

If the user chooses to view Y chromosome variations, YDB is accessed and
the visualisation tool is invoked.

Figure 8: Visualisation Mechanisms Supporting Genotype Analysis, entry page.


Figure 9: Data file upload interface

3.3 Application Tier

3.3.1 Genotype Analysis Data Visualisation Middleware

After the Excel file has been submitted, a Hyper Text Transfer Protocol
(HTTP) POST request invokes a Common Gateway Interface (CGI) [m]
script written in the Perl language [n] [1]. Figure 10 summarises the
series of actions performed by this script.

[1] Contained in the ‘WebPage\MarkerMapVisualisation\CGI’ folder of the CD


Figure 10: Implementing marker genotype visualisation business logic

Initially the Excel file is saved on the server side. Then a Perl script,
written by Ashley Pond [o] [1], reads the Excel file and converts it into
a tab-delimited text file. Subsequently, the tab-delimited text file is
parsed by JMarkerMapParser [2], a Java command line tool, and a
DerBrowser-compatible flat file is created. Finally, the Perl CGI script
returns an HTML page (Figure 11) where the marker genotypes are
rendered by DerBrowser.
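
The step of parsing the tab-delimited text file can be pictured with the
sketch below. It is not JMarkerMapParser: the class name
TabDelimitedMarkerReader and the column layout (marker name, base pair
position, then one genotype code per strain) are assumptions made for
illustration, and the DerBrowser output format is deliberately left out
because it is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: reads one row of a tab-delimited marker genotype file. */
class TabDelimitedMarkerReader {

    /** One marker row: name, base pair position, and one genotype code per strain. */
    static final class MarkerRow {
        final String name;
        final long positionBp;
        final List<String> genotypes;

        MarkerRow(String name, long positionBp, List<String> genotypes) {
            this.name = name;
            this.positionBp = positionBp;
            this.genotypes = genotypes;
        }
    }

    /** Parses a row under the assumed layout: name, position, then genotype codes. */
    static MarkerRow parseRow(String line) {
        String[] cells = line.split("\t", -1); // limit -1 keeps trailing empty cells
        if (cells.length < 3) {
            throw new IllegalArgumentException("expected name, position and at least one genotype");
        }
        long position = Long.parseLong(cells[1].trim()); // throws NumberFormatException on bad input
        List<String> genotypes = new ArrayList<>(Arrays.asList(cells).subList(2, cells.length));
        return new MarkerRow(cells[0].trim(), position, genotypes);
    }

    public static void main(String[] args) {
        MarkerRow row = parseRow("markerA\t1200\tA\tB\t"); // made-up row; last strain has no call
        System.out.println(row.name + " @ " + row.positionBp + " bp, strains: " + row.genotypes.size());
    }
}
```

Keeping empty trailing cells means a strain without a genotype call still
occupies its column, so the columns stay aligned with the strains.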




[1] Contained in the ‘WebPage\MarkerMapVisualisation\XLS2TXT’ folder of the CD

[2] Contained in the ‘JMarkerMapParser’ folder of the CD


Figure 11: Genotype Analysis Data Visualisation. The colour of each box
corresponds to a different genotype according to the colour code shown on
top. The scale on the left-hand side corresponds to the physical
chromosome length. Individual rat strains are shown as columns.

3.3.2 Y Chromosome Variation Visualisation Middleware

The logic for visualising Y chromosome variations is implemented by a
servlet [p], a Java server-side program that contains methods to respond
to HTTP requests, to connect to databases and to return a new HTML
page.


CreateDBrFlatFileServlet.java [1] implements the functionality displayed
in Figure 3. It connects to YDB using Java DataBase Connectivity (JDBC)
[2] [q], queries the name, position and validation status colour of every
Y chromosome variation (Appendix E), writes these data to a
DerBrowser-compatible file and returns an HTML page (Figure 12) with the
visualisation.

For the servlets to be functional, they must be contained inside a
servlet engine running on a web server. Jigsaw 2.2 [r], both a web server
and a servlet engine, was used.

Figure 12: Y Chromosome Variation Visualisation, with the physical
chromosome scale on the left, and SNPs shown as black boxes in the first
column.

[1] Contained in the ‘WebPage\YDBServlets’ folder of the CD

[2] JDBC, the application programming interface for connecting Java
programs with database systems


3.4 Implementation Summary

The implementation procedure described in the previous sections is
summarised in Figure 13.

Figure 13: Overall Implementation Summary


4. Testing and Evaluation

4.1 Testing

The whole project comprised a series of incremental steps. At each stage
the correctness of newly added code was tested, using manual comparison
of input and output data.

The dbSNP flat files were checked for missing or invalid data, e.g. a
missing value or a word instead of a number. This check was performed
partly during the parsing and partly during the YDB data loading.

Another possible source of errors is the format of the uploaded Excel file
containing marker genotype data. The relevant web page (Figure 9) gives
clear directions regarding the specifications that the Excel file should
comply with. A file that can be used as a template is available for
downloading as well.

For both NCBIParser and JMarkerMapParser no ready-made synthetic control
data samples were available. As an alternative, flat files were generated
containing every different combination of data that should be dealt with.
The parsers were fine-tuned using these synthetic data files and then
applied to the real data.

The servlet associated with the Y chromosome variation visualisation was
initially developed using a stand-alone Tomcat 4.0 [s] servlet engine and
was transferred to Jigsaw after it had been debugged. Prompts were added
to the server command line output in order to provide the web server
administrator with proper feedback (Figure 14).

Finally, when the application was released, it was tested at the BPG
location in the Western Infirmary in order to identify any possible access
or configuration issues.


Figure 14: Providing the administrator with feedback on the servlet execution

4.2 Evaluation

A molecular biologist and a bioinformatician of the BPG evaluated the web
application. They were given some time to test both the marker genotype
and the Y chromosome variation visualisation and were asked a number of
specific questions (Appendix F) focusing on user friendliness,
visualisation issues, problems and deficiencies, and suggestions for
further improvement.



30

5. Discussion

5.1 Design Decisions

5.1.1 Only dbSNP as the Data Source for YDB

The initial concept was that the YDB data would o
riginate from more than
one web database. Considering the time limit for the completion of the
project, we decided on building a prototype with equally developed
components and propose data integration from several web databases as
a future extension
1
.

Th
e selection of dbSNP as the data source is supported by the fact that
dbSNP is integrated with other large public databases,
via the

LocusLink
database (Sher
r
y et al., 2001). The existence of such links can assist the
identification of entries in different

databases that refer to the same
variation and, thus, support data integration.

The plethora of documentation and guidance available on the dbSNP web
site is another reason for selecting dbSNP. The topics ranged from the
biological significance of genome
variation to technical details of the
database, easing this way the gathering of background information and
database design.

The fact that dbSNP flat files are divided by chromosome, suited the
purpose of the application to focus only on Y chromosome.

5.1
.2 Not only SNPs

The information on types of genomic variation other than SNPs could have
been disregarded, since BPG researchers are mainly interested in SNPs.
However, we decided to retain it because, as will be demonstrated in
the ‘Further Improvement’ section, DerBrowser can display these extra
pieces of information in a biologically significant way.




¹ Data integration as a further improvement is presented in the next chapter of the
report.



5.2 Implementation Decisions

5.2.1 MySQL as the RDBMS

MySQL was selected as the RDBMS because it offers stability,
support, and low cost (Suehring, 2002). This means that if, for example,
BPG wants to install MySQL locally, this will require neither significant
funds nor administrative personnel.

The advantages and disadvantages of MySQL compared to other RDBMSs are
shown in Table 1.


Table 1: Comparison of SQL Implementations¹

5.2.2 Parsers in the Java Language

The parsers needed for the application have been written in Java. This is
due to our familiarity with this language, which exploits the advantages of
object oriented programming and allowed us to create the required
data structures and algorithms easily.




¹ From Suehring, 2002. TCO is the abbreviation of Total Cost of Ownership.



5.2.3 Both CGI and Servlets as Middleware Technology¹

The middleware for both the marker genotype and the Y chromosome
variation visualisation could have been implemented with either CGI or
Servlets.

The older and slower² CGI (Speegle, 2002), written in Perl, was selected
to implement marker genotype visualisation. No complicated tasks, e.g.
database connectivity, were required in this part of our work; this,
combined with the ease with which Perl invokes command line
applications and scripts, was the reason for this choice.

For the more demanding Y chromosome variation visualisation, Java
Servlet technology has been used. Since servlets are Java programs, they
have access to the entire family of Java Application Programming
Interfaces and receive all the benefits of the Java language, including
portability and crash protection.

In addition, because one of the application requirements is the interaction
between the applet (DerBrowser) and
the database (YDB), servlets can be
used in order to guarantee safety and to overcome security issues
(Speegle, 2002).

5.2.4 DerBrowser as the Visualisation Tool

DerBrowser is a visualisation tool that conforms to the requirements we
identified during our project. It is an applet designed to display objects on a
genomic map. It can query a data source about a selected object, find an
object by name, and provide the user with zooming and scrolling
capabilities.

In addition, DerBrowser is written in an older version of Java
and is thus supported by most web browsers.




¹ A comparison of middleware technologies is available in Speegle, 2002, Preface, pages x, xi.

² Slower, since it greatly increases the number of processes running on a server.



5.3 Application Criticism: Achievements and Drawbacks

The data processing and visualisation mechanisms developed within the
scope of this project performed well only in providing BPG researchers with
visualisation tools for viewing their genotype analysis results. As shown in
Figure 11, the desired result of viewing colour-coded marker genotypes
with respect to the physical location of each marker has been achieved.
However, it has to be mentioned that extensive use of zooming and
scrolling, together with an ‘artificial increase’ in the length of the
markers¹, was necessary.
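
The ‘artificial increase’ in marker length can be illustrated with a minimal Java sketch. It assumes 1-based chromosome coordinates and a display width chosen by the caller; the class name MarkerRange, the method widen and the sample values are hypothetical and are not taken from the project code.

```java
class MarkerRange {

    /**
     * Turns a point marker into a displayable range centred on its position.
     * Returns {start, end}, extending width / 2 bases on either side.
     * The start is clamped to 1 (assumption: coordinates are 1-based).
     */
    static long[] widen(long position, long width) {
        if (width < 1) {
            throw new IllegalArgumentException("width must be positive");
        }
        long half = width / 2;
        long start = Math.max(1, position - half);
        long end = position + half;
        return new long[] {start, end};
    }

    public static void main(String[] args) {
        // Hypothetical marker at position 1,000,000 drawn 20,000 bases wide.
        long[] range = widen(1_000_000L, 20_000L);
        System.out.println(range[0] + "-" + range[1]); // prints 990000-1010000
    }
}
```

The wider the assigned range, the less zooming is needed to see a marker, at the cost of overstating its physical extent on the map.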


Figure 15: Y Chromosome Variation Visualisation in the highest zoom possible




¹ Markers by definition indicate a certain point in the genome. In order to improve
visualisation they have been assigned a range whose centre is the genomic position
that they indicate.



In contrast, the Y chromosome variation visualisation (Figures 12 and
15) did not have the expected outcome. Most of the variation names did
not appear at all and many variations, due to their proximity, appeared as
a smear. In addition, the requirement of querying YDB and retrieving
variation-specific information was not met.

However, from a broader point of view, the achievement was the
implementation of an application that can serve more than one purpose
and provides a fundamental storage and visualisation facility that can be
extended in many directions.

5.4 Further Improvement

This application demonstrated clearly that DerBrowser is more
useful for displaying large objects than small genomic variations.
The zooming capacity needs to be increased, and this
should be accompanied by smooth scrolling behaviour.

The name display perhaps should be reconsidered, along with the
suggestion of colouring not only the genetic objects but their labels as well.

Before viewing Y chromosome variations, the user should be prompted to
specify what exactly he/she wishes to view and in which way, through
drop-down menus, radio buttons and check boxes.

For example, he/she might select only the validated variations, or the
variations that exist in a certain region, or those that have a certain
degree of heterozygosity. Going further, he/she may choose to view a
certain kind of variation (e.g. SNPs) in one of the DerBrowser stripes and
another kind of variation (e.g. microsatellites) in the adjacent stripe.



The latter could be of utmost importance. If the user could choose to view
certain genomic variations, such as SNPs, in one stripe and, concurrently,
other genetic elements, such as genes, in the next one, a better
understanding of the positions and significance of the SNPs could be
gained. However, this implies that the database should contain all those
pieces of information, and thus would constitute an integrated biological
database rather than a database of genomic variation. Data
integration, though, despite the efforts to solve it, will remain a
difficult problem for the foreseeable future (Stein, 2003).
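
The selection criteria suggested in this section (validated variations only, a chromosome region, a minimum heterozygosity, and one stripe per kind of variation) can be sketched in Java as follows. This is only an illustration under stated assumptions: the Variation class, the boolean validated flag (a simplification of the YDB validation status), the method names and the sample values are all hypothetical and are not part of the implemented prototype.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VariationFilter {

    /** Hypothetical in-memory view of one variation; not the project's own class. */
    static final class Variation {
        final String name;       // display name, invented for this sketch
        final String varClass;   // e.g. "snp" or "microsatellite"
        final boolean validated; // simplification of the YDB validation status
        final double hetero;     // heterozygosity
        final long startPosChr;  // start position on the chromosome

        Variation(String name, String varClass, boolean validated,
                  double hetero, long startPosChr) {
            this.name = name;
            this.varClass = varClass;
            this.validated = validated;
            this.hetero = hetero;
            this.startPosChr = startPosChr;
        }
    }

    /**
     * Keeps validated variations that start inside [regionStart, regionEnd]
     * and have at least minHetero heterozygosity, grouped into one stripe
     * per variation class (insertion order of the classes is preserved).
     */
    static Map<String, List<Variation>> selectByStripe(
            List<Variation> all, long regionStart, long regionEnd, double minHetero) {
        Map<String, List<Variation>> stripes = new LinkedHashMap<>();
        for (Variation v : all) {
            boolean inRegion = v.startPosChr >= regionStart && v.startPosChr <= regionEnd;
            if (v.validated && inRegion && v.hetero >= minHetero) {
                stripes.computeIfAbsent(v.varClass, k -> new ArrayList<>()).add(v);
            }
        }
        return stripes;
    }

    /** Invented sample rows, used only to exercise the filter. */
    static List<Variation> sampleData() {
        List<Variation> sample = new ArrayList<>();
        sample.add(new Variation("varA", "snp", true, 0.40, 2_500_000L));
        sample.add(new Variation("varB", "snp", false, 0.45, 2_600_000L));
        sample.add(new Variation("varC", "microsatellite", true, 0.30, 2_700_000L));
        sample.add(new Variation("varD", "snp", true, 0.05, 2_800_000L));
        sample.add(new Variation("varE", "snp", true, 0.50, 9_000_000L));
        return sample;
    }

    public static void main(String[] args) {
        Map<String, List<Variation>> stripes =
                selectByStripe(sampleData(), 2_000_000L, 3_000_000L, 0.2);
        for (Map.Entry<String, List<Variation>> e : stripes.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " variation(s)");
        }
    }
}
```

In a web application the three thresholds would come from the drop-down menus and buttons described above, and each map entry would be rendered in its own stripe.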



6. Conclusion

This project demonstrated how the features of multi-tier web applications
can be exploited in order to serve the purposes of bioinformatics. The
separation between the presentation layer, the application logic and the
data storage layer supported the segmentation of the problem into smaller
parts and assisted the implementation of components which deal with
each one of the layers specifically.

Existing visualisation tools and web interfaces were used in order to
present biological data in a more explicit manner than would be possible
within a spreadsheet program. However, the existing software had to be
modified and adjusted so that it could serve the exact needs of the user,
as defined during the software requirements analysis.

A variety of middleware technologies were used to implement the
application logic. Those included CGI scripts and Java Servlets.

We used a MySQL database management system to manipulate biological
data. We identified data integration as a further issue; addressing it would
provide biologists with valuable complementary and comparative
information on genotypes and genetic variation.

Implementing a web application in order to satisfy the needs of the BPG
research group was not only a software engineering task but required a
significant amount of problem solving and intellectual scrutiny of the data
and the research practice.



References

A. Literature

Brown, T.A. (2002) Genomes Second Edition. BIOS, pp: 131, Box 5.1.

Charchar, F. et al. (2002) The Y Chromosome Effect on Blood
Pressure in Two European Populations. Hypertension, 39: 353-356.
[HTML]

Grigoriev, A. et al. (1998) Distributed environment for physical map
construction. Bioinformatics, 14, 242-258.

Hitchens, R. (2002) Java™ NIO. O’Reilly, ISBN: 0-596-00288-2.

Hubbard, T. et al. (2002) The Ensembl genome database project.
Nucleic Acids Research 30, 1, 38-41.

Hunter, B. (2003) Gene Visualisation And Comparison Tool. Final
Year Project Report, Dept. Of Comp. Science, Uni. Of Glasgow, pp: 3,
4.

Jobling, M., Tyler-Smith, C. (2003) The Human Y Chromosome: An
Evolutionary Marker Comes Of Age. Nature 424, 598-612.

Kitts, A., Sherry, S. (2003) The Single Nucleotide Polymorphism
Database (dbSNP) of Nucleotide Sequence Variation. NCBI
Handbook, Chapter 5. [HTML]

Leader, D. (2003) David’s Perl. Perl Module Notes MRes
Bioinformatics Glasgow University.

Ogren, M. (2003) Whole-Genome SNP Genotyping. The
Scientist, 17, 11, 42. [HTML]

Rozen, S. et al. (2003) Abundant gene conversion between arms of
massive palindromes in human and ape Y chromosomes.

Sherry, S.T. et al. (2001) dbSNP: the NCBI database of genetic
variation. Nucleic Acids Research. 29: 308-311.

Skaletsky, H. et al. (2003) The male-specific region of the human Y
chromosome: a mosaic of discrete sequence classes. Nature 423,
825-837.

Speegle, G. (2002) JDBC: Practical Guide For Java Programmers.
Morgan Kaufman Publishers, ISBN: 1-55860-736-6.

Stein, L. (2003) Integrating Biological Databases. Nature Reviews
Genetics 4, 337-345.

Suehring, S. (2002) MySQL Bible. Wiley, pp: 7-14,
ISBN: 0-7645-4932-4.

The International SNP Map Working Group. (2001) A map of human
genome sequence variation containing 1.4 million SNPs. Nature 409,
928-933.



B. Internet Resources

[a] British Heart Foundation Blood Pressure Group:
http://www.medther.gla.ac.uk/bhf/index.htm

[b] Google Search Engine:
http://www.google.com

[c] Scirus Scientific Information Search Engine:
http://www.scirus.com

[d] Single Nucleotide Polymorphism Database (dbSNP) Home Page:
http://www.ncbi.nlm.nih.gov/SNP/

[e] National Center for Biotechnology Information Home Page:
http://www.ncbi.nlm.nih.gov/

[f] The SNP Consortium (TSC) Home page:
http://snp.cshl.org/

[g] Ensembl Genome Browser:
http://www.ensembl.org

[h] Ensmart Data Retrieval Tools Set:
http://www.ensembl.org/EnsMart/

[i] dbSNP Schema:
ftp://ftp.ncbi.nih.gov/snp/mssql/schema/erd_dbSNP.pdf
and changes:
http://www.ncbi.nlm.nih.gov/SNP/snp_schemaChange.htm

[j] MySQL Home Page:
http://www.mysql.com

[k] dbSNP FTP site:
ftp://ftp.ncbi.nih.gov/snp/

[l] The source for Java Technology:
http://java.sun.com/

[m] CGI Information Page:
http://hoohoo.ncsa.uiuc.edu/cgi/

[n] Perl Home Page:
http://www.perl.com/

[o] Excel to Tab delimited Text File Converter:
http://sedition.com/perl/excel_to_delim.html

[p] Java Servlet Technology:
http://java.sun.com/products/servlet/

[q] JDBC Technology:
http://java.sun.com/products/jdbc/

[r] Jigsaw Home Page:
http://www.w3.org/Jigsaw/

[s] Tomcat Home page:
http://jakarta.apache.org/tomcat/




Appendix A

Screen shots of SNP mapping visualisation and information retrieval
facilities.


Ensembl (Ensmart)





SNP ‘rs3848982’ following the link to UCSC Genome Browser from the dbSNP record



Appendix B

Structure of a generic three tier web application.


Abstract structure of a three tier web architecture. There is a clear distinction among
the presentation layer that displays data to the user, the application layer
that implements business logic and the data layer where data are stored.



Appendix C

The conceptual schema of YDB.

[Figure: entity-relationship diagram of YDB. Entities: Variation,
VariationClass, VariationType, ValidationStatus, MapWeight, FunctionClass,
Locus, AssociatedLocus, Contig, CtgHit, Strand, LocationType, Chromosome.
Relationship labels: is, has, belongsTo, regarding, maps. The attributes of
each entity are listed in Appendix D.]



Appendix D

Overview of YDB Relational Schema

1. VariationClass (varClassName, varClassDescription)

2. VariationType (varTypeName, varTypeDescription)

3. MapWeight (mapWeigthID, mapWeightDescription)

4. ValidationStatus (validStatusID, validStatusAbbr,
   validStatusDescription, validStatusColour)

5. FunctionClass (fxnClassName, fxnClassDescription)

6. Variation (varID, varRefID, varClass, alleles, hetero, hetSError,
   validStatus, varitype, varMapWeight, sequence)

7. Locus (locusID, locusSymbol)

8. Strand (strSymbol, strDescription)

9. Contig (contigID, contigAcc)

10. AssociatedLocus (assocSeqNo, asVarID, asLocusID, fxnClass,
    readFramePos, varAllele, varResidue, aaPosition)

11. CtgHit (ctgHitID, variationID, ctgID, ctgVersion, startPosCtg,
    endPosCtg, ctgStrand, ctgLocationType, chrom, startPosChr,
    endPosChr)

12. LocationType (locatTypeID, locatTypeName)



Appendix E

SQL Query

SELECT startPosChr, endPosChr, varRefID, varID, validStatusColour
FROM Variation, CtgHit, ValidationStatus
WHERE startPosChr <> 0
  AND varID = variationID
  AND validStatus = validStatusID;
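
To make the three conditions of the query explicit, the following Java sketch emulates it on small in-memory collections instead of a live MySQL connection. It is an illustration only: the class names, the column types (integer IDs, string varRefID), the uniqueness of varID and all sample rows are assumptions made for the example and are not taken from YDB.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AppendixEQuery {

    /** Subset of the Variation columns used by the query (types assumed). */
    static final class Variation {
        final int varID; final String varRefID; final int validStatus;
        Variation(int varID, String varRefID, int validStatus) {
            this.varID = varID; this.varRefID = varRefID; this.validStatus = validStatus;
        }
    }

    /** Subset of the CtgHit columns used by the query (types assumed). */
    static final class CtgHit {
        final int variationID; final long startPosChr; final long endPosChr;
        CtgHit(int variationID, long startPosChr, long endPosChr) {
            this.variationID = variationID; this.startPosChr = startPosChr; this.endPosChr = endPosChr;
        }
    }

    /** Returns tab-separated rows: startPosChr, endPosChr, varRefID, varID, validStatusColour. */
    static List<String> run(List<Variation> variations, List<CtgHit> hits,
                            Map<Integer, String> colourByStatusID) {
        Map<Integer, Variation> byID = new HashMap<>();
        for (Variation v : variations) {
            byID.put(v.varID, v); // assumes varID is unique
        }
        List<String> rows = new ArrayList<>();
        for (CtgHit h : hits) {
            if (h.startPosChr == 0) continue;                    // WHERE startPosChr <> 0
            Variation v = byID.get(h.variationID);               // AND varID = variationID
            if (v == null) continue;
            String colour = colourByStatusID.get(v.validStatus); // AND validStatus = validStatusID
            if (colour == null) continue;
            rows.add(h.startPosChr + "\t" + h.endPosChr + "\t" + v.varRefID
                    + "\t" + v.varID + "\t" + colour);
        }
        return rows;
    }

    /** Runs the emulated query on invented sample rows. */
    static List<String> demo() {
        List<Variation> variations = new ArrayList<>();
        variations.add(new Variation(1, "ref-1", 2));
        variations.add(new Variation(2, "ref-2", 1));
        List<CtgHit> hits = new ArrayList<>();
        hits.add(new CtgHit(1, 2_500_000L, 2_500_000L));
        hits.add(new CtgHit(2, 0L, 0L)); // dropped by the startPosChr <> 0 condition
        Map<Integer, String> colours = new HashMap<>();
        colours.put(1, "red");
        colours.put(2, "green");
        return run(variations, hits, colours);
    }

    public static void main(String[] args) {
        for (String row : demo()) {
            System.out.println(row);
        }
    }
}
```

Each returned row carries what a map display needs for one variation: a chromosome range, a name and a colour derived from the validation status.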



Appendix F

Questionnaire


Marker Map Visualisation

How easy was it for you to use the module? (Please tick as appropriate)

[ ] Very easy   [ ] Relatively easy   [ ] Intermediate   [ ] Hard   [ ] Very hard

Were the titles used the appropriate ones? (Yes/No)

If No please suggest some alternative ones:
………………………………………………………………………………………
………………………………………………………………………………………

How informative was the visualisation?

[ ] Very much   [ ] Adequately   [ ] Intermediate   [ ] Inadequately   [ ] Not at all

Did you identify any problems or malfunctions?
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………

What would you like to suggest as an amendment or a further
improvement?
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………








Y Chromosome Variation

How easy was it for you to use the module? (Please tick as appropriate)

[ ] Very easy   [ ] Relatively easy   [ ] Intermediate   [ ] Hard   [ ] Very hard

Were the titles used the appropriate ones? (Yes/No)

If No please suggest some alternative ones:
………………………………………………………………………………………
………………………………………………………………………………………

How informative was the visualisation?

[ ] Very much   [ ] Adequately   [ ] Intermediate   [ ] Inadequately   [ ] Not at all

Did you identify any problems or malfunctions?
………………………………………………………………………………………
………………………………………………………………………………………

What would you like to suggest as an amendment or a further
improvement?
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………

Would you support the creation of a Y Chromosome Integrated Database?
Please support your answer with a few points or examples.
………………………………………………………………………………………
………………………………………………………………………………………








Both Applications

DerBrowser was not originally built in order to render genetic markers. Are
you satisfied with its performance? Please support your answer with a few
points or examples.
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………

Do you think that the visualisation software should be extended further?
Please support your answer with a few points or examples.
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………
………………………………………………………………………………………