Bioinformatics PM

Oct 4, 2013 (3 years and 2 months ago)

Bioinformatics (continued)

This session:

Sequence similarity (BLAST)

Sequence alignment

Genome Browsers (ENSEMBL)

Biomart

(R functions)

Why analyse DNA sequences?

To understand evolutionary relationships (systematics)

To understand evolutionary processes (e.g. selection, drift)

To predict gene function

To determine gene structure (e.g. coding regions, introns etc.)


Sequence similarity

ATCGTGGTCTGCCTG
||||||  |||||||
ATCGTG-ACTGCCTG

ATCGTGGTCTGCCTG
||||||     |
ATCGTGACTGCCTG-

The most important bioinformatics tools deal with pairwise
alignments between two sequences (amino acid or DNA).


Sequence alignment ≈ database searching


Alternative alignments of two sequences. The first one is
‘better’. Why? How do we formalise this process?

Sequence alignments (continued)

To work out the best possible alignments we
can award scores for:

Matches (A aligned with A): Score = +3

Mismatches (A aligned with T): Score = -1

Gaps (A aligned with -): Score = -5
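The scoring scheme can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` is made up here, and the three scores are the ones given on the slide. It assumes the two sequences have already been aligned, with `-` marking a gap.

```python
# Scores taken from the slide: match +3, mismatch -1, gap -5.
MATCH, MISMATCH, GAP = 3, -1, -5


def score_alignment(top, bottom):
    """Add up the column scores of two aligned sequences ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP        # one sequence has a gap in this column
        elif a == b:
            total += MATCH      # identical bases
        else:
            total += MISMATCH   # different bases
    return total


print(score_alignment("ATCGTGGTCTGCCTG", "ATCGTG-ACTGCCTG"))  # 33
```

Any other pair of aligned strings of equal length can be scored the same way, which makes it easy to compare alternative alignments.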

Calculating Match Scores

ATCGTGGTCTGCCTG
||||||  |||||||
ATCGTG-ACTGCCTG

Score = 3+3+3+3+3+3-5-1+3+3+3+3+3+3+3 = 33

ATCGTGGTCTGCCT
||||||     |
ATCGTGACTGCCTG

Score = 3+3+3+3+3+3-1-1-1-1-1+3-1-1 = 14
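Scoring by hand only compares alignments we have already written down. One standard way to find the best possible score over all alignments is dynamic programming (the Needleman-Wunsch method, which these slides do not name). The sketch below is a minimal version that assumes the slide's scores and a fixed cost per gap position; `best_global_score` is a hypothetical name, and it returns only the score, not the alignment itself.

```python
# Scores taken from the slides: match +3, mismatch -1, gap -5.
MATCH, MISMATCH, GAP = 3, -1, -5


def best_global_score(a, b):
    """Best end-to-end alignment score of a and b (Needleman-Wunsch, linear gap cost)."""
    # prev[j] = best score for aligning the part of a seen so far with b[:j]
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP]  # a[:i] aligned against nothing: i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            up = prev[j] + GAP        # a[i-1] aligned with a gap
            left = curr[j - 1] + GAP  # b[j-1] aligned with a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(best_global_score("ATCGTGGTCTGCCTG", "ATCGTGACTGCCTG"))  # 33
```

For this pair the best achievable score is 33, the same as the gapped alignment scored by hand, so no alignment of these two sequences does better under this scheme.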

BLAST

Basic Local Alignment Search Tool

Identifies parts of sequences that are highly similar

Starts by identifying a region of defined size (usually 11 bp) that is 100%
identical. This size is known as the word size.

ATGGTGACTGC CCCTAATGCAT
||||||||||| | ||| |||||
ATGGTGACTGC CTCTAGTGCAT
*********** * *** *****

BLAST extends the alignment from this 11bp sequence until the match score begins
to decline
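The seed-and-extend idea can be sketched as a toy example. This is not how BLAST is actually implemented (the real program also extends to the left, allows gaps and uses statistical cut-offs); the function names and the `drop` parameter are made up for illustration, and the match/mismatch scores are borrowed from the earlier scoring slides.

```python
MATCH, MISMATCH = 3, -1  # scores reused from the alignment-scoring slides


def find_seed(query, subject, word_size=11):
    """Return (query_pos, subject_pos) of the first exact shared word, or None."""
    words = {}
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], i)  # remember first position of each word
    for j in range(len(subject) - word_size + 1):
        i = words.get(subject[j:j + word_size])
        if i is not None:
            return i, j
    return None


def extend_right(query, subject, qi, sj, drop=4):
    """Extend without gaps from (qi, sj); stop once the running score has
    fallen `drop` below its best value. Returns the number of columns kept."""
    score = best = 0
    best_len = 0
    n = 0
    while qi + n < len(query) and sj + n < len(subject):
        score += MATCH if query[qi + n] == subject[sj + n] else MISMATCH
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= drop:
            break  # the match score has declined too far
    return best_len


query = "ATGGTGACTGCCCCTAATGCAT"
subject = "ATGGTGACTGCCTCTAGTGCAT"
qi, sj = find_seed(query, subject)
extra = extend_right(query, subject, qi + 11, sj + 11)
print(qi, sj, extra)  # 0 0 11
```

With the two sequences from the slide, the 11 bp seed is found at the start of both sequences and the extension runs through the two mismatches to the end, because the running score keeps recovering.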

BLAST on the NCBI site

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Where you paste the sequence

Database you wish to search (default is human, NR useful)

Option to restrict organisms searched

Some common parameter settings

Entering sequences into BLAST

Further optimisation of parameter settings

BLAST results

Graphical display of the quality of the hit

BLAST results (cont)

List of matching sequences

GenBank identifier

Name of sequence

Bit Score (measure of similarity)

Expectation Value (analogous to a P value)


G: Gene

U: UniGene

Gene expression database

BLAST results (cont)

Expect Score (lower the better)

Bit Score (higher the better)

Query sequence (our sequence)

Subject sequence (the one it matched)
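Why a higher bit score goes with a lower Expect score can be seen from the usual relationship between the two, E = m × n × 2^(-S), where S is the bit score, m the query length and n the database length. The sketch below assumes that formula and uses raw lengths; BLAST itself works with adjusted ("effective") lengths, so its reported values will differ somewhat. The function name is hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate number of hits with at least this bit score expected by chance."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves the number of chance hits expected.
for s in (20, 30, 40):
    print(s, expect_value(s, query_len=100, db_len=1_000_000))
```

The same bit score therefore gives a larger (worse) Expect score when the database searched is bigger.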


Sequence Alignments

atggctatttgaccatga

ctatttgaccatgattg

Very often 2 sequences are highly similar but start at different positions

They give the appearance of being dissimilar

Solution: to align them


Can be done manually (sometimes) or computationally
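For a simple case like this one, where the sequences differ only in where they start, the computational approach can be sketched by sliding one sequence along the other and counting identical positions. This is an illustrative sketch only (no gaps are allowed, and `best_offset` is a made-up name), not what alignment programs such as ClustalW actually do.

```python
def best_offset(a, b):
    """Slide b along a without gaps; return (offset, matches) for the position
    with the most identical columns. A positive offset means b starts
    `offset` positions after the start of a."""
    best = (0, -1)
    for offset in range(-len(b) + 1, len(a)):
        matches = sum(
            1
            for i, ch in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == ch
        )
        if matches > best[1]:
            best = (offset, matches)
    return best


a = "atggctatttgaccatga"
b = "ctatttgaccatgattg"
offset, matches = best_offset(a, b)
print(offset, matches)   # 4 14
print(a)
print(" " * offset + b)  # padding only makes sense for a non-negative offset
```

Shifting the second sequence four positions to the right lines up 14 identical bases.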

Sequence Alignments

atggctatttgaccatga
    ctatttgaccatgattg


http://www.ebi.ac.uk/clustalw/

Alignments with ClustalW

Sequences can be pasted or uploaded

ClustalW output (i)

A list of pairwise comparisons between all sequence combinations

Length of alignment

% similarity

ClustalW output (ii)

Note: not in fasta format

ClustalW output (iii)

Really just a visual aid

ClustalW output (iv)

Jalview

Tool to visualise, edit and export alignments


Genome Browsers

Tool for:

Examining particular parts of the genome

BLASTing sequences against whole genomes

http://www.ebi.ac.uk/ensembl/

http://genome.ucsc.edu/cgi-bin/hgGateway

Blasting against Genome Browsers

Relates our sequence to a specific region of the genome

Can identify linked genomic features

Finally: limitations of web software

BLAST

Time-consuming

Limited tweaking of input parameters

Limited no. of files BLASTed at a time

Cannot customise databases


Solution: Stand-alone BLAST
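As a rough idea of what stand-alone use looks like, the sketch below builds command lines for the NCBI BLAST+ programs `makeblastdb` and `blastn`. It assumes BLAST+ is installed; the file and database names are placeholders, and the helper function names are invented for this example. The commands are only printed here, not run.

```python
def makeblastdb_command(fasta_file, db_name):
    """Command to build a custom nucleotide database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_file, "-dbtype", "nucl", "-out", db_name]


def blastn_command(query_fasta, db_name, out_file, word_size=11, evalue=1e-5):
    """Command to search a nucleotide database with user-chosen parameters."""
    return [
        "blastn",
        "-query", query_fasta,   # may contain many sequences
        "-db", db_name,
        "-out", out_file,
        "-word_size", str(word_size),
        "-evalue", str(evalue),
        "-outfmt", "6",          # tab-separated table of hits
    ]


# Placeholder file names, for illustration only.
print(" ".join(makeblastdb_command("my_sequences.fasta", "my_db")))
print(" ".join(blastn_command("queries.fasta", "my_db", "hits.tsv")))
# To run for real: subprocess.run(command, check=True) from the subprocess module.
```

This addresses the web limitations listed above: the database is your own, many queries can go in one file, and parameters such as the word size and Expect threshold are set directly.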

ClustalW

Can be slow

Solution: ClustalX


http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html