ppt

disturbedtonganeseBiotechnology

Oct 2, 2013 (3 years and 10 months ago)

100 views

NGS Bioinformatics
Workshop

1.3
Tutorial
-

Sequence
Alignment and
Searching


March
22
nd
,
2012

IRMACS
10900

Facilitator: Richard
Bruskiewich

Adjunct Professor, MBB

Learning Objectives


A few more Linux tips


FASTA and BLAST on the web


Local BLAST


Local installation of BLAST


Making a blast database


Running a local blast query


Parsing search results using a script


First, a few more Linux tips…


“cat”, “more” or “less”:
list contents of a (text) file


“head” or “tail”:
display the start or end of a file


One can ‘redirect’ the contents of a program into or out of
a file:


‘<‘ for input (‘<<TAG‘ appends from keyboard)


‘>’ for output (‘>>’ appends rather than overwrites)


TRY:

cat >myfile.txt <<EOF

This is my file!

Another line

EOF




more” or “less”:
like “
cat
” but controllable


More….



wget
”:
downloads an internet file


“which”:
displays where
a program is
located on
the system



mkdir
”:
makes a directory



cp
”:
copies files or
directories


“mv”:
moves a file or directory



ln
”:
creates alias names/locations to files


ln


s
source

target

#
-
s link “symbolic”


e.g.
ln


s ncbi
-
blast
-
2.2.26+
ncbi


“export”:
exposes an environment variable, e.g.


export
BLAST=/
usr
/local/
ncbi


Environment variables provide general configuration and global
contextual information that operating system scripts and
computer programs can read if they need to know about such
context.

File archives (and programs)


“.
gz
” files:
gzip
,
gunzip

archive commands


“.tar” files:
tar command


Flags:


-
x #
extract


-
f filename # file to extract


-
v # verbose output


-
z #
uncompress

gz

file on the fly…


e.g.


tar

xvf

file.tar


Tar

zxvf

file.tar.gz


“.bz
2
”:
bzip
2
and bunzip
2
archive commands


“.zip”:
zip and unzip
archive commands

FASTA Walkthrough

Database: Knowledgebase
(complete source for protein
information, combines other
databases), Swiss
-
Prot (only
manually
-
curated proteins)



Paste your sequence here!!!!



Program: fasta (DNA or protein
sequence against a like
database)



Matrix: scoring matrix used

(first click More options… )


Results: interactive or email




Don’t forget to push SUBMIT!

http://www.ebi.ac.uk/Tools/sss/fasta
/


http://
blast.ncbi.nlm.nih.gov


BLAST Walkthrough

Note: the program used;
nucleotide blast (
blastn
), protein
blast (
blastp
), etc. is chosen at
an earlier screen


Paste your sequence here!!!



Database: nr,
swissprot
, etc.


Algorithm:
blastp

(protein blast)



To configure blastp options,

click Algorithm parameters.

BLAST Walkthrough

Algorithm paramters



Expect value: can be changed




Matrix



Filtering: Off by default, can turn
on


Don’t forget to push BLAST!

Graphical representation of
the alignment results; each
line represents an
alignment; color indicates
similarity

List of the hits from the
database you searched
against (ID, name, E
-
value
of top HSP)



click on score to jump
down to textual alignment

Individual display for each
alignment (HSP)

BLAST Results

Protein domains in your
sequence

http
://
www.youtube.com/watch?v=HXEpBnUbAMo&feature=related


You can also use a Script to directly do the
BLAST’ing



e.g.
Biopython

Example 1:

If
you have a nucleotide sequence you want to search against the
nucleotide database (
nt
) using BLASTN, and you know the GI
number of your query sequence, you can use
:


from
Bio.Blast

import NCBIWWW

result_handle

=
NCBIWWW.qblast
("
blastn
", "
nt
", "8332116")


Then, save the result…


save_file

= open("my_blast.xml", "w
") # save as XML format

save_file.write
(
result_handle.read
())

save_file.close
()

result_handle.close
()


More
BLASTing


Example 2:


Alternatively
, if we have our query sequence already in a FASTA
formatted file, we just need to open the file and read in this
record as a string, and use that as the query argument
:


from
Bio.Blast

import NCBIWWW

fasta_string

= open("
m_cold.fasta
").read()

result_handle

=
NCBIWWW.qblast
("
blastn
", "
nt
",
fasta_string
)


See
http
://
biopython.org/DIST/docs/tutorial/Tutorial.html
for
more sample BLAST query code (or see the equivalent sections
of other open
-
bio toolkits)

Locally installed BLAST
-

Advantages

Search many input sequences at once


Customizable databases


No need for internet access


Faster for small to medium
-
sized databases


Integrate BLAST searches into a larger, automated
bioinformatics analysis


Can also run local BLAST through open
-
bio scripts (e.g. see
http
://
biopython.org/DIST/docs/tutorial/Tutorial.html
)




Your

Query

Sequence

BLAST Programs

File downloaded from

NCBI contains:


blastp


blastn


blastx


tblastx


rpsblast


makeblastdb


etc…

BLAST

database

makeblastdb

FASTA file

containing sequences

to build database to
BLAST against (NCBI or
your own file)

Output

Parts of the standalone
BLAST equation

Let’s try this ourselves….

We will:


Obtain and install the BLAST
executables

(Linux)


Set up a BLAST database


Copy an archive


Use ‘
makeblastdb

to create a novel database to search
against


Use the


blastp

program to carry out a

BLAST
analysis over the command
-
line


Output your
BLAST results into a more flexible
format


Use a small
BioPython

script to parse the output

Step 1


Installing BLAST tools


Go to
http://
blast.ncbi.nlm.nih.gov

and follow
links to latest release and follow instructions for
favorite operating system:


Windows 32/64:
.exe installer


Linux 32/64:
compiled binaries (RPM or tar.gz)


Other Unix:
compiled
binaries (in tar.gz)


Apply platform
-
specific configuration details for
your operating system


Read the good documentation:

http://www.ncbi.nlm.nih.gov/books/NBK1763
/


Installing from Linux
.tar.gz
archive

wget

ftp://
ftp.ncbi.nlm.nih.gov/blast/
executables
/


扬b獴
⬯+䅔E協S湣扩
-
扬a獴
-
㈮㈮2㘫
-
砶x
-
l楮i砮t慲a杺

tar
-
zxvf

ncbi
-
blast
-
2.2.26+
-
x64
-
linux.tar.gz

sudo

mv ncbi
-
blast
-
2.2.26+ /
usr
/local

cd /
usr
/local/

sudo

ln

-
s ncbi
-
blast
-
2.2.26+
ncbi

export BLAST=/
usr
/local/
ncbi

export PATH=$BLAST/bin:$PATH


Step
2


Getting a BLAST database


Option A:


Pre
-
formatted databases:
download archives
from NCBI:
ftp
://ftp.ncbi.nlm.nih.gov/blast/db
/


Option B:


“Roll your own”:
construct
de novo
from a file of
custom FASTA sequences using
makeblastdb


Option A: Obtain a existing NCBI database... (Linux)

sudo

mkdir

$BLAST/
db


wget

ftp://
ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz

tar
-
zxvf

swissprot.tar.gz
$
BLAST/
db

cd


Option B:

Create a simple BLAST database
from a local FASTA sequence file

The
makeblastdb

application produces
BLAST databases
from
FASTA files. In the
simplest
case the FASTA definition lines are not
parsed by
makeblastdb

and may be completely
unstructured
(but can only be
BLAST’ed

and not be directly retrieved)


makeblastdb

-
in
mydb.fsa

-
dbtype

nucl


creates
a BLAST database from
a nucleotide
FASTA
sequences
which can be put into the “
db
” directory for searching. Of course,
like all blast programs, there are a rich set of parameters which
can be used to customize the generation of the database (see the
BLAST manual).

With either option, still need some
configuration details

cd # back to home directory

# need to point to the database…

cat
>.
ncbirc

<<EOF

[BLAST]

BLASTDB=/
usr
/local/
ncbi
/
db

EOF


Step
3


Executing a BLAST operation


Command line programs (only) but
parameters are generally equivalent to (or a
superset of) the NCBI web BLAST application


Sample run:


Retrieve a sequence from the database:

blastcmd


db

swissprot


entry
Q9MAH0

outform

"%f"

out test_query.txt


Blast it back against the same database:

blastp


query text_query.txt

db

swissprot


out
output.txt

outfmt

5

Different BLAST
programs

blastn



search nucleotide database using a nucleotide query

blastp



search protein database using a protein query

blastx



search protein database using a translated nucleotide
query

tblastn



search translated nucleotide database using a protein
query

tblastx



search translated nucleotide database using a translated
nucleotide query

psiblast



Position
-
Specific Iterated BLAST

rpsblast



Reversed Position Specific BLAST


See BLAST documentation for related utility programs







Command line parameters:
statistics


-
evalue
: expect value, normally set to 10

-
word_size
: “k
-
tuple
” size; increase for speed, decrease
for sensitivity

-
gapopen
: cost to open a gap; increase for stringency

-
gapextend
: cost to extend a gap; increase for stringency

-
matrix
: substitution scoring matrix (default BLOSUM62);
change if sequences too related or too distant

To get more information use option “
-
help”




Command line
parameters: input/output

-
query

in.txt: specify input file

-
out

out.txt: specify output file

-
db

nr: which database (created with
makeblastdb
)

-
dust

yes/no: filter low complexity regions in nucleotide
sequence search yes/no (default is
yes
)

-
seg

yes/no: filter low complexity regions in protein sequence
search yes/no (default is
no
)

-
html
: format output as HTML

-
outfmt
:
specify output format, e.g. 5 = XML blast output

(use

help
flag to see other options)


Additional useful program options

Depending
on
program:

-
num_threads
: use multiple CPUs (speeds up search)

-
subject
: specify a second input sequence instead of a database (former
‘bl2seq’)

-
task
megablast
:
much

faster for highly similar nucleotide sequences

-
task
blastn_short
: find similar
short

sequences (e.g. primer sequences)








Step 4


Parse the output


If you just have one query sequence, simply
view the BLAST text file


If you are doing a lot of queries on the
database and looking for “best hits”, you may
wish to use a parsing script (e.g.
biopython

or
equivalent)

Parsing BLAST output with
Biopython


Good to use the BLAST XML format for this…


result_handle

= open("my_blast.xml
")




Now
that we’ve got a handle, we are ready to parse the
output. The code to parse it is really quite small
.



1.
If
you expect a single BLAST result (i.e. you used a
single query):


from
Bio.Blast

import NCBIXML

blast_record

=
NCBIXML.read
(
result_handle
)

More parsing…

2.
or
, if you have lots of results (i.e. multiple
query sequences):


from
Bio.Blast

import NCBIXML

blast_records

=
NCBIXML.parse
(
result_handle
)

for
blast_record

in
blast_records
:

...
# Do something with
blast_record

What’s in a BLAST record?

E_VALUE_THRESH
=
0.04

for
alignment in
blast_record.alignments
:



for
hsp

in
alignment.hsps
:


if
hsp.expect

<
E_VALUE_THRESH:



print
'****Alignment
****‘



print
'sequence:',
alignment.title



print
'length:',
alignment.length



print
'e value:',
hsp.expect



print
hsp.query
[0:75] +
'...‘



print
hsp.match
[0:75] +
'...‘



print
hsp.sbjct
[0:75] + '...'

Gives output something like this…

****Alignment
****

sequence
: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation
protein WCOR413
-
like protein

alpha
form mRNA, complete
cds


length
: 783

e
value: 0.034
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc...
||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | ||||||||
| |||
||...
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc
...


A
gain, see
http://biopython.org/DIST/docs/tutorial/Tutorial.html
for more details...