Introduction to Python programming for biologists

gorgeousvassalΛογισμικό & κατασκευή λογ/κού

7 Νοε 2013 (πριν από 3 χρόνια και 7 μήνες)

71 εμφανίσεις

Lecture: Introduction to
BioPython


MARC: Developing Bioinformatics Programs

July 2009



Alex
Ropelewski

PSC
-
NRBSC


Bienvenido

Vélez

UPR Mayaguez



Introduction to Python programming


for biologists

1


The following material is the result of a curriculum development effort to provide a set
of courses to support bioinformatics efforts involving students from the biological
sciences, computer science, and mathematics departments. They have been
developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at
Minority Schools” (2T36 GM008789). The people involved with the curriculum
development effort include:



Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander
Ropelewski

and Dr. David
Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh
Supercomputing Center, Carnegie Mellon University.


Dr. Ricardo
González

Méndez
, University of Puerto Rico Medical Sciences Campus.


Dr.
Alade

Tokuta
, North Carolina Central University.


Dr. Jaime
Seguel

and Dr.
Bienvenido

Vélez
, University of Puerto Rico at
Mayagüez
.


Dr.
Satish

Bhalla
, Johnson C. Smith University.



Unless otherwise specified, all the information contained within is Copyrighted © by
Carnegie Mellon University. Permission is granted for use, modify, and reproduce
these materials for teaching purposes.


Most recent versions of these presentations can be found at
http://marc.psc.edu/



What is
BioPython
?

3


An Open Source Python module that provides many functions
that are highly relevant to bioinformatics:


Functions to read sequences in a variety of file formats


Functions to query databases over the network


Functions to parse output from sequence analysis programs


Functions to perform sequence analysis


BioPython

website contains the
BioPython

module plus tutorials
and examples:


http://www.biopython.org








First, download
BioPython

from the
BioPython

website:


http://www.biopython.org


Install on your computer


Include appropriate module. For list of modules
and descriptions see:


http://www.biopython.org/DIST/docs/api/



Using
BioPython

4


Affy


Align


AlignAce


AlignIO


Alphabet


Application


Blast


CAPS


Clustalw


Cluster


Compass


Crystal


Data


DocSQL


E
u
tils


EZRetrieve


Emboss


Encodings


Entrez


Enzyme


ExPASy


FSSP


Fasta


File


FilteredReader


GA


GFF


GenBank


Geo


Graphics


HMM


HotRand


Index


InterPro


KDTree


KEGG


LogisticRegression


MEME


MarkovModel


MaxEntropy


Medline


Motif


NMR


NaiveBayes


Ndb


NetCatch


NeuralNetwork


Nexus


PDB


ParserSupport


Parsers


Pathway


PopGen


PropertyManager


Prosite


PubMed


Restriction


SCOP


SVDSuperimposer


Search


Seq


SeqFeature


SeqIO


SeqRecord


SeqUtils


Sequencing


Statistics


SubsMat


SwissProt


Transcribe


Translate


UniGene


WWW


Wise


D
istance


kNN


I
stfns


M
athfns


P
airwise2


S
tringfns


T
riefind


U
tils


BioSQL




BioPython Modules

5

http://biopython.org/DIST/docs/api/



from Bio import
SeqIO

handle = open(“
hemoglobin.fasta
”)

for
sr

in
SeqIO.parse
(
handle,"fasta
"):


print sr.id


print sr.seq

handle.close
()

Simple example #1

6

Read in Fasta sequence file



from Bio import
SeqIO

handle = open(“
hemoglobin.gb
”)

for
sr

in
SeqIO.parse
(
handle,“genbank
"):


print sr.id


print sr.seq

handle.close
()

Simple example #2

7

Read in Genbank sequence file



from Bio import
SeqIO

handle = open(“
hemoglobin.uniprot
”)

for
sr

in
SeqIO.parse
(
handle,“swiss
"):


print sr.id


print sr.seq

handle.close
()

Simple example #3

8

Read in UniProt (swiss/trembl) sequence file



from Bio import
SeqIO

from Bio import
Entrez

#Please use your REAL email address below:

Entrez.email
=“youremail@yourdomain.edu”

handle =
Entrez.efetch
(db=“
nucleotide”,rettype
=“
gb”,id
=“NM_000518”)

sr

=
SeqIO.parse
(
handle,“genbank
").next()

print sr.id

print sr.seq

handle.close
()

Fetch Over the Network #1

9

Fetch Genbank Sequence from the Network



from Bio import
SeqIO

from Bio import
Entrez

#Please use your REAL email address below:

Entrez.email
=“youremail@yourdomain.edu”

handle =
Entrez.efetch
(db=“
nucleotide”,rettype
=“
gb”,id
=“NM_000518”)

# The blue line and the red line are equivalent:

#
sr

=
SeqIO.parse
(
handle,“genbank
").next()

sr

=
SeqIO.read
(
handle,”genbank
”)

print sr.id

print sr.seq

handle.close
()

Fetch Over the Network #2

10

Fetch Genbank Sequence from the Network


from Bio import
SeqIO

from Bio import
Entrez

Entrez.email
="ropelews@psc.edu"

InHandle

=
Entrez.efetch
(db="
nucleotide",rettype
="
gb",id
="NM_000518")

OutHandle

= open("NM_000518.gb","w")

sr

=
SeqIO.parse
(
InHandle
, "
genbank
")

SeqIO.write
(
sr,OutHandle,"genbank
")

InHandle.close
()

OutHandle.close
()

Fetch Over the Network #3

11

Fetch Genbank Sequence from the Network and Save


from Bio import
SeqIO

from Bio import
Entrez

Entrez.email
="ropelews@psc.edu"

InHandle

=
Entrez.efetch
(db="
nucleotide",rettype
="
gb",id
="NM_000518")

OutHandle

= open("NM_000518.
fasta
","w")

sr

=
SeqIO.parse
(
InHandle
, "
genbank
")

SeqIO.write
(
sr,OutHandle,“
fasta
")

InHandle.close
()

OutHandle.close
()

Fetch Over the Network #4

12

Fetch Genbank Sequence from the Network and Save


Using
BioPython
, write a program to read in
several sequences in a file in the
Uniprot
/Swiss file
format and save them in a file as FASTA format.


You may use the
Hemoglobin.swiss

test file from:


http://staff.psc.edu/ropelews/MARC2009/index.html

13

Homework


Use your own routine when:


The algorithm or coding is interesting to you


BioPython data structure mapping is too complex for your task


You want to “own” the source code from a copyright perspective



Use Biopython when:


Routine fits your needs


Routine is unchallenging or boring
-

Why waste your time?


Routine will take you a lot of effort to write




Extend Biopython routine when:


Routine almost does what you want but not quite


Challenging for the beginning programmer! Can you read and
understand someone else’s code?


BioPython vs Your Own Routines

14

15

Global Alignment


0 G C T G
G

A
A

G
G

C A T

0
0

-
7
-
14
-
21
-
28
-
35
-
42
-
49
-
56
-
63
-
70
-
77
-
84

G
-
7
5

-
2
-
9
-
16
-
23
-
30
-
37
-
44
-
51
-
58
-
65
-
72

C
-
14
-
2
10 3

-
4
-
11
-
18
-
25
-
32
-
39
-
46
-
53
-
60

A
-
21
-
9 3
6
-
1

-
8
-
6
-
13
-
20
-
27
-
34
-
41
-
48

G
-
28
-
16
-
4
-
1
11

4
-
3

-
10
-
8
-
15
-
22
-
29
-
36

A
-
35
-
23
-
11
-
8 4 7
9

2
-
5

-
12
-
19
-
17
-
24

G
-
42
-
30
-
18
-
15
-
3 9 3
5 7

0

-
7
-
14
-
21

C
-
49
-
37
-
25
-
22
-
10 2 5
-
1 1 3
5

-
2
-
9

A
-
56
-
44
-
32
-
29
-
17
-
5 7 10 3
-
3
-
1
10

3

C
-
63
-
51
-
39
-
36
-
24
-
12 0 3 6
-
1 2
3

6

T
-
70
-
58
-
46
-
34
-
31
-
19
-
7
-
4
-
1 2
-
5
-
2
8


G C T G G A A G G C A
-

T

G C
-

A G
-

A
-

G C A C T

PAM 47, Match = 5, Mismatch =
-
4, Open Gap = 0, Extend Gap =
-
7


from Bio import pairwise2


alignments = pairwise2.align.globalms("GCTGGAAGGCAT",
\


"GCAGAGCACT", 5,
-
4, 0,
-
7)

num=0

for align in alignments:


num=num+1


print "Alignment #",num


print align[0]


print align[1]


print " "

Global Alignment

16

from Bio.Blast import NCBIWWW

from Bio.Blast import NCBIXML

from Bio import SeqIO


query_file = open('hemoglobin.fasta')

save_file = open("hemoglobin.xml", 'w')


record = SeqIO.read(query_file, format="fasta")

results_handle = NCBIWWW.qblast("blastp", "swissprot",
\


record.seq, expect=1, matrix_name='BLOSUM62')

blast_results = results_handle.read()

save_file.write(blast_results)

save_file.close()

Network Blast #1

17

from
Bio.Blast

import NCBIWWW

from
Bio.Blast

import NCBIXML

from Bio import
SeqIO

query_file

= open('
hemoglobin.fasta
')

save_file

= open("hemoglobin.txt", 'w')

record =
SeqIO.read
(
query_file
, format="
fasta
")

results_handle

=
NCBIWWW.qblast
("
blastp
", "
swissprot
", record.seq, expect=1,
\


matrix_name
='BLOSUM62', descriptions=2000, alignments=2000,
hitlist_size
=2000)

blast_results

=
NCBIXML.parse
(
results_handle
).next()

alncnt
=0

for align in
blast_results.alignments
:


alncnt
=
alncnt

+ 1


ln
='Alignment # ' +
str
(
alncnt
) + ' ' +
align.accession



save_file.write
(
ln

+ '
\
n')


save_file.write
(
align.title
[0:132] + '
\
n')


for
hsp

in
align.hsps
:


save_file.write
(
hsp.query

+ '
\
n')


save_file.write
(
hsp.match

+ '
\
n')


save_file.write
(
hsp.sbjct

+ '
\
n')


save_file.write
('
\
n')

query_file.close
()

save_file.close
()

Network Blast #2

18

Network Blast/XML

19