Essential Bioinformatics and Biocomputing (LSM2104 ... - BIDD

Oct 2, 2013

Lecture 9: Back to the Basics: Python and Application in Bioinformatics


Y.Z. Chen

Department of Pharmacy

National University of Singapore


Tel: 65-6616-6877; Email: phacyz@nus.edu.sg; Web: http://bidd.nus.edu.sg


Content


What is Python?


Python basics


Application in bioinformatics

Why Programming?

Programming skills needed for tasks such as:



Write a program to do the same PUBMED search every
week and list the new hits for molecular interactions,
network regulations.



Do a BLAST search against sequences which are on your
list of proteins with known kinetic data



Merge results from different searches



Import data into Excel for plotting



What Programming Tools?


Popularly used programming tools:



Programming languages - Perl, Python, C, C++, Java, Visual Basic, PHP, Fortran

Software libraries - BioPerl, Biopython, and BioJava

Databases - MySQL, Postgres, Oracle





Statistics of Software Usage

Nature Biotech 25, 390 (2007)

Why Python?


Suitable for relatively small automated tasks such as search-and-replace
over a large number of text files, renaming and rearranging files,
writing a small database, building a specialized GUI application, and
developing simple games



Faster and easier to develop in than C/C++/Java



Simpler to use, available on Windows, MacOS X, and Unix operating
systems



A real programming language, more structure and support than shell
scripts or batch files can offer, more error checking than C, high-level
data types built in, applicable to a much larger problem domain than
Awk or even Perl yet in many cases equally easy to use



An interpreted language, which can save you considerable time during
program development because no compilation and linking is necessary.

Why Python?


Allows you to split a program into modules that can be reused in other
Python programs, and comes with a large collection of standard modules
for file I/O, system calls, sockets, and interfaces to graphical user
interface toolkits.



Enables programs to be written compactly and readably at typically
much shorter length than equivalent C, C++, Java programs, for
several reasons:


The high-level data types allow you to express complex
operations in a single statement;


statement grouping is done by indentation instead of beginning
and ending brackets;


no variable or argument declarations are necessary.



Extensible: if you know how to program in C, it is easy to add a new
built-in function or module to the interpreter. You can also link the
Python interpreter into an application written in C and use it as an
extension or command language for that application.
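
As a rough illustration of the three compactness points listed above, here is a minimal sketch. It is written for Python 3 (the slides use Python 2), and the function name count_bases is made up for this example.

```python
# Count how often each base occurs in a DNA sequence.
# No variable declarations are needed, and indentation alone marks the blocks.
def count_bases(sequence):
    counts = {}                      # a built-in high-level data type (dictionary)
    for base in sequence:            # the loop body is defined by its indentation
        counts[base] = counts.get(base, 0) + 1
    return counts

# A single statement can express a fairly complex operation:
# build a sorted list of the distinct bases in a sequence.
distinct = sorted(set("GAATTCGA"))

print(count_bases("GAATTC"))
print(distinct)
```

An equivalent C program would need explicit type declarations, manual memory handling for the table, and braces around each block.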

What is Python?

Python is a Programming Language



Started by Guido van Rossum in 1990 as a way to write
software for the Amoeba operating system. Influenced by
ABC, which was designed to be easy to learn. It is also
very useful for large programs written by expert
programmers.



The word "Python" comes from the comedy troupe "Monty
Python." Words and jokes from the skits and movies
appear often in Python software, including "spam," "idle,"
and "grail"



What is Python?

Python Properties



Interpreted Language


Interactive mode


Imperative and "Object-Oriented"


Cross-platform


Doesn't try to guess what you mean


Great for team projects


Popular for web applications, testing, and XML


Extremely popular for chemical informatics (but not so
much in bioinformatics)



What is Python?

Interactive Mode



Python has an interactive mode. You can type Python code
and see the results immediately. To start Python, open a
unix shell and type "python".


> python

Python 2.3.3 (#1, Jan 29 2004, 22:55:13)

[GCC 3.3.3 [FreeBSD] 20031106] on freebsd5

Type "help", "copyright", "credits" or "license" for more information.

>>>



At the >>> prompt you can enter Python code.


Python Resources
http://python.org/

Python Resources

http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html

Example: Using Python as a calculator


>>> 2+3

5

>>> 4+6*8

52

>>> abs(-4)

4

>>> help(abs)

Help on built-in function abs:


abs(...)

    abs(number) -> number


Return the absolute value of the argument.


>>> 89**34

1902217732808760980190430983601716818363305103120555045416541165041L

>>> print 89**34

1902217732808760980190430983601716818363305103120555045416541165041

>>> "What... is the air-speed velocity of an unladen swallow?"

'What... is the air-speed velocity of an unladen swallow?'

>>> print "What do you mean? An African or European swallow?"

What do you mean? An African or European swallow?

What is Python?

Example: Importing a module


>>> import math

>>> help(math)


Help on module math:

NAME


math

FILE


/usr/local/lib/python2.3/lib-dynload/math.so

DESCRIPTION


This module is always available. It provides access to the mathematical
functions defined by the C standard.


>>> math.pi

3.1415926535897931

>>> math.sin(math.pi/2.0)

1.0

>>>

What is Python?

Example: Print the Time of Day


>>> import datetime

>>> now = datetime.datetime.now()

>>> now

datetime.datetime(2008, 2, 2, 19, 23, 28, 809434)

>>> print now

2008-02-02 19:23:28.809434

>>> print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")

Now is 02-02-2008 at 19:23

>>>


The notation name1.name2 is called an attribute lookup. In this case,
name2 is an attribute of name1 and has some value.

>>> now.day

2

>>> now.year

2008

>>> now.hour

19
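
The attribute lookup described above applies to methods as well as to data values. The following is a small sketch in Python 3 syntax (the slides use Python 2); the variable names are chosen only for this illustration, and the date is fixed so that the result does not depend on when the code runs.

```python
import datetime

# A fixed moment, matching the date shown in the session above.
moment = datetime.datetime(2008, 2, 2, 19, 23, 28)

# name1.name2 looks up the attribute name2 on the object name1.
day = moment.day                 # a data attribute
formatter = moment.strftime      # a method is an attribute too
stamp = formatter("%d-%m-%Y")    # calling the looked-up method

# getattr() performs the same lookup with the attribute name given as a string.
year = getattr(moment, "year")

print(day, year, stamp)
```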

Simple Python script

Code:


# file: simple_code.py

import math

import datetime

print "log(1e23) =", math.log(1e23)

print "2*sin(3.1414) =", 2*math.sin(3.1414)

now = datetime.datetime.now()

print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")

print "or, more precisely, %s" % now


Output:


> python simple_code.py

log(1e23) = 52.9594571389

2*sin(3.1414) = 0.000385307177203

Now is 02-02-2008 at 19:55

or, more precisely, 2008-02-02 19:55:43.046953

>

Python Script

Creating Python Script



A Python program is just a text file. You can use any text (programmer's)
editor. There are several on the Linux machines, including vi, XEmacs,
Kate, xvim, and nedit. You can also use one of the free IDEs like idle,
PyShell, or (under Microsoft Windows) Pythonwin.


Running Python Script



Option 1: Run the python program from the command line, giving it the
name of the script file to run.


> python now.py

Now is 02-02-2004 at 19:55

or, more precisely, 2004-02-02 19:55:43.046953

>


Python Script

Running Python Script



Option 2: Put the magic comment #!/usr/bin/env python as the very first
line in the program.


Code:


#!/usr/bin/env python

# now.py

import datetime

now = datetime.datetime.now()

print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")

print "or, more precisely, %s" % now


Make the script executable with chmod +x now.py


> chmod +x now.py


Then run the program as if it's any other Unix program


> now.py

Now is 02-02-2004 at 19:55

or, more precisely, 2004-02-02 19:55:43.046953

Python Statements



Statement examples:


sum = 2 + 2 # this is a statement

name = raw_input("What is your name?") # these are two statements
print "Hello,", name

print "Did you know that your name has", \
    len(name), "letters?" # This is one statement spread across 2 lines

# Another way to extend a statement across several lines
print "Here is your name repeated 7 times:", (
    name * 7
)


Python Statements


Blocks, If and for statements


EcoRI = "GAATTC"
sequence = raw_input("Enter a DNA sequence:")
if EcoRI in sequence:
    print "Sequence contains an EcoRI site" # This is a one-line block

import sys
sequence2 = raw_input("Enter another sequence:")
if len(sequence2) < 100:
    print "Sequence is too small. Throw it back." # a two-line block
    sys.exit(0)

sequences = (sequence, sequence2)
for seq in sequences:
    print "sequence length =", len(seq) # a block ...
    for c in "ATCG":
        print "#%s = %d" % (c, seq.count(c)) # ... with a block inside it
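
The same if and for blocks can be wrapped in functions so they can be reused without typing input. This is a sketch in Python 3 syntax (the slides use Python 2); the function names find_sites and base_counts and the example sequence are invented for the illustration.

```python
ECORI_SITE = "GAATTC"

def find_sites(sequence, site=ECORI_SITE):
    """Return the start positions (0-based) of every occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:                       # a while block ...
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

def base_counts(sequence):
    """Return a dictionary with the count of each of the four bases."""
    counts = {}
    for base in "ATCG":                      # ... and a for block
        counts[base] = sequence.count(base)
    return counts

example = "TTGAATTCAAGAATTCGG"
if ECORI_SITE in example:                    # an if block containing one statement
    print("EcoRI sites at", find_sites(example))
print(base_counts(example))
```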


Python Objects and Literals

String Literals


# single quotes

'Who said "to be or not to be"?'


# double quotes

"DNA goes from 5' to 3'."


# escaped quotes

"\"That's not fair!\" yelled my sister."

# creates: "That's not fair!" yelled my sister.


# triple quoted strings, with single quotes

'''This one string can go
over several lines'''


# "raw" strings, mostly used for regular expressions

r"\"That's not fair!\" yelled my sister."

# creates: \"That's not fair!\" yelled my sister.


# You can even have raw triple double quoted strings!

r"""So there!"""

Python Objects and Literals

Numeric Literals


123 # an integer


1.23 # a floating point number



-1.23 # a negative floating point number


1.23E45 # scientific notation


0x7b # hexadecimal notation (decimal 123)


0173 # octal notation (decimal 123)


12+3j # complex number 12 + 3i (Note that Python uses "j"!)


2147483648L # a long integer
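
The literals above follow Python 2, which the slides use. As a hedged side note, the sketch below shows how the same values would be written in Python 3, where the octal prefix and the long suffix changed; the variable names are only for illustration.

```python
# Python 3 spellings of the numeric literals listed above.
hex_value = 0x7b          # hexadecimal, written the same way as in Python 2
octal_value = 0o173       # octal needs the 0o prefix (0173 is a syntax error)
big_value = 2147483648    # no L suffix: int has arbitrary precision
z = 12 + 3j               # complex number; the imaginary unit is written j

print(hex_value, octal_value, big_value ** 2, z.real, z.imag)
```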


Python Objects and Literals

List literal


>>> data = [1, 4, 9, 16]

>>> data[0]

1

>>> data[1]

4

>>> data[2] = 7

>>> data

[1, 4, 7, 16]

>>> data[1:3]

[4, 7]

>>>


Python Objects and Literals

Tuple literal


>>> data = (1, 4, 9, 16)

>>> data[1]

4

>>> data[2] = 7

Traceback (most recent call last):


  File "<stdin>", line 1, in ?

TypeError: object doesn't support item assignment

>>>



Dictionary literal


>>> d = {"A": "ALA", "C": "CYS", "D": "ASP"}

>>> print d["A"]

ALA

>>>
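
A dictionary like the one above is a natural fit for translating codes. The sketch below uses Python 3 syntax (the slides use Python 2); the helper to_three_letter and the "???" placeholder for unknown residues are assumptions made for this example.

```python
# One-letter to three-letter amino acid codes (the three entries from the slide).
three_letter = {"A": "ALA", "C": "CYS", "D": "ASP"}

def to_three_letter(peptide, table=three_letter):
    """Translate a one-letter peptide string into a list of three-letter codes.

    Residues missing from the table become "???" instead of raising KeyError.
    """
    return [table.get(residue, "???") for residue in peptide]

three_letter["E"] = "GLU"            # dictionaries are mutable: add a new entry
print("-".join(to_three_letter("ACDE")))
print("X" in three_letter)           # membership test on the keys
```

Unlike the tuple shown earlier, the dictionary accepts item assignment, so the table can grow as more residues are needed.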

Python Operators

Some operation using numbers


>>> (1+2)**2

9

>>> (2+3*4)/2

7

>>> 7%3 # % is the modulo operator

1

>>> 7 == 7

True

>>>
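
The result 7 for (2+3*4)/2 above is what Python 2 prints for integer operands. As a hedged aside, this short Python 3 sketch shows how the division operators behave there; the variable names are only illustrative.

```python
# In Python 3, / is true division and // is floor division.
true_quotient = (2 + 3 * 4) / 2      # 7.0, a float
floor_quotient = (2 + 3 * 4) // 2    # 7, an int, matching the Python 2 session above
uneven = 7 // 2                      # 3: the fractional part is discarded
quotient, remainder = divmod(7, 3)   # floor quotient and modulo in one call

print(true_quotient, floor_quotient, uneven, quotient, remainder)
```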



Python Operators


Some operation using strings


>>> "Andrew" + " " + "Dalke"

'Andrew Dalke'

>>> "*" * 10

'**********'

>>> "My name is %s. What's your name?" % "Andrew"

"My name is Andrew. What's your name?"

>>> "My first name is %s and family name is %s" % ("Andrew", "Dalke")

'My first name is Andrew and family name is Dalke'

>>> "My first name is %(first)s. Is yours also %(first)s?" % \
... {"first": "Andrew", "family": "Dalke"}

'My first name is Andrew. Is yours also Andrew?'


>>> "Andrew" == "Dalke"

False

>>>

Python Functions

http://python.org/doc/current/lib/built-in-funcs.html

Python Functions


String Methods

>>> seq = "AATGCCG"

>>> seq.lower()

'aatgccg'

>>> seq.count("A")

2

>>> seq.find("GC")

3

>>> seq.find("gc")

-1

>>> seq.replace("C", "U")

'AATGUUG'

>>> import string

>>> seq.translate(string.maketrans("ATCG", "TAGC"))

'TTACGGC'

>>> # Make the reverse complement

>>> seq.translate(string.maketrans("ATCG", "TAGC"))[::-1]

'CGGCATT'

>>>
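
The session above relies on string.maketrans, which is Python 2 only. A minimal sketch of the same reverse complement in Python 3, where the table is built with str.maketrans, might look like this; the function name reverse_complement is chosen for the example.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AATGCCG"))
```

Applying the function twice returns the original sequence, which is a quick sanity check for the table.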

Python Functions


Special Methods


Some methods are used so often that they have special syntax.


>>> s = "AATGCCGTTTAT"

>>> s[0] # index

'A'

>>> s[1:4] # slice from position 1 up to, but not including, position 4

'ATG'

>>> s[:4] # default beginning is position 0

'AATG'

>>> s[-1] # index from the end

'T'

>>> s[-3:] # default end includes the last character

'TAT'

>>> s[3:-3]

'GCCGTT'

>>> s[::2] # the optional third parameter is the stride

'ATCGTA'

>>> s[::-1] # returns the string, reversed

'TATTTGCCGTAA'

>>>
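
Slicing combines well with a loop when a sequence has to be cut into fixed-size pieces. The following sketch, in Python 3 syntax, splits the string used above into codons; the function name codons and its frame parameter are assumptions made for this example.

```python
def codons(seq, frame=0):
    """Split seq into consecutive 3-letter codons, starting at offset frame.

    A trailing fragment shorter than three letters is dropped.
    """
    usable_end = len(seq) - (len(seq) - frame) % 3
    return [seq[i:i + 3] for i in range(frame, usable_end, 3)]

s = "AATGCCGTTTAT"
print(codons(s))            # reading frame starting at position 0
print(codons(s, frame=1))   # reading frame starting at position 1
```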

Python Processing Command Line Arguments



When a Python script is run, its command-line arguments (if any) are stored in
the list sys.argv.


Code:


#!/usr/bin/env python

# file: echo.py

import sys

print sys.argv


Output:


> chmod +x echo.py

> echo.py tuna

['echo.py', 'tuna']

> echo.py tuna fish

['echo.py', 'tuna', 'fish']

> echo.py "tuna fish"

['echo.py', 'tuna fish']

> echo.py

['echo.py']

>

Python Processing Command Line Arguments


Computing the Hypotenuse of a Right Triangle


Code:


#!/usr/bin/env python

# file: hypotenuse.py


import sys, math


if len(sys.argv) != 3: # the program name and the two arguments
    # stop the program and print an error message
    sys.exit("Must provide two positive numbers")


# Convert the two arguments from strings into numbers

x = float(sys.argv[1])

y = float(sys.argv[2])

print "Hypotenuse =", math.sqrt(x**2+y**2)


Output:


> hypotenuse.py 5 12

Hypotenuse = 13.0

>
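
The script above checks only the number of arguments. A slightly more defensive variant, sketched here in Python 3 syntax, also rejects non-positive values; the function name hypotenuse_from_args is invented, and a literal list stands in for sys.argv so the example runs on its own.

```python
import math

def hypotenuse_from_args(argv):
    """Return the hypotenuse for argv = [program, x, y], or raise ValueError."""
    if len(argv) != 3:
        raise ValueError("Must provide two positive numbers")
    x = float(argv[1])   # float() itself raises ValueError for text like "abc"
    y = float(argv[2])
    if x <= 0 or y <= 0:
        raise ValueError("Both numbers must be positive")
    return math.hypot(x, y)   # same value as math.sqrt(x**2 + y**2)

# In a script this would be called with sys.argv; a literal list is used here.
print("Hypotenuse =", hypotenuse_from_args(["hypotenuse.py", "5", "12"]))
```

Raising ValueError instead of calling sys.exit inside the function keeps it usable from other programs; a script can catch the exception and pass its message to sys.exit.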

Python I/O (Input / Output)

Input



Text input comes from sys.stdin. It has a method called readline which
reads a line of input.


>>> import sys

>>> s = sys.stdin.readline()

This is a line of text. The line ends when I press 'Enter'.

>>> s

"This is a line of text. The line ends when I press 'Enter'.\n"

>>>



You can also use the raw_input function to get a string from sys.stdin.
This function takes an optional argument which is used as the prompt.


>>> name = raw_input("What is your name? ")

What is your name? Andrew

>>> print name, "is a nice name"

Andrew is a nice name

>>>
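
Because sys.stdin is an ordinary file object, the same reading code works on any file-like object. The sketch below uses Python 3 syntax, with io.StringIO standing in for sys.stdin so that it runs without typed input; the function name read_sequences is an assumption for this example.

```python
import io

def read_sequences(handle):
    """Read one sequence per line from a file-like object, skipping blank lines."""
    sequences = []
    for line in handle:                 # iterating a file object yields its lines
        line = line.rstrip("\n")        # readline() and iteration keep the newline
        if line:
            sequences.append(line)
    return sequences

# io.StringIO stands in for sys.stdin so the example needs no typed input.
fake_stdin = io.StringIO("GAATTC\n\nAATGCCG\n")
first = fake_stdin.readline()           # the first line, newline included
rest = read_sequences(fake_stdin)       # continues after the line already read
print(repr(first), rest)
```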

Python I/O (Input / Output)

Output



Most Python text output goes to the sys.stdout file object. You've been
using the print statement, which uses sys.stdout under the covers.
Output file handles have a write function which writes a string to the file
with no extra interpretation.


>>> a, b, c = 1, 4, 9

>>> print "The first three squares are", a, b, "and", c

The first three squares are 1 4 and 9

>>> print "The first three squares are", a, ",", b, "and", c, "."

The first three squares are 1 , 4 and 9 .

>>> print "The first three squares are %s, %s and %s." % (a, b, c)

The first three squares are 1, 4 and 9.

>>> import sys

>>> sys.stdout.write("The first three squares are %s, %s and %s.\n" %
... (a, b, c))

The first three squares are 1, 4 and 9.

>>>
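
The difference between write and print can be seen side by side by sending both to the same file object. This sketch uses Python 3, where print is a function with sep, end and file parameters (the slides use the Python 2 print statement); an in-memory io.StringIO buffer stands in for sys.stdout.

```python
import io

a, b, c = 1, 4, 9
buffer = io.StringIO()    # an in-memory file object standing in for sys.stdout

# write() emits exactly the string it is given: no spaces or newline are added.
buffer.write("The first three squares are %s, %s and %s.\n" % (a, b, c))

# print() converts each argument to a string, joins them with sep and appends end.
print("The first three squares are", a, b, "and", c, file=buffer)
print(a, b, c, sep=", ", end=".\n", file=buffer)

lines = buffer.getvalue().splitlines()
print(lines)
```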

Python Applications in Bioinformatics

BLAST output parsing



BLAST is the most widely used bioinformatics tool for searching large
sequence databases. The original BLAST authors expected the output
to be read by people only. But many users run BLAST as part of a larger
algorithm and want to automate the BLAST step by using parsers for
the various BLAST output flavors (BLASTN, BLASTP, TBLASTX, WU-BLAST,
and so on). Such parsers have been developed and included in libraries
such as BioPerl, Biopython, and BioJava.


First few lines of the BLASTP output



Python Applications in Bioinformatics

BLAST output parsing



Getting program version information







Program reporting the version information of a BLAST file
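
The lecture's own program is not reproduced in this text. As a hedged sketch only, assuming the report starts with an NCBI-style header line such as "BLASTP 2.2.10 [Oct-19-2004]", the version could be extracted as follows (Python 3; the function name blast_version and the sample header are illustrative and not taken from the lecture's BLAST file).

```python
import re

# Matches a header line such as "BLASTP 2.2.10 [Oct-19-2004]".
# The bracketed date is optional, since not every BLAST flavor prints it.
VERSION_LINE = re.compile(r"^(T?BLAST[NPX]?)\s+(\d+(?:\.\d+)*\+?)(?:\s+\[([^\]]+)\])?")

def blast_version(lines):
    """Return (program, version, date) from the first matching line, else None."""
    for line in lines:
        match = VERSION_LINE.match(line.strip())
        if match:
            return match.groups()
    return None

# Illustrative header, not taken from the lecture's own BLAST file.
sample = ["BLASTP 2.2.10 [Oct-19-2004]", "", "Query= example"]
print(blast_version(sample))
```

In practice the list of lines would come from reading the BLAST output file; other BLAST flavors may format this line differently, so the pattern should be checked against real output before use.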



Python Applications in Bioinformatics

BLAST output parsing



Getting the number of sequences in the database and the number of letters
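
The slide's code is not included in this text. A minimal sketch, assuming the report contains a summary line of the form "N sequences; M total letters" with commas as thousands separators, could read the two numbers like this (Python 3; the function name database_size and the sample numbers are made up).

```python
import re

# Two comma-grouped integers around the words "sequences;" and "total letters".
DB_SIZE = re.compile(r"([\d,]+)\s+sequences;\s+([\d,]+)\s+total letters")

def database_size(text):
    """Return (number_of_sequences, number_of_letters) as ints, or None."""
    match = DB_SIZE.search(text)
    if match is None:
        return None
    sequences, letters = (int(g.replace(",", "")) for g in match.groups())
    return sequences, letters

# Illustrative numbers only; they are not from the lecture's BLAST file.
sample = """Database: example protein database
           1,234 sequences; 567,890 total letters
"""
print(database_size(sample))
```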


Python Applications in Bioinformatics

BLAST output parsing



Reading description lines
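
The slide's code is not included in this text. The sketch below assumes the classic plain-text layout, in which a header line beginning with "Sequences producing significant alignments" is followed by a blank line and then one line per hit ending in a bit score and an E-value. It is written in Python 3; the function name description_lines and the sample identifiers and scores are made up.

```python
def description_lines(lines):
    """Return a list of (title, score, e_value) tuples from the hit summary."""
    hits = []
    in_table = False
    for line in lines:
        line = line.rstrip()
        if line.startswith("Sequences producing significant alignments"):
            in_table = True
            continue
        if not in_table:
            continue
        if not line:
            if hits:          # a blank line after at least one hit ends the table
                break
            continue          # the blank line right after the header
        title, score, e_value = line.rsplit(None, 2)
        if e_value.startswith("e"):      # some reports abbreviate 1e-100 as e-100
            e_value = "1" + e_value
        hits.append((title.strip(), float(score), float(e_value)))
    return hits

# Made-up identifiers and scores, used only to show the layout.
sample = [
    "Sequences producing significant alignments:          (bits) Value",
    "",
    "sp|HYPO1|EXAMPLE_ONE hypothetical protein one          250   e-100",
    "sp|HYPO2|EXAMPLE_TWO hypothetical protein two         31.2   0.52",
    "",
    ">sp|HYPO1|EXAMPLE_ONE hypothetical protein one",
]
for hit in description_lines(sample):
    print(hit)
```

For production work, the maintained parsers in the libraries named earlier are a safer choice than a hand-written one, since the text layout varies between BLAST versions.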

