Introduction to Programming using Python

Introduction to Programming using Python [http://www.python.org/]:
Programming Course for Biologists at the Pasteur Institute
by Katja Schuerer, Corinne Maufrais, Catherine Letondal, Eric Deveaud, and Marie-Agnes Petit
Published February 1, 2005
Copyright ©2005 Pasteur Institute [http://www.pasteur.fr/]
The objective of this course is to teach programming concepts to biologists. It is thus aimed at people who are
not professional computer scientists, but who need better control of computers for their own research. This
programming course is part of a course in informatics for biology [http://www.pasteur.fr/formation/infobio/infobio-en.html].
If you are already a programmer, and if you are just looking for an introduction to Python, you can go
to this Python course [http://www.pasteur.fr/recherche/unites/sis/formation/python/] (in Bioinformatics).
PDF version of this course [support.pdf]
This course is still under construction. Comments are welcome.
Handouts for practical sessions (still under construction) will be available on request.
Contact: ieb@pasteur.fr
Table of Contents
1. Introduction
1.1. First session
1.2. Documentation
1.3. Why Python
1.4. Programming Languages
2. Variables
2.1. Data, values and types of values
2.2. Variables or naming values
2.3. Variable and keywords, variable syntax
2.4. Namespaces or representing variables
2.5. Reassignment of variables
3. Statements, expressions and functions
3.1. Statements
3.2. Sequences or chaining statements
3.3. Functions
3.4. Operations
3.5. Composition and Evaluation of Expressions
4. Communication with outside
4.1. Output
4.2. Formatting strings
4.3. Input
5. Program execution
5.1. Executing code from a file
5.2. Interpreter and Compiler
6. Strings
6.1. Values as objects
6.2. Working with strings
7. Branching and Decisions
7.1. Conditional execution
7.2. Conditions and Boolean expressions
7.3. Logical operators
7.4. Alternative execution
7.5. Chained conditional execution
7.6. Nested conditions
7.7. Solutions
8. Defining Functions
8.1. Defining Functions
8.2. Parameters and Arguments or the difference between a function definition and a function call
8.3. Functions and namespaces
8.4. Boolean functions
9. Collections
9.1. Datatypes for collections
9.2. Methods, Operators and Functions on Lists
9.3. Methods, Operators and Functions on Dictionaries
9.4. What data type for which collection
10. Repetitions
10.1. Repetitions
10.2. The for loop
10.3. The while loop
10.4. Comparison of for and while loops
10.5. Range and Xrange objects
10.6. The map function
10.7. Solutions
11. Nested data structures
11.1. Nested data structures
11.2. Identity of objects
11.3. Copying complex data structures
11.4. Modifying nested structures
12. Files
12.1. Handle files in programs
12.2. Reading data from files
12.3. Writing in files
12.4. Design problems
12.5. Documentation strings
13. Recursive functions
13.1. Recursive functions definitions
13.2. Flow of execution of recursive functions
13.3. Recursive data structures
14. Exceptions
14.1. General Mechanism
14.2. Python built-in exceptions
14.3. Raising exceptions
14.4. Defining exceptions
15. Modules and packages in Python
15.1. Modules
15.1.1. Using modules
15.1.2. Building modules
15.1.3. Where are the modules?
15.1.4. How does it work?
15.1.5. Running a module from the command line
15.2. Packages
15.2.1. Loading
15.3. Getting information on available modules and packages
16. Scripting
16.1. Using the system environment: os and sys modules
16.2. Running Programs
16.3. Parsing command line options with getopt
16.4. Parsing
16.5. Searching for patterns
16.5.1. Introduction to regular expressions
16.5.2. Regular expressions in Python
16.5.3. Prosite
16.5.4. Searching for patterns and parsing
17. Object-oriented programming
17.1. Introduction
17.2. What is a class? An example
17.2.1. Objects description
17.2.2. Methods
17.2.3. Class definition
17.3. Using classes in Python
17.3.1. Creating instances
17.4. Combining objects
17.5. Classes and objects in Python: technical aspects
17.5.1. Namespaces
17.5.2. Objects lifespan
17.5.3. Objects equality
17.5.4. Classes and types
17.5.5. Getting information on classes and instances
18. Object-oriented design
18.1. Introduction
18.2. Components
18.2.1. Software quality factors
18.2.2. Large scale programming
18.2.3. Modularity
18.2.4. Methodology
18.2.5. Reusability
18.3. Abstract Data Types
18.3.1. Definition
18.3.2. Information hiding
18.3.3. Using special methods within classes
18.4. Inheritance: sharing code among classes
18.4.1. Introduction
18.4.2. Discussion
18.5. Flexibility
18.5.1. Summary of mechanisms for flexibility in Python
18.5.2. Manual overloading
18.6. Object-oriented design patterns
Bibliography
List of Figures
1.1. History of programming languages (Source)
2.1. Namespace
2.2. Reassigning values to variables
4.1. Interpretation of formatting templates
5.1. Comparison of compiled and interpreted code
5.2. Execution of byte compiled code
6.1. String indices
7.1. Flow of execution of a simple condition
7.2. If statement
7.3. Block structure of the if statement
7.4. Flow of execution of an alternative condition
7.5. Multiple alternatives or Chained conditions
7.6. Nested conditions
7.7. Multiple alternatives without elif
8.1. Function definitions
8.2. Blocks and indentation
8.3. Stack diagram of function calls
9.1. Comparison of some collection datatypes
10.1. The for loop
10.2. Flow of execution of a while statement
10.3. Structure of the while statement
10.4. Passing functions as arguments
11.1. Representation of nested lists
11.2. Accessing elements in nested lists
11.3. Representation of a nested dictionary
11.4. List comparison
11.5. Copying nested structures
11.6. Modifying compound objects
12.1. ReBase file format
12.2. Flowchart of the processing of the sequence
13.1. Stack diagram of recursive function calls
13.2. A phylogenetic tree topology
13.3. Tree representation using a recursive list structure
14.1. Exceptions class hierarchy
15.1. Module namespace
15.2. Loading specific components
16.1. Manual parsing
16.2. Event-based parsing
16.3. Parsing: decorated grammar
16.4. Parsing result as a hierarchical document
16.5. Pattern searching
16.6. Python regular expressions
16.7. Python regular expressions: classes and methods summary
17.1. Motif object
17.2. Representation showing object's methods as counters
17.3. A Match object o1 with embedded Motif m1 and Protein p1 (not feasible in Python)
17.4. Two match objects and a pattern
17.5. UML diagram for the Motif, Match and Protein classes
17.6. Classes and instances dictionaries
17.7. Class attributes in class dictionary
17.8. Classes methods and bound methods
17.9. Types of classes and objects
18.1. Components as a language
18.2. Three implementations of stacks
18.3. Post office representation of the ADT stack
18.4. Dynamic binding (1)
18.5. Dynamic binding (2)
18.6. UML diagram for inheritance
18.7. Multiple Inheritance
18.8. Alignment inheritance classes hierarchy
18.9. Alignment classes with more composition
18.10. Delegation
18.11. A composite tree
List of Tables
3.1. Order of operator evaluation (highest to lowest)
4.1. String formatting: Conversion characters
4.2. String formatting: Modifiers
4.3. Type conversion functions
6.1. String methods, operators and builtin functions
6.2. Boolean methods and operators on strings
7.1. Boolean operators
9.1. Sequence types: Operators and Functions
9.2. List methods
9.3. Dictionary methods and operations
12.1. File methods
12.2. File modes
18.1. Stack class interface
18.2. Some of the special methods to redefine Python operators
List of Examples
5.1. Executing code from a file
8.1. More complex function definition
8.2. Function to check whether a character is a valid amino acid
10.1. Translate a cds sequence into its corresponding protein sequence
10.2. First example of a while loop
10.3. Translation of a cds sequence using the while statement
11.1. A mixed nested datastructure
12.1. Reading from files
12.2. Restriction of a DNA sequence
14.1. Filename error
14.2. Raising an exception in case of a wrong DNA character
14.3. Raising your own exception in case of a wrong DNA character
14.4. Exceptions defined in Biopython
15.1. A module
15.2. Using the Bio.Fasta package
16.1. Walking subdirectories
16.2. Running a program (1)
16.3. Running a program (2)
16.4. Running a program (3)
16.5. Getopt example
16.6. Searching for the occurrence of PS00079 and PS00080 Prosite patterns in the Human Ferroxidase protein
17.1. Motif, a class for protein motifs
18.1. A Stack
18.2. Stack class using an array-up implementation
18.3. Defining a Stack special method
18.4. Inheritance example (1): sequences
18.5. Inheritance example (2): alignment scoring
18.6. Critique of inheritance: alignment classes
18.7. Curve class: manual overloading
18.8. An uppercase sequence class
18.9. A composite tree
List of Exercises
3.1. Composition
5.1. Execute code from a file
7.1. Chained conditions
7.2. Nested condition
10.1. Repetitions
10.2. Write the complete codon usage function
10.3. Rewrite for as while
11.1. Representing complex structures
12.1. Multiple sequences for all enzymes
15.1. Locating modules
15.2. Bio.SwissProt package
15.3. Using a class from a module
15.4. Import from Bio.Clustalw
16.1. Basename of the current working directory
16.2. Finding files in directories
17.1. A Dictionary class
18.1. Alternative implementation of the Stack class
18.2. Example of an abstract framework: Enzyme parser
18.3. An analyzed sequence class
18.4. A partially editable sequence
Chapter 1. Introduction
1.1. First session
Python 2.2 (#1,Feb 19 2002,11:58:49) [C] on osf1V5
Type "help", "copyright", "credits" or "license" for more information.
>>> 1 + 5
6
>>> 2 * 5
10
>>> 'aaa'
'aaa'
>>> len('aaa')
3
What happened?
>>> len('aaa') + len('ttt')
6
>>> len('aaa') + len('ttt') + 1
7
>>> 'aaa' + 'ttt'
'aaattt'
>>> 'aaa' + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects
Read the error message carefully, and explain it.
How can you protect yourself from this kind of problem?
>>> type(1)
<type 'int'>
>>> type('1')
<type 'string'>
Do you know other possible data types?
>>> type(1.0)
<type 'float'>
>>> 1 == 1
True
>>> 1 == 2
False
You can associate a name with a value:
>>> a = 3
>>> a
3
The interpreter displays 3 instead of a.
>>> a = 2
>>> a
2
>>> a * 5
10
>>> b = a * 5
>>> b
10
>>> a = 1
>>> b
10
Why hasn't b changed?
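One way to investigate this question is to replay the session as a small script. This is only a sketch, assuming a current Python 3 interpreter rather than the Python 2.2 used in the transcripts; the name b_before is ours and is there only to keep the first result.

```python
a = 2
b = a * 5        # a * 5 is evaluated now; b names the result, 10
a = 1            # giving a a new value does not re-evaluate b
print(b)         # 10
b_before = b     # hypothetical helper name, keeps the first result

b = a * 5        # evaluated again, this time with a == 1
print(b)         # 5
```

The assignment stores the value of the expression at the time it is executed, not the expression itself.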
What is the difference between:
>>> b = a * 5
and:
>>> b = 5
?
>>> a = 1    # in this case a is a number
>>> a + 2
3
>>> a = '1'  # in this case a is a string
>>> a + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot add type "int" to string
What do you conclude about the type of a variable?
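The two situations above can be compared side by side. The following is a minimal sketch, assuming a current Python 3 interpreter (the transcripts in this session come from an older Python 2, so the wording of the error message differs); it is meant as an illustration, not as the course's own solution.

```python
# The same name can refer to values of different types over time.
a = 1
print(type(a).__name__)   # int
print(a + 2)              # 3

a = '1'
print(type(a).__name__)   # str

# Adding a number to a string is refused ...
try:
    a + 1
except TypeError as err:
    print('TypeError:', err)

# ... so an explicit conversion states which operation is meant.
print(int(a) + 1)         # 2   (numeric addition)
print(a + str(1))         # 11  (string concatenation)
```

The type belongs to the value that the name currently refers to, which is why the same expression can succeed or fail depending on the last assignment.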
Some magical stuff, that will be explained later:
>>> from string import *
We can also perform calculations on strings:
>>> codon = 'atg'
>>> codon * 3
'atgatgatg'
>>> seq1 = 'agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaaga'
>>> seq2 = 'cggggagtggggagttgagtcgcaagatgagcgagcggatgtccactatgagcgataata'
How do you concatenate seq1 and seq2 into a single string?
>>> 'atgc' == 'atgc'
True
>>> 'atgc' == 'gcta'
False
>>> 'atgc' == 'ATGC'
False
Why are 'atgc' and 'ATGC' different?
We can change the case of a string:
>>> str = 'atgc'
>>> upper(str)
'ATGC'
>>> str = 'GATC'
>>> lower(str)
'gatc'
>>> str
'GATC'
The original string str is not modified.
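In current Python versions, the functions upper and lower used above are available as methods of the string itself. A minimal sketch, assuming Python 3; the variable names s and lowered are ours (s is chosen so that the built-in name str is not hidden).

```python
s = 'GATC'
lowered = s.lower()   # returns a new string
print(lowered)        # gatc
print(s)              # GATC: the original string is unchanged

# Comparing two sequences regardless of case
print('atgc'.upper() == 'ATGC')   # True
```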
>>> seq = seq1 + seq2
What is the length of the string seq?
>>> len(seq)
120
Does the string seq contain the ambiguous 'n' base?
>>> 'n' in seq
False
Does it contain an adenine base?
>>> 'a' in seq
True
>>> seq[1]
'g'
Why?
Because in computer science, strings are numbered from 0 to string length - 1,
so the first character is:
>>> seq[0]
'a'
Display the 12th base.
>>> seq[11]
't'
Find the index of the last character.
>>> len(seq)
120
So, because we know the sequence length, we can display the last character
by:
>>> seq[119]
'a'
But this is not true for all the sequences we will work on.
Find a more generic way to do it.
>>> seq[len(seq) - 1]
'a'
Python provides a special form to get the characters from the end of a string:
>>> seq[-1]
'a'
>>> seq[-2]
't'
Find a way to get the first codon from the sequence
>>> seq[0] + seq[1] + seq[2]
'agc'
Python provides a form to get 'slices' from strings:
>>> seq[0:3]
'agc'
>>> seq[3:6]
'gcc'
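A few more slices on a shorter string may help to see the rule: the start index is included, the end index is excluded, and either can be left out. This is a sketch assuming Python 3; the 16-base string is just the beginning of seq1, retyped here so that the block is self-contained.

```python
seq = 'agcgccttgaattcgg'

print(seq[:3])         # agc  (start defaults to 0)
print(seq[3:6])        # gcc
print(len(seq[3:6]))   # 3: the character at index 6 is not included
print(seq[-3:])        # cgg  (the last three characters)
```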
How many of each base does this sequence contain?
>>> count(seq, 'a')
35
>>> count(seq, 'c')
21
>>> count(seq, 'g')
44
>>> count(seq, 't')
20
Compute the percentage of each base in the sequence.
Example for the adenine representation:
>>> long = len(seq)
>>> nb_a = count(seq, 'a')
>>> (nb_a/long) * 100
0
What happened? How could 35 bases out of 120 be 0 percent?
This is because both numbers are integers: in this version of Python, dividing one integer by another
gives an integer, so 35/120 is truncated to 0. Converting one of them to a floating point number gives
the expected result:
>>> float(nb_a)/long * 100
29.166666666666668
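For comparison, here is a sketch of the same kind of computation in a current Python 3 interpreter, where / always produces a floating point result and // is the truncating integer division (the behaviour seen above). The short sequence and the name seq_length are illustrative assumptions; count is used here as a string method.

```python
seq = 'agcgccttgaattcggcacc'
seq_length = len(seq)              # 20
nb_a = seq.count('a')              # 4

print(nb_a // seq_length)          # 0: integer division truncates
print(100 * nb_a / seq_length)     # 20.0
```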
Now, let us say that you want to find a specific pattern in a DNA sequence:
>>> dna = """tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat"""
>>> EcoRI = 'GAATTC'
>>> BamHI = 'GGATCC'
Looking at the sequence you will see that EcoRI is present twice and
BamHI just once:
tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
 ~~~~~~                                                ~~~
tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat
~~~                          ~~~~~~
>>> count(dna, EcoRI)
0
Why?
Tip: do not forget the case:
>>> EcoRI = lower(EcoRI)
>>> EcoRI
'gaattc'
>>> count(dna, EcoRI)
2
>>> find(dna, EcoRI)
1
>>> find(dna, EcoRI, 2)
88
>>> BamHI = lower(BamHI)
>>> count(dna, BamHI)
0
Why?
Tip: display the sequence:
>>> dna
'tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga\ntccctagctaagatgtattattctgctgtgaattcgatcccactaaagat'
What is this '\n' character?
How can you remove it?
>>> dna = replace(dna, '\n', '')
>>> dna
'tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctggatccctagctaagatgtattattctgctgtgaattcgatcccactaaagat'
>>> find(dna, BamHI)
55
Using the mechanisms we have learnt so far, produce the complement of
the dna sequence.
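The search for restriction sites above can be condensed into a short script. This is a sketch assuming a current Python 3 interpreter, where count, find, replace and lower are methods of the string rather than functions imported from the string module; the names site_ecori and site_bamhi are ours.

```python
dna = """tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat"""

# Remove the line break introduced by the triple-quoted string,
# otherwise a site spanning the two lines cannot be found.
dna = dna.replace('\n', '')

# The sequence is in lower case, so the sites must be as well.
site_ecori = 'GAATTC'.lower()
site_bamhi = 'GGATCC'.lower()

print(dna.count(site_ecori))   # 2
print(dna.find(site_ecori))    # 1
print(dna.count(site_bamhi))   # 1
print(dna.find('nnnn'))        # -1: find returns -1 when nothing matches
```

Normalising the case and removing line breaks before searching avoids both surprises met in the session.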
1.2. Documentation
1.3. Why Python
There are many reasons to use Python as a first language to learn programming. First, there are studies that
show that Python is well designed for beginners [Wang2002], and the language has been explicitly designed by
its author to be easier to learn [Rossum99]. Next, it is more and more often used in bioinformatics as a
general-purpose programming language, to build both components and applications [Mangalam2002]. Another very
important reason is object orientation, which is necessary not just for aesthetics but to scale to modern large-scale
programming [Booch94][Meyer97]. Finally, a rich library of modules for scripting and network programming is
essential for bioinformatics, which very often relies on the integration of existing tools.
1.4. Programming Languages
What can computers do for you? Computers can execute tasks very rapidly, but in order to achieve this they
need an accurate description of the task. They can handle a greater amount of input data than you can. But they
cannot design a strategy to solve problems for you. So if you cannot figure out the procedure that solves your
problem, computers cannot help you.
The computer's own language. Computers do not understand any of the natural languages such as English,
French or German. Their own language, also called machine language, is composed of only two symbols, 0
and 1, or power on and off. They have a sort of dictionary containing all valid words of this language. These
words are the basic instructions, such as "add 1 to some number", "are two values the same?" or "copy a byte of
memory to another place". The execution of these basic instructions is encoded by hardware components of the
processor.
Programming languages. Programming languages belong to the group of formal languages. Some other
examples of formal languages are the system of mathematical expressions or the languages chemists use to
describe molecules. They have been invented as an intermediate abstraction level between humans and computers.
Why not use natural languages as programming languages? Programming languages are designed to prevent
problems that occur with natural languages:
Ambiguity. Natural languages are full of ambiguities and we need the context of a word in order to
choose the appropriate meaning. "Minute", for example, is a unit of time as a noun,
but means tiny as an adjective: only the context distinguishes the two meanings.
Redundancy. Natural languages are full of redundancy, which helps to resolve ambiguities and to
minimize misunderstandings. When you say "We are playing tennis at the moment", "at
the moment" is not really necessary but underlines that it is happening now.
Literacy. Natural languages are full of idioms and metaphors. The most popular in English is
probably "It rains cats and dogs". Besides, this can be very complicated even if you speak
a foreign language very well.
Programming languages are foreign languages for computers. Therefore you need a program that translates your
source code into the machine language. Programming languages are deliberately unambiguous, nearly context-free
and non-redundant, in order to prevent errors in the translation process.
History of programming languages. It is instructive to try to communicate with a computer in its own language.
This lets you learn a lot about how processors work. However, in order to do this, you will have to manipulate only
0's and 1's. You will need a good memory, but you would probably never try to write a program solving real world
problems at this basic level of machine code.
Because humans have difficulty understanding, analyzing and extracting information from sequences of zeros and
ones, they have written a language called Assembler that maps the instruction words to synonyms that give an idea
of what the instruction does, so for instance 0001 became add. Assembler increased the legibility of the code, but
the instruction set remained basic and depended on the hardware of the computer.
In order to write algorithms for solving more complex problems, there was a need for machine-independent
higher-level programming languages with a more elaborate instruction set than the low-level Assembler. Early
examples were Fortran and C, and many more have been invented since. A short history of a subset of programming
languages is shown in Figure 1.1.
Figure 1.1.History of programming languages (Source) [timeline spanning 1956 to 2002; languages shown:
Smalltalk 80, Ruby, SML, Caml, OCaml, Perl, Fortran I, PL/I, Algol 60, Fortran 77, Scheme, Scheme R4RS,
Common Lisp, Pascal, Haskell, Fortran 90, Prolog, Cobol, Smalltalk, C (K&R), Tcl, C++, Java, Java 2 (v1.2),
Python, C#, Lisp, Ada 83, Eiffel, C++ (ISO), ML]
Chapter 2.Variables
2.1.Data,values and types of values
In the first session we explored some basic issues about a DNA sequence. The specific DNA sequence
'atcgat' was one of our data. For computer scientists this is also a value. During the program execution values
are represented in the memory of the computer. In order to interpret these representations correctly, values have
a type.
Type
Types are sets of data or values sharing some specific properties and their associated operations.
We have modeled the DNA sequence, out of habit, as a string (see footnote 1). Strings are one of the basic types
that Python can handle. In the gc calculation we have seen two other ones: integers and floats. If you are not
sure what sort of data you are working with, you can ask Python about it.
>>> type('atcgat')
<type 'str'>
>>> type(1)
<type 'int'>
>>> type('1')
<type 'str'>
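As a small complement, the same question can be asked for several values in a row. This is only a sketch that assumes a current Python 3 interpreter (the sessions of this course use Python 2, which prints <type 'str'> where Python 3 prints <class 'str'>); the variable names are illustrative.

```python
# Sketch for Python 3; the variable names are illustrative and not part of the course
sequence = 'atcgat'      # a DNA sequence modeled as a string
length = 6               # an integer
gc_fraction = 0.5        # a float

for value in (sequence, length, gc_fraction, '1'):
    # type() returns the type object; __name__ is its short name
    print(repr(value), "has type", type(value).__name__)
```

Note that '1' is a string and 1 is an integer: the quotes decide the type, not the characters inside them.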
2.2.Variables or naming values
If you need a value more than once, or you need the result of a calculation later, you have to give it a name to
remember it. Computer scientists also say binding a value to a name, or assigning a value to a variable.
Binding
Binding is the process of naming a value.
Variable
Variables are names bound to values. You can also say that a variable is a name that refers to a value.
>>> EcoRI = 'GAATTC'
So the variable EcoRI is a name that refers to the string value 'GAATTC'.
Footnote 1: For Python the model is important because it knows nothing about DNA but it knows a lot about strings.
The construction used to give names to values is called an assignment. Python, like a lot of other programming
languages, uses the sign = to assign values to variables. The two sides of the sign = cannot be interchanged. The
left side always has to be a variable and the right side a value or the result of a calculation.
Caution
Do not confuse the usage of = in computer science and mathematics. In mathematics, it represents
equality, whereas in Python it is used to give names. So none of the following statements is valid in
Python:
>>> 'GAATTC' = EcoRI
SyntaxError: can't assign to literal
>>> 1 = 2
SyntaxError: can't assign to literal
We will see later how to compare things in Python (Section 11.2).
2.3.Variable and keywords,variable syntax
Python has some conventions for variable names. You can use any letter, the special character _ and any
digit, provided the name does not start with a digit. White space and signs with special meanings in Python,
such as + and -, are not allowed.
Important
Python variable names are case-sensitive, so EcoRI and ecoRI are not the same variable.
>>> EcoRI = 'GAATTC'
>>> ecoRI
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'ecoRI' is not defined
>>> ecori
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'ecori' is not defined
>>> EcoRI
'GAATTC'
Among the words you can construct with these characters, some are reserved by Python and cannot be
used as variable names. These keywords define the language rules and have special meanings in Python. Here is
the list of all of them:
and assert break class continue def del elif
else except exec finally for from global if
import in is lambda not or pass print
raise return try while yield
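If you are unsure whether a name is reserved, you can ask the interpreter itself. The following sketch uses the standard keyword module and assumes Python 3, whose keyword list differs from the one above (for example, exec and print are no longer keywords, while True, False and None are); the candidate names are illustrative.

```python
import keyword

# keyword.kwlist holds the reserved words of the interpreter that runs this code
for name in ("EcoRI", "lambda", "class", "gc"):
    if keyword.iskeyword(name):
        print(name, "is reserved and cannot be used as a variable name")
    else:
        print(name, "can be used as a variable name")
```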
2.4.Namespaces or representing variables
How does Python find the value referenced by a variable? Python stores bindings in a namespace.
Namespace
A namespace is a mapping of variable names to their values.
You can also think about a namespace as a sort of dictionary containing all defined variable names and the
corresponding references to their values.
Reference
A reference is a sort of pointer to a location in memory.
Therefore you do not have to know where exactly your value can be found in memory; Python handles this for you
via variables.
Figure 2.1.Namespace [diagram: in the namespace, the names EcoRI and gc refer to the values 'GAATTC' and 0.546
in the memory space; the memory space also holds the value 105]
Figure 2.1 shows a representation of some namespace. Values which have not been referenced by a variable are
not accessible to you, because you cannot access the memory space directly. So if a result of a calculation is
returned, you can use it directly and forget about it after that. Or you can create a variable holding this value
and then access this value via the variable as often as you want.
>>> from string import *
>>> cds = """atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa
tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg
ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc
ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca
tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt
attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac
gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg
ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag
gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac
gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg
tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt
caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta
tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac
cgttttatcgggcggggtaa""".replace("\n","")
>>> float(count(cds, 'g') + count(cds, 'c')) / len(cds)
0.54460093896713613
Here the result of the gc-calculation is lost.
>>> gc = float(count(cds, 'g') + count(cds, 'c')) / len(cds)
>>> gc
0.54460093896713613
In this example you can remember the result of the gc calculation,because it is stored in the variable gc.
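The picture of a namespace as a dictionary can be made concrete. The sketch below assumes Python 3 and uses an ordinary dictionary (the hypothetical name namespace) together with the built-in exec function to hold the bindings; the numbers are illustrative and are not the gc value of the cds sequence.

```python
# Sketch (Python 3): a namespace behaves like a dictionary of names and values
namespace = {}                      # a hypothetical, initially empty namespace

# exec() runs a statement and stores the bindings it creates in `namespace`
exec("EcoRI = 'GAATTC'", namespace)
exec("gc = float(3 + 2) / 10", namespace)

# Looking up a name in the mapping gives the value it refers to
print(namespace["EcoRI"])
print(namespace["gc"])

# A result that was never bound to a name leaves no entry behind
exec("float(3 + 2) / 10", namespace)
user_names = sorted(name for name in namespace if not name.startswith("__"))
print(user_names)
```

Only the two names that were assigned appear in the mapping; the third calculation was executed, but its result cannot be retrieved afterwards.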
2.5.Reassignment of variables
It is possible to reassign a new value to an already defined variable. This will destroy the reference to its
former value and create a new binding to the new value. This is shown in Figure 2.2.
Figure 2.2.Reassigning values to variables [diagram: the namespace holds the names EcoRI and gc; the memory space
holds the values 'GAATTC', 0.546, 105 and 0.45]
Note
In Python, it is possible to reassign a new value with a different type to a variable. This is called dynamic
typing, because the type of the variable is assigned dynamically. Note that this is not the case in all
programming languages. Sometimes, as in C, the type of variables is assigned statically and has to be
declared before use. This is in some ways safer, because the types of variables can be checked just by
examining the source code, whereas that is not possible if variables are dynamically typed.
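A short sketch of reassignment and dynamic typing, assuming Python 3; the values are illustrative.

```python
gc = 0.546                         # gc is bound to a float
first_type = type(gc).__name__

gc = 0.45                          # rebinding: the reference to 0.546 is dropped
gc = "45 %"                        # rebinding to a string is accepted as well
second_type = type(gc).__name__

print(first_type, "->", second_type)
```

The name gc stays the same, but the type of the value it refers to changes from float to str.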
Chapter 3.Statements,expressions and functions
3.1.Statements
In our first practical lesson, the first thing we did was to invoke the Python interpreter. During the first
session we entered statements that were read, analyzed and executed by the interpreter.
Statement
Statements are instructions or commands that the Python interpreter can execute. Each statement is read by the
interpreter, analyzed and then executed.
3.2.Sequences or chaining statements
Program
A program is a sequence of statements that can be executed by the Python interpreter.
Sequence
Sequencing is a simple programming feature that allows you to chain instructions that will be executed one by
one from top to bottom.
Later we are going to learn more complicated ways to control the flow of a program, such as branching and
repetition.
3.3.Functions
Function
Functions are named sequences of statements that execute some task.
We have already used functions, such as:
>>> type('GAATTC')
<type 'str'>
>>> len(cds)
852
For example, len is a function that calculates the length of things, and we asked here for the length of our DNA
sequence cds.
Function call
Function calls are statements that execute or call a function. The Python syntax of function calls is the function
name followed by a comma-separated list of arguments enclosed in parentheses. Even if a function does not take
any argument, the parentheses are necessary.
Differences between function calls and variables. Like variable names, function names are stored in a namespace
with a reference to their corresponding sequence of statements. When they are called, their name is searched in the
namespace and the reference to their sequence of statements is returned. The procedure is the same as for variable
names. But unlike them, the following parentheses indicate that the returned value is a sequence of statements that
has to be executed. That is why they are necessary even for functions which are called without arguments.
Arguments of functions
Arguments are values provided to a function when the function is called. We will see more about them soon.
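The difference between looking up a function name and calling the function can be tried out directly. This is a sketch assuming Python 3; the name looked_up is illustrative.

```python
motif = 'GAATTC'

# Without parentheses the name is only looked up: the result is the function itself
looked_up = len
print(looked_up)            # <built-in function len>

# With parentheses the function is executed with the given argument
print(len(motif))           # 6

# The parentheses are needed even when no argument is passed
print(float())              # 0.0
```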
3.4.Operations
Operations and Operators
Operations are basic functions with their own syntax.
They have a special operator (a sign or a word) that plays the same role as a function name. Unary operators,
which take one argument, are followed by their argument, and binary operators are surrounded by their two
arguments.
Here are some examples:
>>> 'GTnnAC' + 'GAATTC'
'GTnnACGAATTC'
>>> 'GAATTC' * 3
'GAATTCGAATTCGAATTC'
>>> 'n' in 'GTnnAC'
1
This is only a simpler way, provided by Python, of writing these functions, because humans are in general more
familiar with this syntax, which is closely related to the mathematical formal language.
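That operations really are functions with a special syntax can be seen with the standard operator module, which provides a function counterpart for each operator. The sketch assumes Python 3, where the membership test prints True instead of 1.

```python
import operator

# Each operator has a function counterpart in the standard operator module
print(operator.add('GTnnAC', 'GAATTC'))      # same as 'GTnnAC' + 'GAATTC'
print(operator.mul('GAATTC', 3))             # same as 'GAATTC' * 3
print(operator.contains('GTnnAC', 'n'))      # same as 'n' in 'GTnnAC'
print(operator.neg(5))                       # a unary operator: same as -5
```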
3.5.Composition and Evaluation of Expressions
Composition and Expression
Composition is a way to combine functions. The combination is also called an expression.
We have already used it. Here is the most complex example we have seen so far:
>>> float(count(cds, 'g') + count(cds, 'c')) / len(cds)
0.54460093896713613
What happens when this expression is executed? The first thing to say is that it is a mixed expression
of operations and function calls. Let's start with the function calls. If a function is called with an argument
that is itself a composed expression, this expression is executed first and the resulting value is passed to the
calling function. So the cds variable is evaluated, which returns the value that it refers to. This value is
passed to the len function, which returns the length of this value. The same happens for the float function. The
operation count(cds, 'g') + count(cds, 'c') is evaluated first, and the result is passed as argument to float.
Let's continue with the operations. There is a precedence list, shown in Table 3.1, for all operators, which
determines what to execute first if there are no parentheses; otherwise it is the same as for function calls. So,
for the operation count(cds, 'g') + count(cds, 'c'), the two count functions are executed first, on
the value of the cds variable and 'g' and 'c' respectively. Then the two counts are added. The resulting value of
the addition is then passed as argument to the float function, followed by the division of the results of the two
functions float and len.
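The order of evaluation described above can be made visible by writing each step on its own line. This is a sketch under some assumptions: it targets Python 3, it uses a short illustrative sequence instead of the full cds, and it uses the count method of strings (see Chapter 6) because the count function of the string module no longer exists in Python 3.

```python
cds = "atgagtgaacgtctgagc"        # a short, illustrative sequence (not the course's cds)

# Innermost calls first: the two counts
g_count = cds.count('g')
c_count = cds.count('c')

# Then the addition, whose result is passed to float
gc_count = float(g_count + c_count)

# len(cds) is evaluated, and finally the division is performed
length = len(cds)
gc_stepwise = gc_count / length

# The composed expression gives the same value in a single line
gc_composed = float(cds.count('g') + cds.count('c')) / len(cds)
print(gc_stepwise, gc_composed)
```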
Table 3.1.Order of operator evaluation (highest to lowest)
Operator                              Name
+x, -x, ~x                            Unary operators
x ** y                                Power (right associative)
x * y, x / y, x % y                   Multiplication, division, modulo
x + y, x - y                          Addition, subtraction
x << y, x >> y                        Bit shifting
x & y                                 Bitwise and
x | y                                 Bitwise or
x < y, x <= y, x > y, x >= y,
x == y, x != y, x <> y, x is y,
x is not y, x in s, x not in s        Comparison, identity, sequence membership tests
not x                                 Logical negation
x and y                               Logical and
lambda args: expr                     Anonymous function
So, as in mathematics, the innermost function or operation is evaluated first and the result is passed as argument
to the enclosing function or operation. It is important to notice that variables are evaluated and only their values
are passed as argument to the function. We will have a closer look at this when we talk about function definitions
in Section 8.1.
Exercise 3.1.Composition
Have a look at this example. Can you explain what happens? If you can't, please read this section once again.
>>> from string import *
>>> replace(replace(replace(cds, 'A', 'a'), 'T', 'A'), 'a', 'T')
Chapter 4.Communication with outside
4.1.Output
We saw in the previous chapter how to export information outside of the program using the print statement.
Let us look at its use in a little more detail here.
The print statement can be followed by a variable number of values separated by commas. Without a value,
print puts only a newline character on the standard output, generally the screen. If values are provided, they
are transformed into strings and then written in the given order, separated by a space character. The line is
terminated by a newline character. You can suppress the final newline character by adding a comma at the end of
the list. The following example illustrates all these possibilities:
#!/usr/local/bin/python
from string import *
dna = "ATGCAGTGCATAAGTTGAGATTAGAGACCCGACAGTA"
gc = float(count(dna, 'G') + count(dna, 'C')) / len(dna)
print gc
print "the gc percentage of dna:", dna, "is:", gc
print
print "the gc percentage of dna:", dna
print "is:", gc
print
print "the gc percentage of dna:", dna,
print "is:", gc
producing the following output:
caroline:~> python print_gc.2.py
0.432432432432
the gc percentage of dna: ATGCAGTGCATAAGTTGAGATTAGAGACCCGACAGTA is: 0.432432432432

the gc percentage of dna: ATGCAGTGCATAAGTTGAGATTAGAGACCCGACAGTA
is: 0.432432432432

the gc percentage of dna: ATGCAGTGCATAAGTTGAGATTAGAGACCCGACAGTA is: 0.432432432432
caroline:~>
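If you work with Python 3, print is a function rather than a statement, and the trailing comma is replaced by the end argument. The sketch below reproduces three of the cases above under that assumption; it writes into an io.StringIO buffer (the hypothetical name out) only so that the produced text can be inspected, and it uses the count method of strings instead of the string module.

```python
import io

dna = "ATGCAGTGCATAAGTTGAGATTAGAGACCCGACAGTA"
gc = float(dna.count('G') + dna.count('C')) / len(dna)

out = io.StringIO()          # collects the text instead of writing to the screen
print("the gc percentage of dna:", dna, "is:", gc, file=out)   # values joined by spaces
print(file=out)                                                # no value: only a newline
print("the gc percentage of dna:", dna, end=" ", file=out)     # end=" " replaces the newline
print("is:", gc, file=out)

print(out.getvalue(), end="")
```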
4.2.Formatting strings
Important
All data printed on the screen have to be character data. But values can have different types. Therefore
they have to be transformed into strings beforehand. This transformation is handled by the print
statement.
It is possible to control this transformation when a specific format is needed. In the examples above, the float
value of the gc calculation is written with lots of digits following the dot, most of which are not significant.
The next example shows a more reasonable output:
>>> print "%.3f" % gc
0.432
>>> print "%3.1f %%" % (gc*100)
43.2 %
>>> print "the gc percentage of dna: %.10s... is: %4.1f %%." % (dna, gc*100)
the gc percentage of dna: ATGCAGTGCA... is: 43.2 %.
Figure 4.1 shows how to interpret the example above. The % (modulo) operator can be used to format strings. It
is preceded by the formatting template and followed by a comma-separated list of values enclosed in parentheses.
These values replace the formatting placeholders in the template string. A placeholder starts with a % followed
by some modifiers and a character indicating the type of the value. There has to be the same number of values
and placeholders.
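The % operator for strings behaves the same way in Python 2 and Python 3, so the rules above can be checked with a short sketch. The template texts are illustrative; note that a precision such as .10 in %.10s keeps at most 10 characters of a string, whereas a width such as 10 in %10s only sets a minimum.

```python
dna = "ATGCAGTGCATAAGTTGAGATTAGAGACCCGACAGTA"
gc = float(dna.count('G') + dna.count('C')) / len(dna)

three_digits = "%.3f" % gc                    # 3 digits after the dot
percentage = "%3.1f %%" % (gc * 100)          # %% writes a literal percent sign
# Two placeholders need a tuple with two values
summary = "dna: %.10s... gc: %4.1f %%" % (dna, gc * 100)

print(three_digits)
print(percentage)
print(summary)
```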
Figure 4.1.Interpretation of formatting templates [diagram annotating >>> print "%3.1f %%" % (gc*100), which
prints 43.2 %: the formatting string is followed by the percent operator, which is preceded by the formatting
string and followed by a tuple of values that will replace the placeholders; if there is more than one value, they
have to be separated by commas and enclosed in parentheses. In the placeholder %3.1f, % indicates that a format
follows, 3 is the total number of digits, 1 is the number of digits following the dot, and f is the letter
indicating the type of the value to format]
Table 4.1 provides the characters that you can use in the formatting template and Table 4.2 gives the modifiers of
the formatting character.
Important
Remember that the result of formatting is a string and no longer has the type of the input value.
>>> "%.1f" % (gc*100)
'43.2'
>>> res = "%.1f" % (gc*100)
>>> at = 100 - res
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for -: 'int' and 'str'
>>> res
'43.2'
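One way around this error, sketched here for Python 3 with an illustrative gc value, is to convert the formatted string back into a number before calculating with it. It is usually simpler still to calculate with the numbers first and to format only the final result.

```python
gc = 0.432432432432

res = "%.1f" % (gc * 100)      # res is the string '43.2', not a number
try:
    at = 100 - res             # fails: a string cannot be subtracted from an integer
except TypeError as error:
    print("TypeError:", error)

at = 100 - float(res)          # convert the string back to a float first
print("%.1f" % at)             # 56.8
```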
Table 4.1.String formatting:Conversion characters
Formatting character   Output                                           Example            Result
d, i                   decimal or long integer                          "%d" % 10          '10'
o, x                   octal/hexadecimal integer                        "%o" % 10          '12'
f, e, E                normal, 'E' notation of floating point numbers   "%e" % 10.0        '1.000000e+01'
s                      strings or any object that has a str() method    "%s" % [1,2,3]     '[1, 2, 3]'
r                      string, use the repr() function of the object    "%r" % [1,2,3]     '[1, 2, 3]'
%                      literal %
Table 4.2.String formatting:Modifiers (in the results, _ marks a space)
Modifier              Action                                     Example                                       Result
name in parentheses   selects the key name in a mapping object   "%(num)d %(str)s" % {'num':1, 'str':'dna'}    '1 dna'
-, +                  left, right alignment                      "%-10s" % "dna"                               'dna_______'
0                     zero filled string                         "%04i" % 10                                   '0010'
number                minimum field width                        "%10s" % "dna"                                '_______dna'
.number               precision                                  "%4.2f" % 10.1                                '10.10'
4.3.Input
As you can print results on the screen,you can read data from the keyboard which is the standard input device.
Python provides the raw_input function for that,which is used as follows:
>>> nb = raw_input("Enter a number,please:")
Enter a number,please:12
The prompt argument is optional and the input has to be terminated by a return.
Important
raw_input always returns a string, even if you entered a number. Therefore you have to convert
the input string yourself into whatever you need. Table 4.3 gives an overview of the available type
conversion functions.
>>> nb
'12'
>>> type(nb)
<type 'str'>
>>> nb = int(nb)
>>> nb
12
>>> type(nb)
<type 'int'>
Notice that a user can enter anything. So the input may not be what you expect, and the type
conversion can therefore fail. It is prudent to test the input strings before using the converted values.
>>> nb = raw_input("Please enter a number:")
Please enter a number:toto
>>> nb
'toto'
>>> int(nb)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for int(): toto
The following function controls the input:
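def read_number():
    while 1:
        nb = raw_input("Please enter a number:")
        try:
            # nbconv holds the converted integer; the conversion only serves as a test
            nbconv = int(nb)
        except ValueError:
            # int() could not convert the input: report it and ask again
            print nb, "is not a number."
            continue
        else:
            break
    # the accepted input is returned as the string that was entered
    return nb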
and produces the following output:
>>> read_number()
Please enter a number:toto
toto is not a number.
Please enter a number:12
'12'
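The same control can be sketched without the keyboard, which makes it easier to try out. The following is not the function above but a variant under stated assumptions: it targets Python 3, the helper names to_number and first_number are hypothetical, the answers come from a list instead of raw_input, and the converted integer is returned rather than the string.

```python
def to_number(text):
    """Return text converted to an integer, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def first_number(answers):
    """Return the first answer that converts to an integer (None if there is none)."""
    for answer in answers:
        number = to_number(answer)
        if number is None:
            print(answer, "is not a number.")
            continue
        return number
    return None


# Simulated keyboard input: 'toto' is rejected, '12' is accepted
print(first_number(["toto", "12"]))
```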
Table 4.3.Type conversion functions
Function                 Description
int(x [,base])           converts x to an integer
long(x [,base])          converts x to a long integer
float(x)                 converts x to a floating-point number
complex(real [,imag])    creates a complex number
str(x)                   converts x to a string representation
repr(x)                  converts x to an expression string
eval(str)                evaluates str and returns an object
tuple(s)                 converts a sequence object to a tuple
list(s)                  converts a sequence object to a list
chr(x)                   converts an integer to a character
unichr(x)                converts an integer to a Unicode character
ord(c)                   converts a character to its integer value
hex(x)                   converts an integer to a hexadecimal string
oct(x)                   converts an integer to an octal string
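A few of these conversion functions in action, as a sketch for Python 3. Two entries of the table are left out on purpose because they do not exist in Python 3: long (int handles integers of any size) and unichr (chr covers Unicode characters).

```python
print(int('12'))            # 12
print(int('ff', 16))        # 255: the optional base argument
print(float('0.54'))        # 0.54
print(str(12))              # the string '12'
print(repr('dna'))          # the expression string, quotes included: 'dna'
print(list('GAATTC'))       # ['G', 'A', 'A', 'T', 'T', 'C']
print(tuple([1, 2, 3]))     # (1, 2, 3)
print(ord('A'), chr(65))    # 65 A
print(hex(255), oct(8))     # 0xff 0o10 (Python 2 writes the octal string as 010)
```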
Chapter 5.Program execution
5.1.Executing code from a file
Until now we have only worked interactively during an interpreter session. But each time we leave our session all
definitions made are lost, and we have to re-enter them in the next session of the interpreter whenever we
need them. This is not very convenient. To avoid that, you can put your code in a file and then pass the file to
the Python interpreter. Here is an example:
Example 5.1.Executing code from a file
Take the code for the gc calculation as an example and put it in a file named gc.py:
from string import *
cds ="""atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa
tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg
ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc
ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca
tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt
attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac
gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg
ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag
gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac
gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg
tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt
caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta
tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac
cgttttatcgggcggggtaa""".replace("\n","")
gc = float(count(cds,'g') + count(cds,'c'))/len(cds)
print gc
and now pass this file to the interpreter:
caroline:~/python_cours> python gc.py
0.54460093896713613
The print statement is used to write a message on the screen. See Section 4.1 for a closer look at this statement.
Tip
You can name your file as you like. However, there is a convention for files containing Python code to
have a .py extension.
You can also make your file executable if you put the following line at the beginning of your file, indicating that
this file has to be executed with the Python interpreter:
#!/usr/local/bin/python
(Do not forget to set the x execution bit on UNIX systems.) Now you can execute your file:
caroline:~/python_cours> ./gc.py
0.54460093896713613
This will automatically call the Python interpreter and execute all the code in your file.
You can also load the code of a file in an interactive interpreter session with the -i option:
caroline:~/python_cours> python -i gc.py
0.54460093896713613
>>>
This will start the interpreter, execute all the code in your file and then give you a Python prompt to continue:
>>> cds
'atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaatttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtgctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggcccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatcatcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacatttattgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatacgctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtgggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggaggaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaacgcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatgtgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagtttcaggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcactatgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaaccgttttatcgggcggggtaa'
>>> cds = "atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaatttcgggtgccgacctgacgcgcccgtt"
>>> cds
'atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaatttcgggtgccgacctgacgcgcccgtt'
>>> gc
0.54460093896713613
Important
It is important to remember that the Python interpreter executes code from top to bottom; this is also true
for code in a file. So take care to define things before you use them.
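What happens when a name is used before its definition has been executed can be sketched as follows (Python 3; the names ecori_site and used_too_early are hypothetical). The try statement only serves to catch the error so that the remaining lines still run.

```python
used_too_early = False
try:
    print(ecori_site)            # used before it is defined
except NameError as error:
    used_too_early = True
    print("NameError:", error)

ecori_site = 'GAATTC'            # the definition is executed here
print(ecori_site)                # works: the definition precedes this line
```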
Exercise 5.1.Execute code from a file
Take all expressions that we have written so far and put them in a file.
Important
Notice that you have to ask explicitly for printing a result when you execute some code from a file, while
in an interactive interpreter session the result of the execution of a statement is printed automatically. So to
view the result of the gc calculation in the code above, the print statement is necessary in the
file version, whereas during an interactive interpreter session we have never written it.
5.2.Interpreter and Compiler
Let's introduce at this point some concepts of execution of programs written in high-level programming languages.
As we have already seen, the only language that a computer can understand is the so-called machine language.
Machine languages are composed of a set of basic operations whose execution is implemented in the hardware of
the processor. We have also seen that high-level programming languages provide a machine-independent level
of abstraction that is higher than the machine language. Therefore, they are better suited to human-machine
interaction. But this also implies that there is a sort of translator between the high-level programming language
and the machine language. There exist two sorts of translators:
Interpreter
An interpreter is a program that implements or simulates a virtual machine using the base set of instructions of
a programming language as its machine language.
You can also think of an interpreter as a program that implements a library containing the implementation of the
basic instruction set of a programming language in machine language.
An interpreter reads the statements of a program, analyzes them and then executes them on the virtual machine
or calls the corresponding instructions of the library.
Interactive interpreter session
During an interactive interpreter session the statements are not only read, analyzed and executed, but the result
of the evaluation of an expression is also printed. This is also called a READ - EVAL - PRINT loop.
Important
Pay attention: the READ - EVAL - PRINT loop is only entered in an interactive session. If you ask the
interpreter to execute code in a file, results of expression evaluations are not printed. You have to do this
yourself.
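The READ - EVAL - PRINT loop can be imitated in a few lines. This is a deliberately simplified sketch for Python 3, not the way the real interpreter is implemented: the function name read_eval_print is hypothetical, the input comes from a list instead of the keyboard, and only expressions (not statements such as assignments) are handled.

```python
def read_eval_print(lines, namespace):
    """Evaluate each expression string and print its result, as an interactive session does."""
    results = []
    for line in lines:                      # READ: take the next line of input
        value = eval(line, namespace)       # EVAL: evaluate the expression
        print(repr(value))                  # PRINT: show the result
        results.append(value)
    return results


# Simulated input of an interactive session
session = ["len('GAATTC')", "'GAATTC' * 2", "1 + 2 * 3"]
results = read_eval_print(session, {})
```

Without the print call in the loop the expressions would still be evaluated, but nothing would appear on the screen, which is what happens when code is executed from a file.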
Compiler
A compiler is a program that translates code of a programming language into machine code, also called object
code. The object code can be executed directly on the machine where it was compiled.
Figure 5.1 compares the usage of interpreters and compilers.
Figure 5.1.Comparison of compiled and interpreted code [diagram with the elements: source code, compiler,
interpreter, virtual machine, processor]
So using a compiler separates translation and execution of a program. In contrast to an interpreted program, the
source code is translated only once.
The object code is machine-dependent, meaning that the compiled program can only be executed on a machine
for which it has been compiled, whereas an interpreted program is not machine-dependent because the
machine-dependent part is in the interpreter itself.
Figure 5.2 illustrates another concept of program execution that tries to combine the advantage of more efficient
execution of compiled code and the advantage of machine-independence of interpreted code. This concept is used
by the Java programming language, for example, and in a more subtle way by Python.
Figure 5.2.Execution of byte compiled code [diagram with the elements: source code, compiler, bytecode,
interpreter, virtual machine, processor]
In this case the source code is translated by a compiler into a sort of object code, also called byte code, that is
then executed by an interpreter implementing a virtual machine using this byte code. The execution of the byte
code is faster than the interpretation of the source code, because the major part of the analysis and verification
of the source code is done during the compilation step. But the byte code is still machine-independent because
the machine-dependent part is implemented in the virtual machine. We will see later how this concept is used in
Python (Section 15.1).
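The two steps, compilation to byte code and execution by the virtual machine, can be observed from within Python with the built-in functions compile and eval. This is a sketch assuming Python 3; the expression, the file name label "<sketch>" and the values are illustrative.

```python
source = "float(g + c) / length"

# Compilation step: the source text is analyzed once and translated into a code object
code = compile(source, "<sketch>", "eval")
print(type(code).__name__)          # code
print(len(code.co_code) > 0)        # the byte code itself is stored as bytes

# Execution step: the virtual machine runs the same byte code, possibly many times
for g, c, length in [(3, 2, 10), (10, 6, 37)]:
    print(eval(code, {"g": g, "c": c, "length": length}))
```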
Chapter 6.Strings
So far we have seen a lot about strings.Before giving a summary about this data type,let us introduce a new
syntax feature.
6.1.Values as objects
We have seen that strings have a value. But Python values are more than that. They are objects.
Object
Objects are things that know more than their values. In particular, you can ask them to perform specialized tasks
that only they can do.
Up to now we have used some special functions for handling string data, made available to us by the so far magic
statement from string import *. But strings themselves know how to execute all of them and even more.
Look at this:
>>> motif = "gaattc"
>>> motif.upper()
'GAATTC'
>>> motif
'gaattc'
>>> motif.isalpha()
1
>>> motif.count('n')
0
>>> motif = 'GAATTC_'
>>> motif + motif
'GAATTC_GAATTC_'
>>> motif * 3
'GAATTC_GAATTC_GAATTC_'
At first glance this looks a little bit strange, but you can read the . (dot) operator as: ask the object motif
to do something, such as: transform motif into an uppercase string (upper), tell whether it contains only letters
(isalpha), or count the number of n characters (count).
Objects as namespaces. How does it work? All objects have their own namespace containing all variable and
function names that are defined for that object. Like the namespaces described in Section 2.4, it maps names to
values, and you can see all names defined for an object by using the dir function:
>>> dir(motif)
['__add__', '__class__', '__contains__', '__delattr__', '__eq__', '__ge__',
'__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__',
'__init__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__',
'__reduce__', '__repr__', '__rmul__', '__setattr__', '__str__', 'capitalize',
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find',
'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex',
'rjust', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
'title', 'translate', 'upper']
The dot operator is used to access this namespace. It looks up, in the namespace of the object, the name
following the dot operator.
>>> motif.__class__
<type 'str'>
>>> motif.replace('a', 'A')
'gAAttc'
Variables and functions defined in object namespaces are called attributes and methods of the object.
Attribute
An attribute is a variable defined in a namespace of an object, which can only be accessed via the object itself.
Method
Methods are functions defined in a namespace of an object.
This is just a little introduction to objects, making it possible to use the object syntax for the basic types in
Python. We will explain object-oriented programming further in Chapter 17.
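The lookup performed by the dot operator can be split into its two steps, finding the name and calling what was found. This is a sketch for Python 3; the built-in getattr function does the same lookup from a name given as a string, and the variable name method is illustrative.

```python
motif = "gaattc"

# The names defined for the object are its attributes and methods
print('upper' in dir(motif))            # True: upper is defined for strings

# The dot operator looks the name up in the namespace of the object...
method = motif.upper
# ...and the parentheses call the method that was found
print(method())                         # GAATTC

# getattr performs the same lookup from a name given as a string
print(getattr(motif, 'count')('a'))     # 2
```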
6.2.Working with strings
Strings
Strings are sequences or ordered collections of characters.
You can write them in Python using single quotes, double quotes, triple single quotes or triple double quotes.
The triple quote notation allows you to write strings spanning multiple lines while keeping the line feeds.
>>> 'ATGCA'
'ATGCA'
>>> "ATGCA"
'ATGCA'
>>> """ATGATA
... AGAGA"""
'ATGATA\nAGAGA'
The first thing that we will sum up is how to extract characters or substrings. Characters are accessible using
their position in the string, which has to be enclosed in brackets following the string. The position number can
be positive or negative, which means counting from the beginning or from the end of the string. Substrings are
extracted by providing an interval of the start and end positions separated by a colon. Both positions are
optional, which means either starting at the beginning of the string or extracting the substring until the end.
When accessing a character, it is an error to access a position that does not exist, whereas during substring
extraction the longest possible string is extracted.
>>> motif = 'GAATTC'
>>> motif[0]
'G'
>>> motif[-1]
'C'
>>> motif[0:3]
'GAA'
>>> motif[1:3]
'AA'
>>> motif[:3]
'GAA'
>>> motif[3:]
'TTC'
>>> motif[3:6]
'TTC'
>>> motif[3:-1]
'TT'
>>> motif[-3:-1]
'TT'
>>> motif[:]
'GAATTC'
>>> motif[100]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>> motif[3:100]
'TTC'
>>> motif[3:2]
''
Caution
Figure 6.1 compares positive and negative indices. Be careful: forward string indices always start at
0, whereas backward string indices start at -1.
Figure 6.1. String indices
motif =   G   A   A   T   T   C
          0   1   2   3   4   5
         -6  -5  -4  -3  -2  -1
EcoRI[1:4] == EcoRI[-5:-2]
Caution
It is also important to notice that the character at the end position of a substring extraction is never
included in the extracted substring.
Warning
Strings are immutable in Python. This means you can change neither characters nor substrings. You
always have to create a new copy of the string.
>>> motif[1] = 'n'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> motif[:1] + 'n' + motif[2:]
'GnATTC'
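The slicing idiom above can be wrapped in a small helper. This is only a sketch: the function name replace_at and its range check are illustrative assumptions, not part of the course material. It builds a new string and leaves the original untouched.

```python
def replace_at(seq, pos, new):
    """Return a copy of seq in which the character at index pos is replaced by new."""
    if pos < 0 or pos >= len(seq):
        raise IndexError("position out of range")
    # Strings are immutable, so a new string is assembled from two slices.
    return seq[:pos] + new + seq[pos + 1:]

motif = 'GAATTC'
changed = replace_at(motif, 1, 'n')
print(changed)  # GnATTC
print(motif)    # GAATTC, the original string is unchanged
```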
All other methods, functions and operators, and their action on string objects, are summarized in Table 6.1
and Table 6.2.
Table 6.1. String methods, operators and builtin functions
Method, Operator, Function          Description
s + t                               Concatenation
s * 3                               Repetition
len(s)                              Returns the length of s
min(s), max(s)                      Returns the smallest, largest character of s, depending on their position in the ASCII code
s.capitalize()                      Capitalizes the first character of s
s.center([width])                   Centers s in a field of length width
s.count(sub [,start [,end]])        Counts occurrences of sub between start and end
s.encode([encoding [,errors]])      Encodes s using encoding as code and errors
s.expandtabs([tabsize])             Expands tabs
s.find(sub [,start [,end]])         Finds the first occurrence of sub between start and end
s.index(sub [,start [,end]])        Same as find but raises an exception if no occurrence is found
s.join(words)                       Joins the list of words with s as delimiter
s.ljust(width)                      Left-aligns s in a string of length width
s.lower()                           Returns a lowercase version of s
s.lstrip()                          Removes all leading whitespace characters of s
s.replace(old,new [,maxrep])        Replaces at most maxrep occurrences of substring old with substring new
s.rfind(sub [,start [,end]])        Finds the last occurrence of substring sub between start and end
s.rindex(sub [,start [,end]])       Same as rfind but raises an exception if sub does not exist
s.rjust(width)                      Right-aligns s in a string of length width
s.rstrip()                          Removes trailing whitespace characters
s.split([sep [,maxsplit]])          Splits s into at most maxsplit words using sep as separator (default whitespace)
s.splitlines([keepends])            Splits s into lines; if keepends is 1, keeps the trailing newline
s.strip()                           Removes trailing and leading whitespace characters
s.swapcase()                        Returns a copy of s with lowercase letters turned into uppercase and vice versa
s.title()                           Returns a title-case version of s (all words capitalized)
s.translate(table [,delchars])      Translates s using translation table table and removing characters in string delchars
s.upper()                           Returns an uppercase version of s

Table 6.2. Boolean methods and operators on strings
Method or operator                      Description
s <, <=, >=, > t                        Checks if s appears before, before or at the same point, after or at the same point, after t in an alphabetically sorted dictionary
s <, <= r <, <= t                       Checks if r appears between s and t in an alphabetically sorted dictionary
s ==, !=, is, is not t                  Checks the identity or difference of s and t
c in s, c not in s                      Checks if character c appears in s
s.endswith(suffix [,start [,end]])      Checks if s ends with suffix
s.isalnum()                             Checks whether all characters are alphanumeric
s.isalpha()                             Checks whether all characters are alphabetic
s.isdigit()                             Checks whether all characters are digits
s.islower()                             Checks whether all characters are lowercase
s.isspace()                             Checks whether all characters are whitespace
s.istitle()                             Checks whether s is title-case, meaning all words are capitalized
s.isupper()                             Checks whether all characters are uppercase
s.startswith(prefix [,start [,end]])    Checks whether s starts with prefix between start and end
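To see a few of the tabulated methods at work, here is a short sketch on a made-up DNA string. The sequence, the variable names and the chosen restriction site are assumptions for illustration only, and the code is written so that it also runs under Python 3.

```python
seq = '  atgaaTTCgaattc\n'

# strip removes the surrounding whitespace, upper returns an uppercase copy.
clean = seq.strip().upper()
site = 'GAATTC'

print(clean)                     # ATGAATTCGAATTC
print(len(clean))                # 14
print(clean.count(site))         # 2
print(clean.find(site))          # 2
print(clean.rfind(site))         # 8
print(clean.startswith('ATG'))   # True
print(clean.isupper())           # True

# split cuts a string into words, join glues a list of words together again.
codons = 'ATG CCA TAG'.split()
print(codons)                    # ['ATG', 'CCA', 'TAG']
print('-'.join(codons))          # ATG-CCA-TAG
```

Note that every method returns a new value; seq itself is never modified.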
Chapter 7.Branching and Decisions
7.1.Conditional execution
Sometimes the continuation of a program depends on a condition. We would either execute a part of the program
only if this condition is fulfilled, or adapt the behavior of the program depending on the truth value of the condition.
Branching or conditional statements
Branching is a feature provided by programming languages that makes it possible to execute one sequence of
statements among several possibilities.
The simplest case of a conditional statement is expressed by the if statement.
>>> from string import *
>>> seq = 'ATGAnnATG'
>>> if 'n' in seq:
...     print "sequence contains undefined bases"
...     nb = count(seq, 'n')
...
sequence contains undefined bases
Figure 7.1 shows a general schema of a simple conditional statement.
Figure 7.1. Flow of execution of a simple condition
[Figure: flowchart of "if condition:". If the condition is true, the if block (block of statements executed only if condition is true) is run; if it is false, the block is skipped.]
The if statement has to be followed by a condition. Then a block is opened by a colon. This block contains the
sequence of statements that has to be executed if the condition is fulfilled. Figure 7.2 and Figure 7.3 highlight the
structural parts of the if statement.
Figure 7.2. If statement
[Figure: the example above, annotated. The statement starts with if, followed by the condition 'n' in seq and a colon; the body of the if statement is the sequence of instructions ONLY executed if the condition is true.]
Figure 7.3. Block structure of the if statement
[Figure: the same example, annotated. The header line ends with a colon (block initiation); the body of the if statement is a block of code whose lines share the same indentation to indicate a block.]
7.2.Conditions and Boolean expressions
The condition in the if statement has to be a boolean expression.
Conditions or Boolean expressions
Conditional or boolean expressions are expressions that are either true or false.
Here are some examples of simple boolean expressions or conditions:
>>> 1 < 0
False
>>> 1 > 0
True
>>> 1 == 0
False
>>> 'n' in 'ATGCGTAnAGTA'
True
>>> 'A' > 'C'
False
>>> 'AG' < 'AA'
False
>>> 'AG' != 'AC'
True
>>> len('ATGACGA') >= 10
False
In Python the value true is written True and the value false is written False; they behave like the integers 1 and 0.
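The relation between truth values and the numbers 1 and 0 can be checked directly. The following sketch uses a made-up sequence; the variable names are assumptions for illustration, and the code also runs under Python 3.

```python
seq = 'ATGCnATGn'

has_n = 'n' in seq
print(has_n)         # True
print(has_n == 1)    # True: a true value compares equal to 1
print((1 > 2) == 0)  # True: a false value compares equal to 0

# Because true counts as 1, summing comparisons counts the matches.
n_count = sum(base == 'n' for base in seq)
print(n_count)       # 2
```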
Table 7.1 lists all boolean operators and their action on strings and numbers.
Table 7.1. Boolean operators
Operator                  Action on strings                                                            Action on numbers
<, <=, >=, >              alphabetically sorted, lower/lower or equal/greater or equal/greater than    lower/lower or equal/greater or equal/greater than
==, !=, is, is not (a)    identity                                                                     identity
in, not in                membership                                                                   -
(a) We will explain the difference between these two identity operators later (Section 11.2).
Important
Do not confuse the assignment sign = with the comparison operator ==. The second one is used to compare
two things and check their equality, whereas the first one is used to bind values to variables. Python
does not accept an assignment as a condition, but there are other programming languages that use the
same syntax for these two statements and do not warn you when you use = instead of ==.
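That Python rejects an assignment used as a condition can be verified without crashing the session. The sketch below hands a small piece of source text to the builtin compile function and catches the resulting SyntaxError; the variable names are assumptions for illustration.

```python
# Source text that wrongly uses = where == is meant.
source = "if x = 1:\n    pass\n"

try:
    compile(source, '<example>', 'exec')
    accepted = True
except SyntaxError:
    accepted = False
print(accepted)   # False: the assignment is refused as a condition

x = 1             # assignment: binds the value 1 to the name x
print(x == 1)     # comparison: True
```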
7.3.Logical operators
The three logical operators not, and and or enable you to combine boolean expressions and in this way to
construct more complex conditions. Here are some examples:
>>> seq = 'ATGCnATG'
>>> 'n' in seq or 'N' in seq
True
>>> 'A' in seq and 'C' in seq
True
>>> 'n' not in seq
False
>>> len(seq) > 100 and 'n' not in seq
False
>>> not len(seq) > 100
True
Caution
The or operation is not an exclusive or, as it is sometimes used in everyday language. An or expression
is also true if both subexpressions are true.
>>> seq = 'ATGCnATG'
>>> 'A' in seq or 'C' in seq
True
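If an exclusive or is really what you need, one common way to express it is to test whether the two truth values differ. This is only a sketch with assumed variable names; it also runs under Python 3.

```python
seq = 'ATGCnATG'

has_a = 'A' in seq
has_c = 'C' in seq

inclusive = has_a or has_c   # true if at least one of them is true
exclusive = has_a != has_c   # true only if exactly one of them is true

print(inclusive)  # True
print(exclusive)  # False, because both bases occur in seq
```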
7.4.Alternative execution
If the condition of an if statement is not fulfilled, no statement is executed.
>>> seq = 'ATGACGATAG'
>>> if 'n' in seq:
...     print "sequence contains undefined characters"
...
>>>
An alternative sequence of statements, which will be executed if the condition is not fulfilled, can be specified with
the else statement.
>>> seq = 'ATGACGATAG'
>>> if 'n' in seq:
...     print "sequence contains undefined bases"
... else:
...     print "sequence contains only defined bases"
...
sequence contains only defined bases
Here the if and else are followed by a block containing the statements to execute depending on the truth value
of the condition. In this case exactly one of them is executed, which is illustrated in Figure 7.4.
Figure 7.4. Flow of execution of an alternative condition
[Figure: flowchart of "if condition: ... else: ...". If the condition is true, the if block is run (block of statements executed only if condition is true); if it is false, the else block is run (block of statements executed only if the condition is false).]
7.5.Chained conditional execution
In Python,you can specify more than one alternative:
>>> seq = 'vATGCAnATG'
>>> base = seq[0]
>>> base
'v'
>>> if base in 'ATGC':
...     print "exact nucleotide"
... elif base in 'bdhkmnrsuvwxy':
...     print "ambiguous nucleotide"
... else:
...     print "not a nucleotide"
...
ambiguous nucleotide
The elif statement is used to give an alternative condition. What happens when this is executed? The
conditions are evaluated from top to bottom. In our example with base = 'v', the first condition, the if
condition base in 'ATGC', is false, so the next condition, that of the elif statement, is evaluated. base
in 'bdhkmnrsuvwxy' is true, so the block of statements of this clause is executed and ambiguous
nucleotide is printed. Then the evaluation of the conditions stops and the flow of execution continues
with the statement following the if-elif-else construction.
Multiple alternative conditions
Multiple alternative conditions are conditions that are tested from top to bottom. The clause of statements for
the first alternative that is evaluated as true is executed. So at most one alternative is executed (exactly one if
an else clause is present), even if more than one condition is true. In this case the clause of the first true
condition encountered is chosen.
Figure 7.5 illustrates this.
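The top-to-bottom rule can be sketched with two small functions. Both names (classify_base and size_class) and the length thresholds are assumptions chosen for illustration; classify_base expects a single character, and the code also runs under Python 3.

```python
def classify_base(base):
    """Classify a single character as in the example above."""
    if base in 'ATGC':
        return "exact nucleotide"
    elif base in 'bdhkmnrsuvwxy':
        return "ambiguous nucleotide"
    else:
        return "not a nucleotide"

def size_class(length):
    """Only the clause of the first true condition is executed."""
    if length > 20:
        return "long"
    elif length > 10:    # also true for 30, but never reached in that case
        return "medium"
    else:
        return "short"

print(classify_base('v'))  # ambiguous nucleotide
print(size_class(30))      # long
print(size_class(15))      # medium
```

For a length of 30 both conditions are true, yet only the first clause is chosen.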
Figure 7.5. Multiple alternatives or chained conditions
[Figure: flowchart and schema of "if condition: ... elif second_condition: ... else: ...". The if block is executed only if condition is true; the elif block is executed only if second_condition is true and condition is false; the else block is executed only if all conditions are false.]
The else statement is optional, but it is safer to catch the case where all of the given conditions are false.
Exercise 7.1.Chained conditions
The elif statement only facilitates the writing and legibility of multiple alternative conditions.How would you
write a multiple condition without this statement (Solution 7.1)?
Hint:See the scheme of Figure 7.5.
7.6.Nested conditions
However, constructions with multiple alternatives are sometimes not sufficient, and you need to nest conditions like
this:
>>> primerLen = len(primer)
>>> primerGC = 100.0 * (count(primer, 'g') + count(primer, 'c')) / primerLen   # GC content in percent
>>> if primerGC > 50:
...     if primerLen > 20:
...         PCRprogram = 1
...     else:
...         PCRprogram = 2
... else:
...     PCRprogram = 3
...
Exercise 7.2.Nested condition
Why is it impossible to write the above example as a chained condition?
Figure 7.6 shows the scheme of nested conditions.
Figure 7.6. Nested conditions
[Figure: flowchart and schema of nested conditions. If condition is true, second_condition is tested: the if_if block is executed only if condition and second_condition are true, the if_else block only if condition is true. If condition is false, third_condition is tested: the else_if block is executed only if condition is false and third_condition is true, the else_else block only if condition is false and third_condition is false. The figure also asks: "Why are they joint here first?" and "What is known about the second condition here?"]
Sometimes you can simplify nested conditions by constructing more complex conditions with boolean operators.
>>> primerLen = len(primer)
>>> primerGC = 100.0 * (count(primer, 'g') + count(primer, 'c')) / primerLen   # GC content in percent
>>> if primerGC > 50:
...     if primerLen > 20:
...         PCRprogram = 1
...     else:
...         PCRprogram = 2
...
can be expressed as:
>>> if primerGC > 50 and primerLen > 20:
...     PCRprogram = 1
... else:
...     PCRprogram = 2
...
Caution
Even if the second version is easier to read, be careful and always check whether the complex condition
you have written is what you really want. Such errors are called semantic errors. They cannot be detected
by the interpreter because the syntax of the program is correct, even if it is not necessarily what you want
to compute.
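One way to check a rewritten condition is to run both versions on a few test values and compare the results. The sketch below does this for the three-branch primer example of this section and the combined condition; the function names nested and combined and the test values are assumptions for illustration, and the code also runs under Python 3.

```python
def nested(gc_percent, length):
    """Three-branch version with nested conditions."""
    if gc_percent > 50:
        if length > 20:
            program = 1
        else:
            program = 2
    else:
        program = 3
    return program

def combined(gc_percent, length):
    """Version with one complex condition built with 'and'."""
    if gc_percent > 50 and length > 20:
        program = 1
    else:
        program = 2
    return program

# Compare both versions on a high-GC long, a high-GC short and a low-GC primer.
for gc_percent, length in [(60, 25), (60, 15), (40, 25)]:
    print("%d %d -> %d %d" % (gc_percent, length,
                              nested(gc_percent, length),
                              combined(gc_percent, length)))
```

The two functions agree on the first two cases but not on the low-GC primer (3 versus 2). Both are syntactically correct, so only such a comparison reveals the difference.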
7.7.Solutions
Solution 7.1.Chained conditions
Exercise 7.1
Figure 7.7. Multiple alternatives without elif
[Figure: two equivalent schemas, one with elif and one without.]
With elif:
if condition:
    block of statements executed only if condition is true
elif second_condition:
    block of statements executed only if the second condition is true
else:
    block of statements executed only if all conditions are false
Without elif:
if condition:
    block of statements executed only if condition is true
else:
    if second_condition:
        block of statements executed only if the second condition is true
    else:
        block of statements executed only if all conditions are false
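The scheme of Figure 7.7 can be tried out on the nucleotide example of Section 7.5. This is a sketch only: the function names with_elif and without_elif and the short result labels are assumptions for illustration, and the code also runs under Python 3.

```python
def with_elif(base):
    if base in 'ATGC':
        result = "exact"
    elif base in 'bdhkmnrsuvwxy':
        result = "ambiguous"
    else:
        result = "unknown"
    return result

def without_elif(base):
    if base in 'ATGC':
        result = "exact"
    else:
        # The second test is nested in the else block of the first one.
        if base in 'bdhkmnrsuvwxy':
            result = "ambiguous"
        else:
            result = "unknown"
    return result

# Both versions give the same answer for every tested character.
for base in 'Avz':
    print("%s %s %s" % (base, with_elif(base), without_elif(base)))
```

Each additional alternative adds one more level of indentation in the version without elif, which is why elif improves legibility.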
Chapter 8.Defining Functions
8.1.Defining Functions
In Section 3.3 we have learnt how to apply or call functions. So let's recall the example calculating the
GC percentage of a DNA sequence.
>>> float(count(cds,'G') + count(cds,'C'))/len(cds)
This calculates the GC percentage of the specific DNA sequence cds, but we can use the same formula to calculate
the GC percentage of other DNA sequences. The only thing to do is to replace cds by the new DNA sequence in
the formula. But it is not very convenient to remember the formula and retype it all the time. It would be much
easier to type the following instead.
>>> gc('ATGCAT')
0.33333333333333331
>>> gc(cds)
0.54460093896713613
The only thing we have to remember is the name of the new function and its use.
Abstraction
The possibility to define such new functions, executing tasks specified by yourself, is an abstraction feature
provided by all high level programming languages.
Important
It is also important to notice that functions have to be defined before they are called. You cannot use
something that is not defined.
Here is the syntax of such a new definition in Python:
>>> from string import *
>>> def gc(seq):
...     return float(count(seq, 'G') + count(seq, 'C')) / len(seq)
...
Let's have a closer look at this definition. Figure 8.1 illustrates the structure of a function definition.
Figure 8.1. Function definitions
[Figure: the definition of gc, annotated. def: statement indicating a new function definition in Python, which has to be followed by the function name, a comma separated list of parameters and a colon indicating the start of a new block. seq: parameter (a sort of place holder); during a function call the parameter is replaced by its argument value. "...": second prompt indicating a new block. Indentation: defining a block. return: statement used to return back a result of a calculation.]
def and return are basic instructions.
Basic instruction
Basic instructions are statements that define the language rules and the semantics of Python. They constitute the
basic set of instructions of Python. Each basic instruction has its own syntax that you have to learn in order to
master the programming language.
The return basic instruction is used to return the result of a function, in our example the value of the GC
percentage of the specified DNA sequence.
The def basic instruction indicates to Python that a function definition follows. It has to be followed by the new
function name and a comma separated list of parameter names enclosed in parentheses.
Parameter
Parameters are variable names. When the function is called, they are bound in the same order to the arguments
given.
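The order in which arguments are bound to parameters can be illustrated with a function taking two parameters. The function count_base is a hypothetical helper introduced only for this sketch; the code also runs under Python 3.

```python
def count_base(seq, base):
    """Return how often base occurs in seq."""
    return seq.count(base)

# The first argument is bound to seq, the second one to base.
print(count_base('ATGCAT', 'A'))  # 2

# Swapping the arguments swaps the bindings: now seq is 'A' and base is 'ATGCAT'.
print(count_base('A', 'ATGCAT'))  # 0
```

The second call is syntactically correct but answers a different question, because the arguments are matched to the parameters purely by position.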
The body of a function contains the piece of code needed to execute the subtask of the function. In the example
above, the body contains only the return statement. Here is a more complex example that excludes ambiguous
bases from the GC percentage calculation.
>>> from string import *
>>> def gc(seq):
...     nbases = count(seq, 'N')
...     gcpercent = float(count(seq, 'G') + count(seq, 'C')) / (len(seq) - nbases)
...     return gcpercent
...
In this example the body of the function contains three instructions (two assignments and the return statement). The
body of a function follows the definition line and is written as an indented block initiated by a colon.
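The same function can also be written with string methods instead of the functions of the string module, which lets it run under Python 3 as well. This is a sketch under stated assumptions: the variable name defined and the explicit check for sequences without any defined base are additions made here, not part of the course example.

```python
def gc(seq):
    """Return the GC fraction of seq, ignoring ambiguous 'N' bases."""
    nbases = seq.count('N')
    defined = len(seq) - nbases
    if defined == 0:
        # Assumption of this sketch: refuse sequences consisting only of N.
        raise ValueError("sequence contains no defined bases")
    gcpercent = float(seq.count('G') + seq.count('C')) / defined
    return gcpercent

print(gc('ATGCAT'))  # 2 of 6 bases are G or C
print(gc('ATGCNN'))  # 0.5, the two N are not counted
```

Without the check, a sequence made only of N would lead to a division by zero in the body of the function.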
Block