# Bioinformatics Using Python_001x - bio-bio-1

Problem No 1

Determination of GC Content

Among the four nucleotides {A, T, C, G}, the proportion of C and G in a DNA sequence carries important biological signals. This proportion is measured as “GC-content %” using the following formula.

GC-content % = ((n(G) + n(C)) / len(DNA)) * 100%

Where

n(G) = number of G in the sequence

n(C) = number of C in the sequence

len(DNA) = length of the DNA sequence in base pairs (bp)

Write a Python program that behaves as described below.

Input

1. DNA sequence as a string

Output

1. Length of the sequence

2. GC-content %

Example

Input

ATCG

Output

Length of the sequence = 4 bp

GC-content % = 50%
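
One possible way to apply the formula above is sketched here. The function name `gc_content`, the case-insensitive handling, and the rejection of empty input are assumptions made for illustration; the problem statement does not prescribe them.

```python
def gc_content(dna: str) -> float:
    """Return the GC-content of a DNA sequence as a percentage."""
    seq = dna.upper()  # assumption: accept lower-case input as well
    if not seq:
        # assumption: an empty sequence has no defined GC-content
        raise ValueError("DNA sequence must not be empty")
    # (n(G) + n(C)) / len(DNA) * 100
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


sequence = "ATCG"
print(f"Length of the sequence = {len(sequence)} bp")
print(f"GC-content % = {gc_content(sequence):.0f}%")
```

For the example input ATCG, two of the four bases are G or C, so the sketch reports 50%.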


Problem No 2

Complement DNA strand

DNA forms a double helix made of two strands. When we work with a DNA sequence, we usually talk about a single sequence (a single strand), but in the chromosome DNA remains in double-stranded form. The two strands are called complements of each other. One is named 5’-3’ (forward strand) and the other 3’-5’ (reverse strand). When the strand type (or direction) is not explicitly mentioned, the DNA sequence is assumed to be the 5’-3’, or forward, strand.

5’---ACCGTA---3’

     ||||||

3’---TGGCAT---5’

In a complement DNA strand, each base of the original DNA sequence is replaced according to the following interchanging rule:

A is replaced by T and vice versa

C is replaced by G and vice versa

This is because, in the double helix structure, an A on one strand is connected to a T on the other strand by hydrogen bonds, and the same holds for C and G.

Write a Python program that behaves as described below.

Input

1. DNA sequence as a string

Output

1. Complement of input DNA sequence

Example

Input

ACCGTA

Output

Complement DNA Sequence = TGGCAT
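
The interchanging rule can be sketched as a character-by-character translation. The names `COMPLEMENT` and `complement` are illustrative, and upper-casing the input is an assumption. Characters other than A, T, C, G are passed through unchanged by `str.translate`.

```python
# Translation table implementing A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def complement(dna: str) -> str:
    """Return the complement strand, base by base, in the same order."""
    return dna.upper().translate(COMPLEMENT)


print("Complement DNA Sequence =", complement("ACCGTA"))
```

For the example input ACCGTA this prints the complement TGGCAT.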


Problem No 3

Reverse Complement of a DNA Sequence

This problem can be thought of as an extension of Problem No 2 (read Problem No 2 first). The concept of a reverse complement DNA sequence is encountered very often in bioinformatics analysis. If the complement of a DNA sequence is reversed, the result is called the reverse complement of the original DNA sequence.

5’---ACCGTA---3’

     ||||||

3’---TGGCAT---5’

Here the complement of (5’---ACCGTA---3’) is (3’---TGGCAT---5’), and the reverse of (3’---TGGCAT---5’) is TACGGT, so the reverse complement of ACCGTA is TACGGT.

Write a Python program that behaves as described below.

Input

1. DNA sequence as a string

Output

1. Reverse Complement of input DNA sequence

Example

Input

ACCGTA

Output

Reverse Complement DNA Sequence = TACGGT
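
A minimal sketch of the two steps described above: complement each base, then reverse the result. The names `COMPLEMENT` and `reverse_complement` are hypothetical, and upper-casing the input is an assumption.

```python
# Translation table implementing A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ATCG", "TAGC")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the resulting string."""
    complemented = dna.upper().translate(COMPLEMENT)
    return complemented[::-1]  # slicing with step -1 reverses the string


print("Reverse Complement DNA Sequence =", reverse_complement("ACCGTA"))
```

Because complementing works base by base, reversing first and complementing second gives the same result.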


Problem No 4

Codon List from a DNA sequence

Triplets of nucleotides (for example ATT, TCG, CCC) are called codons. Through the processes of transcription and translation, each codon of a DNA sequence is responsible for producing one amino acid, and the resulting chain of amino acids builds a protein. 64 (4x4x4) different codons are possible.

Consider the DNA sequence ATTTCGAGGT. If we parse codons from left to right, the codons are ATT, TCG, AGG (ignore the right-most remaining part with length < 3 bp, in this case T).

Write a function/Python program that returns the list of codons for a DNA sequence. The program should return/print the codons as a Python “list” data structure.

Input

1. DNA sequence as a string

Output

1. List of Codons

Example

Input

ATTTCGAGGT

Output

Codon-List = ['ATT', 'TCG', 'AGG']
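
The left-to-right parsing described above can be sketched with string slicing in steps of three. The function name `codon_list` is illustrative, and upper-casing the input is an assumption.

```python
def codon_list(dna: str) -> list[str]:
    """Split a DNA sequence into codons, dropping an incomplete tail."""
    seq = dna.upper()
    # Largest multiple of 3 that fits; the leftover 1 or 2 bases are ignored.
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


print("Codon-List =", codon_list("ATTTCGAGGT"))
```

For ATTTCGAGGT (10 bp) only the first 9 bases are used, giving three codons and ignoring the trailing T.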


Problem No 5

Translate a DNA Sequence

Each codon represents an amino acid (skim through Problem No 4). The standard codon-to-amino-acid mapping table is called the “Standard Genetic Code Table” or “Codon Table”. It is built for codons derived from RNA (the details will be discussed separately, beyond this problem), so you will find U in place of T. For simplicity, in this specific problem you should use the customized (for DNA) genetic code table.

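Standard Genetic Code

UUU: Phe (F) UCU: Ser (S) UAU: Tyr (Y) UGU: Cys (C)

UUC: Phe (F) UCC: Ser (S) UAC: Tyr (Y) UGC: Cys (C)

UUA: Leu (L) UCA: Ser (S) UAA: Stop UGA: Stop

UUG: Leu (L) UCG: Ser (S) UAG: Stop UGG: Trp (W)

CUU: Leu (L) CCU: Pro (P) CAU: His (H) CGU: Arg (R)

CUC: Leu (L) CCC: Pro (P) CAC: His (H) CGC: Arg (R)

CUA: Leu (L) CCA: Pro (P) CAA: Gln (Q) CGA: Arg (R)

CUG: Leu (L) CCG: Pro (P) CAG: Gln (Q) CGG: Arg (R)

AUU: Ile (I) ACU: Thr (T) AAU: Asn (N) AGU: Ser (S)

AUC: Ile (I) ACC: Thr (T) AAC: Asn (N) AGC: Ser (S)

AUA: Ile (I) ACA: Thr (T) AAA: Lys (K) AGA: Arg (R)

AUG: Met (M) ACG: Thr (T) AAG: Lys (K) AGG: Arg (R)

GUU: Val (V) GCU: Ala (A) GAU: Asp (D) GGU: Gly (G)

GUC: Val (V) GCC: Ala (A) GAC: Asp (D) GGC: Gly (G)

GUA: Val (V) GCA: Ala (A) GAA: Glu (E) GGA: Gly (G)

GUG: Val (V) GCG: Ala (A) GAG: Glu (E) GGG: Gly (G)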


Customized (for DNA) Genetic Code

ttt: F tct: S tat: Y tgt: C

ttc: F tcc: S tac: Y tgc: C

tta: L tca: S taa: * tga: *

ttg: L tcg: S tag: * tgg: W

ctt: L cct: P cat: H cgt: R

ctc: L ccc: P cac: H cgc: R

cta: L cca: P caa: Q cga: R

ctg: L ccg: P cag: Q cgg: R

att: I act: T aat: N agt: S

atc: I acc: T aac: N agc: S

ata: I aca: T aaa: K aga: R

atg: M acg: T aag: K agg: R

gtt: V gct: A gat: D ggt: G

gtc: V gcc: A gac: D ggc: G

gta: V gca: A gaa: E gga: G

gtg: V gcg: A gag: E ggg: G

There are 20 different amino acids. The detailed table is given below.

20 Amino Acids and Their Codes

No. 1-Letter Code 3-Letter Code Name

1 A Ala Alanine

2 R Arg Arginine

3 N Asn Asparagine

4 D Asp Aspartic acid

5 C Cys Cysteine

6 Q Gln Glutamine

7 E Glu Glutamic acid

8 G Gly Glycine

9 H His Histidine

10 I Ile Isoleucine

11 L Leu Leucine

12 K Lys Lysine

13 M Met Methionine

14 F Phe Phenylalanine

15 P Pro Proline

16 S Ser Serine

17 T Thr Threonine

18 W Trp Tryptophan

19 Y Tyr Tyrosine

20 V Val Valine

Write a function/program that takes a DNA sequence and returns/prints the translated protein sequence (using the customized codon table, and representing amino acids using 1-letter codes).

Ignore the right-most incomplete codon of length < 3 bp, as explained in Problem No 4.

Input

1. DNA sequence as a string

Output

1. Amino Acid Sequence of Protein

Example

Input

TTTCCTAATC

Output

Protein Sequence = FPN
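
A possible sketch of the translation step, building the customized (for DNA) table programmatically instead of typing all 64 entries. The names `CODON_TABLE` and `translate` are hypothetical. Two behaviours here are assumptions, since the problem does not specify them: stop codons are emitted as `*` (as in the customized table) and translation continues past them, and any codon containing a character other than A, T, C, G raises a `KeyError`.

```python
BASES = "TCAG"
# 1-letter amino-acid codes in the order TTT, TTC, TTA, TTG, TCT, ..., GGG;
# '*' marks a stop codon, as in the customized table.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna: str) -> str:
    """Translate a DNA sequence codon by codon into 1-letter amino-acid codes."""
    seq = dna.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete right-most codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


print("Protein Sequence =", translate("TTTCCTAATC"))
```

For TTTCCTAATC the codons are TTT, CCT and AAT (the trailing C is ignored), which map to F, P and N.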