Identification of Orthologous Gene Sequences

ticketdonkeyAI and Robotics

Nov 25, 2013 (3 years and 10 months ago)

82 views

Identification of Orthologous Gene Sequences


Background

Orthology and paralogy are important concepts in understanding how genomes evolve and how
biological novelty arises. Orthology

arises from the speciation process, and orthologous gene
sequences in different species are the product of vertical descent from a single gene in the most
recent common ancestor

of the two species in question (Fitch 1970). Paralogous gene
sequences are t
he result of a gene duplication event (Fitch 1970). It is important for biologists
to be able to recognize the difference between these two classes of gene sequences
, especially
when one wishes to re
-
construct an accurate phylogeny or determine the forces

of natural
selection.

There are numerous methods that attempt to identify orthologous from paralogous gene
sequences. Many are based on a simple approach called reciprocal best hit BLAST (RBHB)
searches. These methods use BLAST searches between sequence

data (DNA or amino acid)
from different species. BLAST searches are done reciprocally, meaning in a two species RBHB
search each species is used as the query and subject during the search process.
If you have
one pair of sequences from each of the two s
pecies that are each other’s ‘best’ hit from the
reciprocal search they can be considered orthologs. Inparanoid initially uses this approach.
However
,

it creates ‘cluster
s
’ or groups of genes that have high similarity based on
BLAST
scores. It then uses

bootstrapping (a statistical approach that is based on re
-
sampling the data)
to give confidence scores to the putative ortholog pair
. A low confidence score maight indicate
that you cannot find an ortholog for that particular set of genes in your data se
t.

Statement of Module Goals

By the end of this exercise you will be able to:

-
Utilize BLAST searches and a clustering algorithm to identify orthologous gene sequences

-
Align orthologous sequences from two species

-
Prepare these alignments for further anal
ysis

-
Estimate pairwise levels of sequence divergence between the pairs of orthologous sequences.

V & C Core Competencies

Ability to apply the process of science

Ability to use quantitative reasoning

Ability to tap into the interdisciplinary nature of
science

GCAT
-
SEEK Sequencing requirements

Data that can be used in this module include:

-
Assembled transcriptomes (RNA
-
SEQ) from any sequencing platform

-
Annotated gene sequences from genomic assemblies (any platform)

-
The ideal set of data will have the
open reading frame for the DNA sequences in one FASTA
file while the other FASTA file will have the corresponding translated amino acid sequences. If
you

have

transcriptome data that might contain UTRs you can use a program like ORF
-
Predictor (
http://prot
eomics.ysu.edu/tools/OrfPredictor.html
) that will give you the coding
sequence and the corresponding translated amino acid sequence.

There are numerous
methodologies that can be used to identify and extract coding sequences from transcriptomic or
genomic
data.

Protocols

Please check with your system administrator before installing anything on a shared computer.

For this module you will need to have the following programs installed:

BLAST, ClustalW, Ka_Ks_calculator

This can be done by typing

$sudo apt
-
get

install blast2

$sudo apt
-
get install clustallw

Ka_Ks_calculator is found on the following website (
http://code.google.com/p/kaks
-
calculator/
).
It is possible to
install this program onto a Linux
machine, however there are difficulties in
compiling the program. A suggestions, while more laborious but effective, is to install the
program on a Mac or PC. This is done easily from the download. You will have to move over
any files to use this progra
m, but at least it will work this way.

You will also need to have the following Perl scripts:

inparanoid.pl


this can be obtained from the following website:
http://software.sbc.su.se/cgi
-
bin/request.cgi?project=inparanoid

pal2nal.pl


provided here, but information can be found here:
http://www.bork.embl.de/pal2nal/

align2pairs.pl


provided here

Once you get the inparanoid package extract the files to a folder. I find it easier to

do

the initial
analyses in this folder.
T
hen copy the necessary files to another folder
for

further analysis.

In the extracted Inparanoid folder find the folder inparanoid_4.1/

The inparanoid.pl script should be in here
, along with numerous other files. Now would be a
good time to read the READM
E file
.
This file contains some basic information on the method
and useful parameters. There are a number of parameters you can change in the inparanoid.pl
file. We will leave the default parameters, but it might be useful to explore some of these
chang
es later. For instance you could use an ‘outgroup’ species to help define your orthologous
sequences. You could also use different protein substitution matrices, which might be
dependent on how closely related your species that you are comparing are.

Cop
y the two files you want to perform the orthology analysis
in to this folder
. This initial
analysis is initially done on protein sequences. The files you will need are:

Ssaxicola_pep.fas

Sgoodei_pep.fas


Type the following at the command prompt:

$perl in
paranoid.pl Ssaxicola_pep.fas Sgoodei_pep.fas

You should not see any errors if the analysis is done correctly. A number of files will be
produced. Let

s explore them.

o
rthologs.Ssaxicola_pep.fas_Sgoodei_pep.fas.html

This is an html file that contains the

results of the inparanoid clustering. All significant ortholog
groups are reported. For each set of orthologs the sequences are given and the
confidence for
each group is

given. How many different ortholog groups are found for this data set?

The same
information is given in a text file: output.Ssaxicola_pep.fas_Sgoodei_pep.fas.html

An additional file that will of
u
se to us is: table.
Ssaxicola_pep.fas_Sgoodei_pep.fas.html. This
information is also given in sql

format (sqltable.Ssaxicola_pep.fas_Sgoodei_pep.fas.html) if you
are verse
d

in sql. Open table

table.
Ssaxicola_pep.fas_Sgoodei_pep.fas.html. Each lin
e
contains the ortollog group (1
st column), the score, the sequence identifier from file A,

and

the
seque
nce identifier from file B. The last two are followed by confidence scores. We need to
extract the sequence identifiers for each set of sequences so we can align them for further
analyses. This is probably most easily done in a spreadsheet program.
Cop
y them into a file so
that it looks like this (tab delimited):

Seq_A_1

Seq_B_1

Seq_A_2

Seq_B_2



And save the file as IDs.txt. Please note that this is a very simple dataset. ‘Real world’ data
may contain ortholog groups that contain more than one sequen
ce from a particular species.
This is especially true if you happen to sequence genes from a gene family, have alternatively
spliced genes, or if you have incomplete genes (for example from transcriptomes) in your data
set. If you do get this you might w
ant to filter your orthologs so that there is only one sequence
from each species found.

Once you have the IDs.txt file copy it to another folder.

Also copy Ssaxicola_pep.fas,
Sgoodei_pep.fas Ssaxicola_cds.fas, and Sgoodei_cds.fas to that folder. What w
e are going to
be doing next is creating a larger file that contains all the protein sequences (protnr) and all the
coding sequence in another (nucnr). This can easily be done in a text editor or if you want to
impress someone with your Unix skills try:

$
cat Ssaxicola_pep.fas Sgoodei_pep.fas > protnr

$cat Ssaxicola_cds.fas Sgoodei_cds.fas > nucnr

In order for our alignment script to work we need to format these files for a BLAST search. This
is why I had you do this analysis in another folder,
it’s

going
to get crowded fast. To format
these for a BLAST search do the following:

$formatdb

i protnr p T

o T

$formatdb

i nucnr p F

o T

If everything goes well you should not have any errors. Now we can search the databases and
pull out sequences for each
ortholog pair.

$perl align2pairs.pl

You should now have 49 files starting with ‘contig’ and 49 files starting with
‘pal2nal’. You can
alter the script to change the output file names. The contig files contain the aligned orthologs
based on amino acid seq
uence and the pal2nal files contain aligned nucleotide sequences
based on the amino acid alignments.
The pal2nal_all.fasta contains all of the alignments in a
single FASTA file.

There is an additional program that can be used to make pairwise comparisons
for all of your
alignments, it

s called Ka_Ks_calculator (
http://code.google.com/p/kaks
-
calculator/
). This
program is a little more difficult to get running on a Linux machine, but it does have a nice GUI
that runs in Windows or Mac. To use this we need
to convert either the pal2nal_all.fasta

or
each individual pal2nal file to what is known as AXT format. This is easily done on the
pal2nal_all.fasta file by typing the following on the command line:

$
grep
-
i
-
v Ssax pal2nal_all.fasta > Ssax_Sgoo.axt

This
command removes all lines from the file pal2nal_all.fasta that contain ‘Ssax’ and copies
everything to a new file named ‘Ssax_Sgoo.axt’.

$perl

p

i.bak

e ‘s/
\
>lcl
\
|//g’ Ssax_Sgoo.axt

This
command is a perl one liner that deletes “>lcl|” from each instanc
e it occurs

in
Ssax_Sgoo.axt
. You could also use find and replace in a text editor

if you wanted to
.
The

i.bak command creates a backup of the original file, just in case.

The backup file is named
Ssax_Sgoo.axt
.bak.

Now type:

$perl

p

i.bac

e ‘s/Sg/
\
nSg/g’ Ssax_Sgoo.axt

This perl one liner takes each instance of Sg (the sequence identifier in this case) and replaces
it with a carriage return

and a Sg
. Basically putting a return between each pair of sequences. It
also puts one at the beginning of the

file that you will have to remove manually.
A backup file
named
Ssax_Sgoo.axt
.bac is also created.

The file Ssax_Sgoo.axt can now be opened in Ka_Ks_calculator.

Open the Ka_Ks_calculator
GUI and import your .axt file. Select a simple model for analysis
, let

s say YN (check the box),
and
press Run. You should get something like this.


The analysis preformed contains pairwise estimated of synonymous and non
-
synonymous
substitutions for all ortholog

pairs. This serves a number of functions. One is for quality control.
If you do have othrologs and you do have good alignments your Ks values should be below 0.1
for all comparisons.
Values greater than this might indicate bad alignments or alignments

of
non
-
orthologous genes. This also tells us information about the ratio of non
-
synonymous to
synonymous substitutions (Ka/Ka). Most genes will have a values <1, indicating purifying
selection. However genes that have a value greater than 1 might be
subject to positive
Darwinian selection. You can export this table and open it in a spreadsheet program for further
analysis.

Assessment

Students will be asked to conduct the outlined analysis. Additional assessment could come in
the form of a lab repo
rt where students are asked to present results of the orhtology analysis
and the pairwise sequence comparisons. Visual examples of pairwise alignments (good and
bad) could be presented as well as a table/graph of Ka/Ks values.

Timeline of Module

The modul
e should be performed in a typical 2
-
3 hour lab session.

Discussion Topics

Along the way students should recognize that not all genes from each data set will form
significant orthologous pairs. Why is this? This particular data set is from two transcript
omes, it
is possible that not all genes ar
e present, or that orthologs ha
ve been sequenced in one data
set or the other.
What would students expect to see if whole genome data sets were utilized?
How would students deal with instances where multiple para
logs were identified? Students
should also explore the alignments that automatically produced by pal2nal. Are they adequate?
Is there another method to improve this (muscle, MAFFT)? Does this method still require some
manual checks along the way? Las
tly, students should try to interpret the patterns of selection.

Lecture Topics

Students should be introduced into the basic concepts of molecular evolution and genomics.
Pre
-
lab lecture topics should include the basic principles of how genes and genomes evolve
(mutation, recombination, natural selection, genome/gene duplication). Ideally there would also
be a previous introduction of transcriptome/genome assembly

and anno
tation and the basic
principles behind searching large sequence databases (e.g. BLAST).

References

Berglund AC, Sjolund E, Ostlund G and Sonnhammer ELL

(2008)

InParanoid 6: eukaryotic
or
tholog clusters with inparalogs
Nucleic Acids Res. 36:D263
-
266

Fitch

W. (1970).
Distinguishing hom
ologous from analogous proteins
.
Syst Zool

19

(2): 99

113


O'Brien Kevin P, Remm Maido and Sonnhammer Erik L.L

(2005)
Inparanoid: A Comprehensive
D
atabase of Eukaryotic Orthologs

N
ucleic
A
cids
R
research

33:D476
-
D480

Ostlund

G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O and
Sonnhammer ELL
(2009)
InParanoid 7: new algorithms and tools for eukaryotic ortholog
y
analysis

Nucleic Acids Res. 38:D196
-
D203

Remm

M
, Storm

CEV
, and Sonnhammer

ELL (2001)
Automatic

clustering of orthologs and in
-
paralogs fr
om pairwise species comparisons

J.

Mol. Biol. 314:1041
-
1052

Suyama

M
, Torrents

D
, and Bork

P

(2006)

PAL2NAL: robust conversion of protein sequence
alignments into the corresponding codon alignments.

Nucleic Acids

Res.

34
, W609
-
W612.

Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J: KaKs Calculator: Calculating Ka and Ks
through model selection and model averaging.
Genomics Proteomics Bioinformatics

2006 ,
4:259
-
263

http://www.bork.embl.de/pal2nal/

http://code.google.com/p/kaks
-
calculator/