Identification of Orthologous Gene Sequences

ticketdonkeyAI and Robotics

Nov 25, 2013 (3 years and 8 months ago)


Identification of Orthologous Gene Sequences


Orthology and paralogy are important concepts in understanding how genomes evolve and how
biological novelty arises. Orthology

arises from the speciation process, and orthologous gene
sequences in different species are the product of vertical descent from a single gene in the most
recent common ancestor

of the two species in question (Fitch 1970). Paralogous gene
sequences are t
he result of a gene duplication event (Fitch 1970). It is important for biologists
to be able to recognize the difference between these two classes of gene sequences
, especially
when one wishes to re
construct an accurate phylogeny or determine the forces

of natural

There are numerous methods that attempt to identify orthologous from paralogous gene
sequences. Many are based on a simple approach called reciprocal best hit BLAST (RBHB)
searches. These methods use BLAST searches between sequence

data (DNA or amino acid)
from different species. BLAST searches are done reciprocally, meaning in a two species RBHB
search each species is used as the query and subject during the search process.
If you have
one pair of sequences from each of the two s
pecies that are each other’s ‘best’ hit from the
reciprocal search they can be considered orthologs. Inparanoid initially uses this approach.

it creates ‘cluster
’ or groups of genes that have high similarity based on
scores. It then uses

bootstrapping (a statistical approach that is based on re
sampling the data)
to give confidence scores to the putative ortholog pair
. A low confidence score maight indicate
that you cannot find an ortholog for that particular set of genes in your data se

Statement of Module Goals

By the end of this exercise you will be able to:

Utilize BLAST searches and a clustering algorithm to identify orthologous gene sequences

Align orthologous sequences from two species

Prepare these alignments for further anal

Estimate pairwise levels of sequence divergence between the pairs of orthologous sequences.

V & C Core Competencies

Ability to apply the process of science

Ability to use quantitative reasoning

Ability to tap into the interdisciplinary nature of

SEEK Sequencing requirements

Data that can be used in this module include:

Assembled transcriptomes (RNA
SEQ) from any sequencing platform

Annotated gene sequences from genomic assemblies (any platform)

The ideal set of data will have the
open reading frame for the DNA sequences in one FASTA
file while the other FASTA file will have the corresponding translated amino acid sequences. If


transcriptome data that might contain UTRs you can use a program like ORF
Predictor (
) that will give you the coding
sequence and the corresponding translated amino acid sequence.

There are numerous
methodologies that can be used to identify and extract coding sequences from transcriptomic or


Please check with your system administrator before installing anything on a shared computer.

For this module you will need to have the following programs installed:

BLAST, ClustalW, Ka_Ks_calculator

This can be done by typing

$sudo apt

install blast2

$sudo apt
get install clustallw

Ka_Ks_calculator is found on the following website (
It is possible to
install this program onto a Linux
machine, however there are difficulties in
compiling the program. A suggestions, while more laborious but effective, is to install the
program on a Mac or PC. This is done easily from the download. You will have to move over
any files to use this progra
m, but at least it will work this way.

You will also need to have the following Perl scripts:

this can be obtained from the following website:

provided here, but information can be found here:

provided here

Once you get the inparanoid package extract the files to a folder. I find it easier to


the initial
analyses in this folder.
hen copy the necessary files to another folder

further analysis.

In the extracted Inparanoid folder find the folder inparanoid_4.1/

The script should be in here
, along with numerous other files. Now would be a
good time to read the READM
E file
This file contains some basic information on the method
and useful parameters. There are a number of parameters you can change in the
file. We will leave the default parameters, but it might be useful to explore some of these
es later. For instance you could use an ‘outgroup’ species to help define your orthologous
sequences. You could also use different protein substitution matrices, which might be
dependent on how closely related your species that you are comparing are.

y the two files you want to perform the orthology analysis
in to this folder
. This initial
analysis is initially done on protein sequences. The files you will need are:



Type the following at the command prompt:

$perl in Ssaxicola_pep.fas Sgoodei_pep.fas

You should not see any errors if the analysis is done correctly. A number of files will be
produced. Let

s explore them.


This is an html file that contains the

results of the inparanoid clustering. All significant ortholog
groups are reported. For each set of orthologs the sequences are given and the
confidence for
each group is

given. How many different ortholog groups are found for this data set?

The same
information is given in a text file: output.Ssaxicola_pep.fas_Sgoodei_pep.fas.html

An additional file that will of
se to us is: table.
Ssaxicola_pep.fas_Sgoodei_pep.fas.html. This
information is also given in sql

format (sqltable.Ssaxicola_pep.fas_Sgoodei_pep.fas.html) if you
are verse

in sql. Open table

Ssaxicola_pep.fas_Sgoodei_pep.fas.html. Each lin
contains the ortollog group (1
st column), the score, the sequence identifier from file A,


nce identifier from file B. The last two are followed by confidence scores. We need to
extract the sequence identifiers for each set of sequences so we can align them for further
analyses. This is probably most easily done in a spreadsheet program.
y them into a file so
that it looks like this (tab delimited):





And save the file as IDs.txt. Please note that this is a very simple dataset. ‘Real world’ data
may contain ortholog groups that contain more than one sequen
ce from a particular species.
This is especially true if you happen to sequence genes from a gene family, have alternatively
spliced genes, or if you have incomplete genes (for example from transcriptomes) in your data
set. If you do get this you might w
ant to filter your orthologs so that there is only one sequence
from each species found.

Once you have the IDs.txt file copy it to another folder.

Also copy Ssaxicola_pep.fas,
Sgoodei_pep.fas Ssaxicola_cds.fas, and Sgoodei_cds.fas to that folder. What w
e are going to
be doing next is creating a larger file that contains all the protein sequences (protnr) and all the
coding sequence in another (nucnr). This can easily be done in a text editor or if you want to
impress someone with your Unix skills try:

cat Ssaxicola_pep.fas Sgoodei_pep.fas > protnr

$cat Ssaxicola_cds.fas Sgoodei_cds.fas > nucnr

In order for our alignment script to work we need to format these files for a BLAST search. This
is why I had you do this analysis in another folder,

to get crowded fast. To format
these for a BLAST search do the following:


i protnr p T

o T


i nucnr p F

o T

If everything goes well you should not have any errors. Now we can search the databases and
pull out sequences for each
ortholog pair.


You should now have 49 files starting with ‘contig’ and 49 files starting with
‘pal2nal’. You can
alter the script to change the output file names. The contig files contain the aligned orthologs
based on amino acid seq
uence and the pal2nal files contain aligned nucleotide sequences
based on the amino acid alignments.
The pal2nal_all.fasta contains all of the alignments in a
single FASTA file.

There is an additional program that can be used to make pairwise comparisons
for all of your
alignments, it

s called Ka_Ks_calculator (
). This
program is a little more difficult to get running on a Linux machine, but it does have a nice GUI
that runs in Windows or Mac. To use this we need
to convert either the pal2nal_all.fasta

each individual pal2nal file to what is known as AXT format. This is easily done on the
pal2nal_all.fasta file by typing the following on the command line:

v Ssax pal2nal_all.fasta > Ssax_Sgoo.axt

command removes all lines from the file pal2nal_all.fasta that contain ‘Ssax’ and copies
everything to a new file named ‘Ssax_Sgoo.axt’.




e ‘s/
|//g’ Ssax_Sgoo.axt

command is a perl one liner that deletes “>lcl|” from each instanc
e it occurs

. You could also use find and replace in a text editor

if you wanted to

i.bak command creates a backup of the original file, just in case.

The backup file is named

Now type:




e ‘s/Sg/
nSg/g’ Ssax_Sgoo.axt

This perl one liner takes each instance of Sg (the sequence identifier in this case) and replaces
it with a carriage return

and a Sg
. Basically putting a return between each pair of sequences. It
also puts one at the beginning of the

file that you will have to remove manually.
A backup file
.bac is also created.

The file Ssax_Sgoo.axt can now be opened in Ka_Ks_calculator.

Open the Ka_Ks_calculator
GUI and import your .axt file. Select a simple model for analysis
, let

s say YN (check the box),
press Run. You should get something like this.

The analysis preformed contains pairwise estimated of synonymous and non
substitutions for all ortholog

pairs. This serves a number of functions. One is for quality control.
If you do have othrologs and you do have good alignments your Ks values should be below 0.1
for all comparisons.
Values greater than this might indicate bad alignments or alignments

orthologous genes. This also tells us information about the ratio of non
synonymous to
synonymous substitutions (Ka/Ka). Most genes will have a values <1, indicating purifying
selection. However genes that have a value greater than 1 might be
subject to positive
Darwinian selection. You can export this table and open it in a spreadsheet program for further


Students will be asked to conduct the outlined analysis. Additional assessment could come in
the form of a lab repo
rt where students are asked to present results of the orhtology analysis
and the pairwise sequence comparisons. Visual examples of pairwise alignments (good and
bad) could be presented as well as a table/graph of Ka/Ks values.

Timeline of Module

The modul
e should be performed in a typical 2
3 hour lab session.

Discussion Topics

Along the way students should recognize that not all genes from each data set will form
significant orthologous pairs. Why is this? This particular data set is from two transcript
omes, it
is possible that not all genes ar
e present, or that orthologs ha
ve been sequenced in one data
set or the other.
What would students expect to see if whole genome data sets were utilized?
How would students deal with instances where multiple para
logs were identified? Students
should also explore the alignments that automatically produced by pal2nal. Are they adequate?
Is there another method to improve this (muscle, MAFFT)? Does this method still require some
manual checks along the way? Las
tly, students should try to interpret the patterns of selection.

Lecture Topics

Students should be introduced into the basic concepts of molecular evolution and genomics.
lab lecture topics should include the basic principles of how genes and genomes evolve
(mutation, recombination, natural selection, genome/gene duplication). Ideally there would also
be a previous introduction of transcriptome/genome assembly

and anno
tation and the basic
principles behind searching large sequence databases (e.g. BLAST).


Berglund AC, Sjolund E, Ostlund G and Sonnhammer ELL


InParanoid 6: eukaryotic
tholog clusters with inparalogs
Nucleic Acids Res. 36:D263


W. (1970).
Distinguishing hom
ologous from analogous proteins
Syst Zool


(2): 99


O'Brien Kevin P, Remm Maido and Sonnhammer Erik L.L

Inparanoid: A Comprehensive
atabase of Eukaryotic Orthologs




G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O and
Sonnhammer ELL
InParanoid 7: new algorithms and tools for eukaryotic ortholog

Nucleic Acids Res. 38:D196


, Storm

, and Sonnhammer

ELL (2001)

clustering of orthologs and in
paralogs fr
om pairwise species comparisons


Mol. Biol. 314:1041


, Torrents

, and Bork



PAL2NAL: robust conversion of protein sequence
alignments into the corresponding codon alignments.

Nucleic Acids


, W609

Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J: KaKs Calculator: Calculating Ka and Ks
through model selection and model averaging.
Genomics Proteomics Bioinformatics

2006 ,