Gene ontology


Oct 4, 2013 (3 years and 8 months ago)







What is a wrapper?

There is no formal definition for this. I define it this way:

A short script that calls an existing program (executable), parses the result(s), and then saves the final result in a file.

The wrapper is more specific to Perl because other languages are more awkward/clumsy for this task.
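The three steps in that definition (call a program, parse its output, save the result) can be sketched as one small subroutine. This is only an illustration under stated assumptions: `wrap_program`, the output file name `wrapper_demo.out`, and the three-column tab-separated "hits" are hypothetical, and the Perl interpreter itself stands in for the external program so the sketch runs anywhere.

```perl
use strict;
use warnings;

# Run an external command, pass each output line to a parser callback,
# and save every line the parser keeps. Returns the number of lines saved.
sub wrap_program {
    my ($cmd, $parse, $outfile) = @_;    # $cmd: array ref of program + arguments
    open(my $pipe, '-|', @$cmd)  or die "Cannot run $cmd->[0]: $!";
    open(my $out,  '>', $outfile) or die "Cannot write $outfile: $!";
    my $kept = 0;
    while (my $line = <$pipe>) {
        $line =~ s/\r?\n\z//;            # strip the line ending
        my $result = $parse->($line);
        next unless defined $result;     # the parser rejected this line
        print {$out} "$result\n";
        $kept++;
    }
    close($pipe) or die "$cmd->[0] failed (status $?)";
    close($out)  or die "Cannot close $outfile: $!";
    return $kept;
}

# Stand-in for a real tool: print three made-up tab-separated hits.
my @cmd = ($^X, '-e', 'print "q1\ts1\t98.5\n", "q2\ts2\t71.0\n", "q3\ts3\t99.1\n"');

# Parser: keep query and subject of hits with at least 90 percent identity.
my $n = wrap_program(\@cmd, sub {
    my ($query, $subject, $identity) = split /\t/, shift;
    return $identity >= 90 ? "$query\t$subject" : undef;
}, 'wrapper_demo.out');
print "kept $n hits\n";
```

Passing the command as a list avoids shell quoting problems, and the parser is a callback so the same wrapper can be reused for different programs.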



Why do we need a wrapper?

We do not need to reinvent the wheel from scratch.

Wrapper

@blastparsed = `echo $oligo | /usr/local/biobin/blastall -p blastn -d $subjectf -F F -W 6 -g F -a 4 | /usr/local/biobin.dev/blast-parser.pl`;

}

How to run BLAST, parse the BLAST output, and then save the result in an array
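A backtick call like the one above returns whatever the pipeline printed, even when a program in it failed. A minimal sketch of checking the exit status through `$?` follows; `capture_lines` is a hypothetical helper, and the Perl interpreter stands in for BLAST so the example runs without any bioinformatics software installed.

```perl
use strict;
use warnings;

# Run a shell command with backticks and return its output lines,
# dying if the command could not be started or exited with a non-zero status.
sub capture_lines {
    my ($command) = @_;
    my @lines = `$command`;
    die "Could not start command: $command\n" if $? == -1;
    die sprintf("Exit status %d from: %s\n", $? >> 8, $command) if $? != 0;
    chomp @lines;
    return @lines;
}

# Stand-in command: the Perl interpreter prints three lines.
my @hits = capture_lines(qq{"$^X" -le "print for 1..3"});
print scalar(@hits), " lines captured\n";

# A failing command is reported instead of being silently ignored.
my $ok = eval { capture_lines(qq{"$^X" -e "exit 3"}); 1 };
print "failure detected: $@" unless $ok;
```

Note that for a shell pipeline the status reported is normally that of the last command in the pipe, so a failure earlier in the pipeline may still go unnoticed.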


#!c:/Perl/perl.exe -w

use strict;

my (@line, @parsed, $temp);

my $infile = shift;
my $output = shift;
my $usage  = "Usage:\n$0 <input_file> <output_file>\n";

unless ($infile && $output) {
    print $usage;
    exit;
}

# Open the output file for writing and the input file for reading.
open(OUT, ">", $output) || die "Can not open output file -- $output\n";
open(IN,  "<", $infile) || die "Can not open input file -- $infile\n";



(…. Continued )







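# $oligo (the temporary sequence file) and $opt_t (apparently the prefix of
# the mfold output file) are used below but are not defined on these slides;
# they must be declared earlier in the script.
while (<IN>) {
    s/\r\n?/\n/;    # normalize DOS/Mac line endings
    chomp;
    $temp = $_;
    @line = split /\t/, $_;    # column 0: name, column 1: sequence

    unlink $oligo;    # remove the previous temporary file
    open(TMP, ">", $oligo) || die "Can't open tmp file";
    print TMP ">$line[0]\n$line[1]\n";    # write one FASTA record
    close(TMP);

    qx(mfold SEQ=$oligo NA=DNA T=43 NA_CONC=0.6 W=2 MAX=30);
    $temp = $opt_t . "." . "out";
    @parsed = qx(perl parse_mfold_result.pl -i $temp);
    print OUT "@parsed";
}
close(IN);
close(OUT);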
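The loop above reuses a single temporary file name for every record. One alternative, sketched here with the core module File::Temp, is to let Perl create a uniquely named temporary FASTA file for each record. The helper `write_temp_fasta`, the `oligo` file-name prefix, and the sample record are assumptions for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one "name<TAB>sequence" record to a fresh temporary FASTA file and
# return the file name. UNLINK => 1 removes the file when the program exits.
sub write_temp_fasta {
    my ($record) = @_;
    my ($name, $seq) = split /\t/, $record;
    die "Malformed record: $record\n" unless defined $seq && length $seq;
    my ($fh, $filename) = tempfile('oligoXXXXXX', SUFFIX => '.seq',
                                   TMPDIR => 1, UNLINK => 1);
    print {$fh} ">$name\n$seq\n";
    close($fh) or die "Cannot close $filename: $!";
    return $filename;
}

# Made-up example record.
my $fasta = write_temp_fasta("oligo_1\tACGTACGTAGCTAGCTAGGA");
print "wrote $fasta\n";
```

The returned file name could then be handed to the external program in place of a fixed name.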




With high-throughput sequencing technologies (e.g. Solexa, 454), we can now produce a few terabytes of sequence data per day in a single lab.

The exponentially increasing amount of genomic sequence from various species needs to be annotated.

Bioinformatics solutions are increasingly required to develop automatic annotation techniques to support and complement the manual curation process.

The use of Perl for gene annotation

The generic structure of an automatic genome annotation pipeline and delivery system

(Cited from Haili Ping)

Automation of gene and genome annotation pipelines




Primary goal: to deliver highly accurate and reliable gene and genome annotations using the widest range of evidence from the existing literature and databases.

Essence: pipelines should contain suites of bioinformatics software tools that can interact with multiple databases and integrate the various pieces of related information for a given gene or genome.

Trend: consensus-based approaches that combine the results of gene predictors and similarity search methods are used.
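A consensus-based approach can be illustrated with simple per-base voting: a position is kept only if enough independent sources agree on it. The sketch below is a toy under stated assumptions, not the method of any particular pipeline; `consensus_regions` and the three sets of coordinates are made up for illustration.

```perl
use strict;
use warnings;

# Per-base voting: return the regions covered by at least $min_votes sources.
# Each source is an array ref of [start, end] intervals (inclusive coordinates).
sub consensus_regions {
    my ($min_votes, @sources) = @_;
    my %votes;
    for my $source (@sources) {
        my %covered;                       # count each source once per base
        for my $interval (@$source) {
            my ($start, $end) = @$interval;
            $covered{$_} = 1 for $start .. $end;
        }
        $votes{$_}++ for keys %covered;
    }
    my @regions;
    for my $pos (sort { $a <=> $b } grep { $votes{$_} >= $min_votes } keys %votes) {
        if (@regions && $regions[-1][1] == $pos - 1) {
            $regions[-1][1] = $pos;        # extend the current region
        } else {
            push @regions, [$pos, $pos];   # start a new region
        }
    }
    return @regions;
}

# Two hypothetical gene predictors and one hypothetical similarity hit.
my @predictor_a = ([100, 200]);
my @predictor_b = ([150, 250]);
my @similarity  = ([180, 300]);

my @two_of_three = consensus_regions(2, \@predictor_a, \@predictor_b, \@similarity);
print join(', ', map { "$_->[0]-$_->[1]" } @two_of_three), "\n";   # prints "150-250"
```

Raising the threshold to 3 keeps only the region supported by all three sources (180-200 in this example).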

Automated annotation pipelines

EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/

NCBI Human Genome Browser: http://proxy.library.uiuc.edu:3367/genome/guide/human/

The Oak Ridge National Laboratory Genome Channel: http://compbio.ornl.gov/channel/

Celera Discovery System: http://cds.celera.com/

Incyte Genomics - Genomics Knowledge Platform: http://www.incyte.com/incyte_science/technology/gkp/

Paracel GeneMatcher2 System: http://www.paracel.com/products/gm2.html

Human genome browsers

UCSC Human Genome Browser: http://genome.cse.ucsc.edu/cgi-bin/hgGateway/

Softberry Genome Explorer: http://www.softberry.com/berry.phtml?topic=genomexp

Viaken Enterprise Ensembl Solution: http://www.viaken.com/ns/solutions/ensembl.html

LabBook Inc. Genomic Explorer Suite: http://www.labbook.com/products/ExplorerSuite.asp

University of Tokyo Gene Resource Locator Browser: http://grl.gi.k.u-tokyo.ac.jp/

Other useful sites

The Institute for Genomic Research (TIGR): http://www.tigr.org/

Human Genome Central: http://www.ensembl.org/genome/central/ and http://proxy.library.uiuc.edu:3528/genome/central/


From raw sequence to gene predictions





Raw sequence pre-processing

  Masking known repeats and low-complexity sequences using RepeatMasker

  Identifying homology matches using BLAST

  Scans for other features, such as sequence tagged site (STS) markers and CpG islands
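As an illustration of one of the feature scans listed above, the sketch below scores a sequence window as a candidate CpG island. It assumes the commonly cited criteria (length of at least 200 bp, GC content of at least 50%, and an observed/expected CpG ratio of at least 0.6); the slides do not say which criteria the pipeline uses, and the subroutine names and test sequences are hypothetical.

```perl
use strict;
use warnings;

# Return (GC fraction, CpG observed/expected ratio) for a DNA sequence.
# Observed/expected = (number of CG dinucleotides * length) / (count of C * count of G).
sub cpg_stats {
    my ($seq) = @_;
    $seq = uc $seq;
    my $len = length $seq;
    return (0, 0) unless $len;
    my $c   = ($seq =~ tr/C//);          # tr counts without modifying the string
    my $g   = ($seq =~ tr/G//);
    my $cpg = () = $seq =~ /CG/g;        # number of CG dinucleotides
    my $gc_fraction = ($c + $g) / $len;
    my $obs_exp = ($c && $g) ? ($cpg * $len) / ($c * $g) : 0;
    return ($gc_fraction, $obs_exp);
}

# Apply the assumed thresholds to one window.
sub looks_like_cpg_island {
    my ($seq) = @_;
    my ($gc, $oe) = cpg_stats($seq);
    return length($seq) >= 200 && $gc >= 0.5 && $oe >= 0.6;
}

# Artificial test windows: pure CG repeats versus pure AT repeats.
print "CG repeat: ", (looks_like_cpg_island("CG" x 100) ? "candidate" : "no"), "\n";
print "AT repeat: ", (looks_like_cpg_island("AT" x 100) ? "candidate" : "no"), "\n";
```

A real scan would slide such a window along the sequence and merge neighbouring windows that pass.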




Gene prediction

  Predictions based on protein matches

  Predictions based on DNA sequence

  Ab initio gene prediction programs
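The simplest possible prediction from DNA sequence alone is an open reading frame (ORF) scan. The sketch below is a toy that only looks at the forward strand and ignores introns, so it is far cruder than real ab initio gene predictors; `find_orfs` and the test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Find forward-strand ORFs: ATG, then whole codons, up to the first in-frame stop.
# Returns [start, sequence] pairs with 0-based start positions.
sub find_orfs {
    my ($seq, $min_len) = @_;
    $seq = uc $seq;
    my @orfs;
    # The lookahead allows overlapping ORFs to be reported; the lazy
    # quantifier stops at the first in-frame stop codon.
    while ($seq =~ /(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))/g) {
        push @orfs, [$-[1], $1] if length($1) >= $min_len;
    }
    return @orfs;
}

# Made-up sequence with one ORF: ATG AAA TTT TAG starting at position 2.
my @orfs = find_orfs("CCATGAAATTTTAGGG", 9);
print "$_->[0]\t$_->[1]\n" for @orfs;
```

Scanning the reverse strand would require running the same search on the reverse complement.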

A simplified schematic of algorithmic gene prediction

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.


Gene Function Characterization

Mapping to known genes

  RefSeq and SWISS-PROT

  Human Genome Organization (HUGO) (NCBI, UCSC and Ensembl)

Protein domain annotation

  Pfam, PRINTS, PROSITE, ProDom, BLOCKS and SMART.

  InterPro project: creating a unique characterization for a given protein family, domain or functional site. Domains of the protein sequences can then be identified using this signature method. The use of InterPro provides the least redundant and most extensive annotation currently available.

Gene ontology

  The Gene Ontology (GO) project aims to define common terms that specify molecular function, biological process and cellular location.
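One way to hold GO annotations in a Perl script is a nested hash keyed by gene and then by aspect. The sketch below assumes a simplified three-column input (gene, GO identifier, one-letter aspect code), which is a hypothetical format chosen for illustration; `geneA` is a made-up gene, and the three identifiers are the root terms of the three GO aspects. GO itself calls the third aspect "cellular component".

```perl
use strict;
use warnings;

# One-letter aspect codes as used in GO annotation files.
my %ASPECT_NAME = (
    F => 'molecular function',
    P => 'biological process',
    C => 'cellular component',
);

# Group "gene<TAB>GO id<TAB>aspect" lines into
# $annotation{gene}{aspect name} = [GO ids].
sub group_go_annotations {
    my (@lines) = @_;
    my %annotation;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my ($gene, $go_id, $aspect) = split /\t/, $copy;
        next unless defined $aspect && exists $ASPECT_NAME{$aspect};
        next unless $go_id =~ /^GO:\d{7}$/;    # GO identifiers have seven digits
        push @{ $annotation{$gene}{ $ASPECT_NAME{$aspect} } }, $go_id;
    }
    return %annotation;
}

# geneA is hypothetical; the IDs are the root terms of the three aspects.
my %go = group_go_annotations(
    "geneA\tGO:0003674\tF",
    "geneA\tGO:0008150\tP",
    "geneA\tGO:0005575\tC",
);
for my $aspect (sort keys %{ $go{geneA} }) {
    print "$aspect: @{ $go{geneA}{$aspect} }\n";
}
```

Because every annotation source uses the same controlled terms, annotations grouped this way can be compared directly across genes and species.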




Future opportunities


Comparative genomics

  As more genomes are sequenced and become publicly available in the next few years, comparative genomics will become one of the greatest areas of development.

Cross-species analysis: human-mouse

  Protein-coding genes are likely to be highly conserved between closely related species (e.g. mouse and human), and other regions, such as RNA genes and regulatory regions, could also be elucidated.
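Conservation between two species is often summarized as percent identity over an alignment. The sketch below assumes the two sequences have already been aligned by some other tool and use `-` for gaps; `percent_identity` and the two short sequences are made up for illustration.

```perl
use strict;
use warnings;

# Percent identity over the alignment columns where neither sequence has a
# gap ('-'). Both strings must have the same aligned length.
sub percent_identity {
    my ($aln_a, $aln_b) = @_;
    die "Aligned sequences differ in length\n"
        unless length($aln_a) == length($aln_b);
    my ($compared, $identical) = (0, 0);
    for my $i (0 .. length($aln_a) - 1) {
        my $x = uc(substr($aln_a, $i, 1));
        my $y = uc(substr($aln_b, $i, 1));
        next if $x eq '-' || $y eq '-';    # skip gap columns
        $compared++;
        $identical++ if $x eq $y;
    }
    return $compared ? 100 * $identical / $compared : 0;
}

# Made-up aligned fragments: 6 identical bases out of 7 ungapped columns.
printf "%.1f%% identity\n", percent_identity("ATGCCA-T", "ATGACAGT");   # prints "85.7% identity"
```

Regions with unusually high identity outside known coding genes are the kind of candidate that cross-species comparison can highlight.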



This creates a need for:

  the development of bioinformatics tools

  the integration of such tools with the current automated approaches

  the design of genome browsers and websites that can intelligently display and annotate comparative results
