GABOS=Get A Bit Of Sequence/GAFEP-Get - WEHI Bioinformatics



Keith Satterley, Bioinformatics Division, WEHI

Bioinformatics Seminar 13/11/07


Summary:

GABOS = Get A Bit Of Sequence.

GAFEP = Get A Few Exon Primers.



Functions and Facilities:



WEB interface.



Command Line Interface.


Data Management:


Genome data.


Result data.


Tools Used:


Perl


HTML


PHP


Javascript


Availability.


Future Work.



GABOS version 1 is at

http://unix28.alpha.wehi.edu.au/bioinformatics/gabos


Web page version 1 limitations:

Exons, DNA, Transcripts available.

Genomes are a hard-coded list of latest version data only.

Annotation File is a hard-coded list covering all genomes.

Chromosome selection is a list of the common chromosome filenames.


Data Files Availability


All data has been downloaded from UCSC’s download site. It is described at:


http://hgdownload.cse.ucsc.edu/downloads.html

and can be ftp downloaded from:


ftp://hgdownload.cse.ucsc.edu/goldenPath/


Genome data is stored on the WEHI Disk Server accessible from:


WEHI Unix computers


/home/users/lab0605/Bioinformatics/databases/genomes/UCSC


WEHI Windows computers


Map a network drive to:

\\unix33\bioinformatics


WEHI Macintoshes


Connect to Server at:


smb://unix33/Bioinformatics




Genomes at WEHI:


Jul 24 01:05 canFam -> canFam2
Jul 23 15:16 canFam1
Jul 23 15:16 canFam2
Jul 22 01:17 danRer -> danRer4
Jul 23 10:33 danRer3
Jul 23 15:20 danRer4
Nov 6 01:10 dm -> dm3
Nov 5 16:37 dm3
Jul 22 01:17 galGal -> galGal3
Jul 20 17:27 galGal2
Jul 23 10:11 galGal3
Jul 22 01:17 hg -> hg18
Jul 23 10:29 hg17
Jul 23 10:29 hg18
Aug 24 01:10 mm -> mm9
Jul 23 10:30 mm7
Aug 23 14:50 mm8
Aug 23 18:12 mm9
Jul 22 01:17 monDom -> monDom4
Jul 23 10:32 monDom4
Jul 22 01:17 panTro -> panTro2
Jul 23 10:32 panTro1
Jul 23 10:33 panTro2
Jul 22 01:17 rheMac -> rheMac2
Jul 25 02:32 rheMac2
Jul 22 01:17 rn -> rn4
Jul 23 10:33 rn3
Jul 23 10:33 rn4


More can be downloaded as requested.



Chromosome data Files:



Aug 23 14:09 chr9_random.fa
Aug 23 14:09 chrM.fa
Aug 23 14:09 chrUn_random.fa
Aug 23 14:14 chrX.fa
Aug 23 14:14 chrX_random.fa
Aug 23 14:14 chrY.fa
Aug 23 14:16 chrY_random.fa

Jul 23 16:11 chr9.fa
Jul 23 16:11 chrM.fa
Jul 23 16:13 chrNA_random.fa
Jul 23 16:14 chrUn_random.fa
Jul 23 16:14 md5sum.txt
Jul 23 16:14 README.txt
Jul 23 16:16 scaffoldNA_random.fa
Jul 23 16:16 scaffoldUn_random.fa

Jun 22 04:05 chr2L.fa
Jun 22 04:05 chr2LHet.fa
Jun 22 04:05 chr2R.fa
Jun 22 04:05 chr2RHet.fa
Jun 22 04:05 chr3L.fa
Jun 22 04:05 chr3LHet.fa


Annotation Data Files:



Data Management:


Amount of data:

How many genomes are held locally?

Currently 10 = 96 GB.

19 Vertebrates available + 9 sequence only.

15 Insects, 5 Nematodes + 4 others available.

How many versions of each? mm7, mm8, mm9?

2 or 3 of each?

Chromosome data: 10-50 per genome.

Annotation data: 5-10 per genome version.

RefSeq, genscan, mgc, xenoRef, uniGene, refFlat, ESTs, mRNAs …

Up-to-date data!

A tool is currently being written to check UCSC nightly and to download, unpack and sort annotation files.



GABOS Sequence Retrieval Features


Specify Search Criteria as either:


Gene Name List, as in Annotation Files:

» NM_001037759, NM_145692, NM_027033, NM_013715 as in RefSeq.txt

» Sgk3, 4930418G15Rik, Cops5, Sulf1 as in RefFlat.txt


Chromosome Sequence Range specification.

Chr10: 13,500,000 - 14,550,000

This will select all genes in this region that are defined in the annotation file(s) specified.


Exons (incl. EST exons), Transcripts of Genes or straight DNA sequence can be retrieved.


Specify either strand or both strands.
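
The range-based selection above can be sketched in a few lines of Python. This is only an illustration, not the GABOS code (which is Perl). It assumes the standard UCSC refFlat.txt column order (geneName, name, chrom, strand, txStart, txEnd, ...), with 0-based starts, and it assumes that "in this region" means "overlaps the region". The gene names, accessions and coordinates in SAMPLE are made up.

```python
from typing import Iterable, List, NamedTuple


class Gene(NamedTuple):
    gene_name: str  # browser gene name (column 1 of refFlat.txt)
    refseq_id: str  # RefSeq accession (column 2)
    chrom: str
    strand: str
    tx_start: int   # 0-based start, as stored in UCSC tables
    tx_end: int     # end (equals the 1-based inclusive end)


def parse_refflat(lines: Iterable[str]) -> List[Gene]:
    """Read tab-separated refFlat rows, keeping the first six columns."""
    genes = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        genes.append(Gene(fields[0], fields[1], fields[2], fields[3],
                          int(fields[4]), int(fields[5])))
    return genes


def genes_in_range(genes: List[Gene], chrom: str, start: int, end: int) -> List[Gene]:
    """Genes on `chrom` overlapping the 1-based inclusive range start..end."""
    return [g for g in genes
            if g.chrom == chrom and g.tx_start < end and g.tx_end >= start]


# Made-up rows, for illustration only.
SAMPLE = [
    "GeneA\tNM_EXAMPLE1\tchr10\t+\t13600000\t13650000\t13600100\t13649000\t1\t13600000,\t13650000,\n",
    "GeneB\tNM_EXAMPLE2\tchr10\t-\t15000000\t15010000\t15000100\t15009000\t1\t15000000,\t15010000,\n",
    "GeneC\tNM_EXAMPLE3\tchr2\t+\t13600000\t13650000\t13600100\t13649000\t1\t13600000,\t13650000,\n",
]

hits = genes_in_range(parse_refflat(SAMPLE), "chr10", 13_500_000, 14_550_000)
print([g.gene_name for g in hits])
```

Searching by gene name instead would filter on `gene_name` or `refseq_id` rather than on coordinates.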



Extra Sequence Parameters


Range of bases in data object (e.g. bps in an Exon)

1-e = all, base 1 to the end base (the default)

1-10 = bases 1 to 10

10-e = base 10 to end base in object.

Range of objects requested (e.g. a range of Exons)

1-e = all exons (the default)

1-3 = exons 1 to 3.

1 = first exon only

e = last exon only

Possible Extensions

(e-3)-e = last three objects (or bases)
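
A minimal sketch of how range specifications such as `1-e`, `1-10`, `10-e`, `1` and `e` could be resolved, written in Python for illustration (GABOS itself is Perl). The function names are hypothetical. The `(e-3)-e` form, listed only as a possible extension, is not handled.

```python
from typing import Sequence, Tuple


def resolve_range(spec: str, length: int) -> Tuple[int, int]:
    """Turn a spec like '1-e' into a 1-based inclusive (first, last) pair."""

    def position(token: str) -> int:
        token = token.strip().lower()
        return length if token == "e" else int(token)

    parts = spec.split("-")
    if len(parts) == 1:          # '1' or 'e': a single object or base
        first = last = position(parts[0])
    elif len(parts) == 2:        # 'first-last'
        first, last = position(parts[0]), position(parts[1])
    else:
        raise ValueError(f"unsupported range spec: {spec!r}")
    if not 1 <= first <= last <= length:
        raise ValueError(f"range {spec!r} does not fit length {length}")
    return first, last


def select(items: Sequence, spec: str = "1-e"):
    """Slice a string of bases or a list of exons by a range spec."""
    first, last = resolve_range(spec, len(items))
    return items[first - 1:last]


exons = ["exon1", "exon2", "exon3", "exon4"]
print(select(exons, "1-3"))          # exons 1 to 3
print(select(exons, "e"))            # last exon only
print(select("ACGTACGTACGT", "10-e"))  # base 10 to the end
```

The same function serves both the base range and the object range, since both use the same notation.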



GABOS Extras:


Specify the line length of the FASTA output file.


Output Sequence Lines ONLY.


Output Fasta Description Lines ONLY.


Concatenate ALL Sequences.


Concatenate ONLY Sequence from a DNA object (each gene’s exons concatenated, for example).


String of characters to be inserted BEFORE each DNA object.


String of characters to be inserted AFTER each DNA object.


Specify flanking bases.


Show co-ordinates relative to: Chromosome, Exon, Transcript


Uses either RefSeq or Browser gene names in refFlat.txt
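
To illustrate the output options above (line length, strings inserted before and after each DNA object), here is a small Python sketch. It is not the GABOS implementation. In particular, it assumes that the inserted strings are wrapped together with the sequence, which the slides do not state. The function and parameter names are hypothetical.

```python
def format_fasta(description: str, sequence: str, line_length: int = 60,
                 before: str = "", after: str = "") -> str:
    """Return one FASTA record with the sequence wrapped at `line_length`."""
    if line_length < 1:
        raise ValueError("line_length must be at least 1")
    body = f"{before}{sequence}{after}"
    # Cut the body into fixed-width lines; the last line may be shorter.
    lines = [body[i:i + line_length] for i in range(0, len(body), line_length)]
    return "\n".join([f">{description}"] + lines) + "\n"


print(format_fasta("example_exon_1", "ACGTACGTAC", line_length=4), end="")
```

Printing only the description lines, or only the sequence lines, would amount to dropping one of the two parts of the joined list.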



GAFEP (Get a Few Exon Primers)


Use output of GABOS to find primers around each exon.


GABOS Command Line Version (CLI).


Same code. Program detects environment and adjusts accordingly.

CLI use of GABOS caters for programmatic use of the tool as part of other tasks.

E.g. collecting 5000 bases before a transcript and 5000 into the transcript, to be used for promoter/regulation searching for thousands of genes.

CLI e.g.

gabos -afile refFlat.txt -genome mm9 -seqrange 4,482,560-4,483,185 -chr 1 -pre 420 -post 420 -fastaonly >my_results.fa


Options can be in any order. Output can be redirected to a file as shown.

A file of gene names could be used as input instead of a chromosome sequence range.


gabos -help lists all options.
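
As a sketch of the programmatic use mentioned above, a script could assemble the command line for each region and redirect the output to a file. The Python below only uses options that appear in the slide example (`-afile`, `-genome`, `-seqrange`, `-chr`, `-pre`, `-post`). It assumes a `gabos` executable on the PATH, and the helper names are hypothetical.

```python
import shlex
import subprocess
from typing import List


def gabos_command(genome: str, chrom: str, start: int, end: int,
                  afile: str = "refFlat.txt", pre: int = 0, post: int = 0) -> List[str]:
    """Build the argument list for one chromosome range, as in the slide example."""
    return ["gabos", "-afile", afile, "-genome", genome,
            "-seqrange", f"{start:,}-{end:,}",   # e.g. 4,482,560-4,483,185
            "-chr", str(chrom), "-pre", str(pre), "-post", str(post)]


def run_to_file(command: List[str], out_path: str) -> None:
    """Run the command, writing standard output to `out_path` (needs gabos installed)."""
    with open(out_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


cmd = gabos_command("mm9", "1", 4482560, 4483185, pre=420, post=420)
print(shlex.join(cmd))
```

Calling `run_to_file(cmd, "my_results.fa")` in a loop over many regions would give one result file per region.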



CLI additional abilities:

Gene lists read from a file or piped in.

Debugging options available.

Specification of alternate locations for the following (enables use of the program at other sites without modification):

Annotation files.

Genome data files.

Checks if data files are the latest version and updates them if not (to be replaced with an upgraded procedure).


GABOS Command Line options:


-addend:s, -addstart:s, -dna:s, -basedir:s, -genome:s,
-afile:s, -adir:s, -gdir:s, -check!,
-name:s, -namep:s, -namef:s, -chr:s, -seqrange:s, -strand:s,
-dataobject:s, -objectrange:s, -baserange,
-seqonly, -fastaonly, -linelength:i, -relative:s, -pre:i, -post:i,
-v!,
-debug1:i, -debug2:i, -debug3:i, -debug4:i, -debug5:i, -debug6:i, -debugall:i,
-h|help|?, -version
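
The option list looks like Perl Getopt::Long specifications, although the slides do not say so. Under that assumption, `:s` marks an optional string value, `:i` an optional integer value, `!` a negatable flag, and `|` separates aliases. The Python sketch below decodes one entry under that assumption. It is an illustration only, not part of GABOS, and the function name is hypothetical.

```python
from typing import Dict, List, Union


def describe_option(spec: str) -> Dict[str, Union[str, List[str]]]:
    """Decode one option spec such as '-linelength:i,' or '-h|help|?'."""
    spec = spec.strip().rstrip(",").lstrip("-")
    if spec.endswith("!"):
        names, kind = spec[:-1], "negatable flag"
    elif ":" in spec:
        names, type_code = spec.split(":", 1)
        kind = {"s": "optional string value",
                "i": "optional integer value"}.get(type_code, "unknown value type")
    else:
        names, kind = spec, "flag"
    primary, *aliases = names.split("|")
    return {"name": primary, "aliases": aliases, "kind": kind}


for entry in ["-genome:s", "-linelength:i,", "-check!", "-h|help|?,"]:
    print(describe_option(entry))
```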



All GAFEP programs can also be run at the command line. In particular:

Combine_overlapping_exons,
Create_primers1,
Create_primers2,
Makep3i,
P3out2tab.



Demo of GABOS version 2.

http://unix28.alpha.wehi.edu.au/bioinformatics/gabos/testing_index.php


Improvements:


Automatically reads the genomes available.

Automatically shows chromosome data for the genome selected.

Automatically shows Annotation data files for the genome selected.

Includes ability to read EST data files.

Uses alternate gene name in refFlat.txt.

Faster processing of large data files using/making presorted versions.
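
Reading the available genomes from disk, rather than from a hard-coded list, can be sketched as a directory scan. The Python below assumes one directory per genome version, plus symbolic links such as `mm -> mm9` pointing at the latest version, as in the listing shown earlier. It also assumes that the `.fa` files sit directly inside each genome directory, which the slides do not confirm. The function names are hypothetical.

```python
import os
from typing import Dict, List, Tuple


def list_genomes(root: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (version directories, alias -> version) found under `root`."""
    versions: List[str] = []
    aliases: Dict[str, str] = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.islink(path):
            # A link such as 'mm' names the latest version, e.g. 'mm9'.
            aliases[entry] = os.path.basename(os.readlink(path))
        elif os.path.isdir(path):
            versions.append(entry)
    return versions, aliases


def list_chromosome_files(root: str, genome: str) -> List[str]:
    """Return the FASTA files available for one genome (or alias)."""
    folder = os.path.join(root, genome)
    return sorted(name for name in os.listdir(folder) if name.endswith(".fa"))
```

A web form could then fill its genome and chromosome menus from these two calls each time the page is loaded.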



GAFEP = Get A Few Exon Primers.

This is a suite of programs.

1. Combines overlapping exons into one “CExon”.

2. Displays Primer3 options and collects choices.

3. Creates input files for Primer3 in the required format.

4. Runs Primer3, displays output on the web page and reformats the output suitable for pasting into Excel.

5. The same code runs from the web interface or from a Command Line Interface.
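
Step 3 above can be illustrated with a minimal Primer3 input record. Primer3 reads "TAG=VALUE" lines and ends each record with a line containing only "=". The tag names below are those of current Primer3 releases and may differ from the version used for GAFEP. The Python helper is a hypothetical sketch, not the Makep3i program.

```python
from typing import Tuple


def primer3_record(seq_id: str, template: str, target_start: int,
                   target_length: int, product_size_range: Tuple[int, int]) -> str:
    """Build one Primer3 input record for a target inside `template`."""
    low, high = product_size_range
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={template}",
        # The target is given as start,length; whether the start is 0-based
        # or 1-based depends on the PRIMER_FIRST_BASE_INDEX setting.
        f"SEQUENCE_TARGET={target_start},{target_length}",
        f"PRIMER_PRODUCT_SIZE_RANGE={low}-{high}",
        "=",  # record terminator
    ]
    return "\n".join(lines) + "\n"


print(primer3_record("example_cexon_1", "ACGT" * 10, 11, 20, (30, 40)), end="")
```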


Combining exons to reduce the number of primers needed.

[Figure: seven exons, numbered 1 to 7, with the labels CExon, CExon and Exon.]
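
Combining overlapping exons is a standard interval merge. The Python sketch below is an illustration of that technique, not the Combine_overlapping_exons program. It assumes that exons are (start, end) pairs with inclusive coordinates on one chromosome and strand, and that exons sharing at least one base are combined.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def combine_overlapping_exons(exons: List[Interval]) -> List[Interval]:
    """Merge overlapping (start, end) exons into combined exons ("CExons")."""
    merged: List[Interval] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


# Made-up coordinates: three overlapping exons and one separate exon.
print(combine_overlapping_exons([(100, 200), (150, 250), (300, 400), (240, 260)]))
```

Each merged interval then needs only one primer pair (or one set, if it is long), instead of one per original exon.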

Short Exons

[Figure: a 120 bp exon, with primers placed in the flanks.]

Pad out short exons to 300 bp: 90 + 120 + 90 = 300.

Add a 70 bp cushion: 70 + 300 + 70 = 440.

Add 200 bp flanks: 200 + 440 + 200 = 840.
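
The arithmetic on the short-exon slide can be written out as a small Python function. The numbers (300 bp minimum, 70 bp cushion, 200 bp flanks) come from the slide. Splitting the padding evenly between the two sides is an assumption, and the function name is hypothetical.

```python
from typing import Dict


def short_exon_layout(exon_length: int, min_target: int = 300,
                      cushion: int = 70, flank: int = 200) -> Dict[str, int]:
    """Sizes of the padded target, the cushioned region and the full region."""
    pad_total = max(0, min_target - exon_length)
    pad_left = pad_total // 2
    pad_right = pad_total - pad_left
    target = exon_length + pad_total          # exon padded out to min_target
    with_cushion = target + 2 * cushion       # cushion on each side
    with_flanks = with_cushion + 2 * flank    # flanks where primers are placed
    return {"pad_left": pad_left, "pad_right": pad_right, "target": target,
            "with_cushion": with_cushion, "with_flanks": with_flanks}


print(short_exon_layout(120))
```

For the 120 bp exon on the slide this gives 90 bp of padding per side, then 300, 440 and 840 bp.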

Long Exons

[Figure: a 900 bp exon, with primers placed in the flanks.]

Split: 485 bp + 485 bp, with a 70 bp overlap.

Add a 70 bp cushion: 70 + 485 + 70 = 625.

Add 200 bp flanks: 200 + 625 + 200 = 1025.
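
The long-exon numbers follow from splitting the exon into equal pieces that overlap by 70 bp: two pieces of length p cover 2p - 70 bases, so p = (900 + 70) / 2 = 485. The Python sketch below reproduces that arithmetic. The slide does not say when an exon counts as long or how the number of pieces is chosen, so `pieces` is left as a parameter, and the function names are hypothetical.

```python
from typing import List, Tuple


def split_long_exon(exon_length: int, pieces: int = 2,
                    overlap: int = 70) -> List[Tuple[int, int]]:
    """Split an exon into overlapping pieces; 1-based inclusive coordinates."""
    total = exon_length + (pieces - 1) * overlap
    piece_length = -(-total // pieces)        # ceiling division
    step = piece_length - overlap
    result = []
    for index in range(pieces):
        start = 1 + index * step
        end = min(start + piece_length - 1, exon_length)
        result.append((start, end))
    return result


def region_length(piece_length: int, cushion: int = 70, flank: int = 200) -> int:
    """Length of one piece plus a cushion and a flank on each side."""
    return piece_length + 2 * cushion + 2 * flank


print(split_long_exon(900))   # two 485 bp pieces sharing 70 bp
print(region_length(485))     # 485 + 140 + 400
```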



Demonstration of GAFEP


GAFEP Output


An example application:


Ben Kile’s lab is using GABOS/GAFEP to create primers to search for sequence variations caused by ENU-induced mutations in mice.


Random chemical mutagenesis in the mouse

N-ethyl-N-nitrosourea (ENU)

Alkylating agent

Point mutagen

Efficiently mutates mouse spermatogonial stem cells

Male mice treated with ENU produce offspring heterozygous for ENU-induced mutations at the rate of 1 mutation per 1.5 megabases


Phenotyping screen: measuring platelet number

[Figure: mutant offspring, blood test, platelet counts (axis: platelet count ×10^3/µL).]

Plt16 and Plt20 cause dominant thrombocytopenia.


Mapping strategy for dominant mutations

[Figure: breeding scheme. Labels: C57BL/6, Balb/c, Affected, Wild-type, m, 1st Outcross, F1 Generation, 2nd Outcross, F2 Generation, Affected, Unaffected.]


Mapping strategy for dominant mutations

1. Genome-wide scan with 80-100 microsatellites

20 affected and 20 unaffected animals

Result: mutation assigned to a chromosome

2. Fine mapping

200-1,000 informative meioses, genotyped with SSLPs at increasing density

Result: candidate interval refined to 1-3 Mb

Issues

Recombination cold spots

Polymorphism deserts

[Figure: SNP density map of mouse chromosome 1 (C57BL/6 v 129Sv)]


Candidate intervals

Chromosome 2: 20-21 Mb

Chromosome 11: 70-71 Mb

Heaven

Hell


Candidate gene sequencing

Prioritize candidates for sequencing on the basis of:


Known function

Homology to other genes of known function

Tissue expression pattern

Domain structure

Exhaustive literature searches …





Candidate gene sequencing

Robotic liquid handling

1. Automated PCR primer design

2. Genomic PCR

3. Direct amplicon sequencing

4. Capillary electrophoresis

5. Sequence analysis

In-well template clean-up



Tools used to develop GABOS/GAFEP


Perl programming language for all programs.


Web interface


HTML coding


PHP: inserted into HTML and processed by the webserver before the resulting HTML is sent to, and processed by, the client’s web browser.

JavaScript: processed by the client’s web browser (Mozilla Firefox or Safari for example).

WEHI Computing Layout

[Figure: network diagram with the following labels.]

Unix Server = unix28, Webserver = apache: PHP processed here, HTML produced here. GABOS/GAFEP on the unix28 disk.

Genome DATA on unix33 (nfs), downloaded from UCSC (ftp).

wan/lan

Client = Mac, Windows. Browser = Firefox, IE …: HTML processed here, JavaScript acts here in response to the user, display of GABOS/GAFEP here.



Web Interface Debugging tools


Firefox Error Console


Firebug add-on for Firefox



Future Work:


Short term:


Finalize GABOS version 2


Transcript, DNA working


Complete data download maintenance program


Automate sorting of annotation files and modify GABOS to be aware of sorted/non-sorted data and act accordingly.

Include ability to retrieve RNA data

Will run on any unix server, not just unix28.


Web Interface available on WEHI’s public server.


Source code will be made freely available.


Longer Term:


Retrieve data for utrs, others?


Provide web interface access to annotation files.


Remove need for BioPerl to be installed.

Acknowledgements:


Bioinformatics Division


Terry Speed & Gordon Smyth for the opportunity to pursue this project in an excellent environment.


All others in Bioinformatics for many and varied help.


WEHI ITS


Nick Tan, Jakub Szarlat for Unix help.


Dung Tran, Scott Wood for network help.


Tri Le and John Nguyen for MS Windows support.


Tony Kyne & others in ITS for many questions answered.


Molecular Medicine


Doug Hilton, Ben Kile for explaining their needs.


Users for their feedback.


Kylie Greig, Adrienne Hilton, Greg Hather, Carolyn de Graaf …