Bioinformatics techniques COD+EK Nov07


2 Oct 2013


Bioinformatics tools and techniques

Into the heart of darkness

Elaine Kenny

Colm O’Dushlaine


15/11/07


Summary


Simple overviews of some of the tools and methods used by EK and CO’D


TK notebook


get_hapmap_snps.pl: retrieve HapMap (HM) genotype information for a list of SNPs


GeneViewer.pl & cross_ref.pl: visualise e.g. SNPs in the context of other genomic landmarks. Score SNPs depending on how many of these landmarks they overlap with


ld_expander.pl: find SNPs in LD with SNPs of interest, based on user-specified r² and “LD window” (distance between SNPs)


STATA


VIM: command line text editor


Lab website


TK notebook


Application for saving notes, to-do lists, daily logs, and any other kind of textual information in a place where you can find it all again, and where related information is easily found


Easy to edit and rapidly searchable


DEMO


editing


DEMO


search



get_hapmap_snps.pl


Simple script to read in a 1-column list of SNPs and retrieve HapMap genotypes


Can select population and strand


DEMO


Retrieved data can be loaded into HaploView


DEMO


cross_ref_scored.pl


Score SNPs based on how many putatively functional regions they overlap with:


On a per gene / chromosome basis


Gene basis:


Type: perl cross_ref_scored.pl file_A file_B file_C ...

where

file_A - 2-column file of SNPs (format = id, location)

file_B - 3-column file of EXONS (format = id/name, start, stop)

file_C ... - whatever you want (format = id/name, start, stop), i.e. other regions like CpGs, TFBS, clusters. Any order.
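The scoring idea above can be sketched as follows. This is an illustrative Python sketch, not the authors' Perl script: the function name `score_snps` and the demo data are hypothetical, and it assumes that a SNP overlaps a region when start <= location <= stop and that each region file contributes at most one point to the score.

```python
def score_snps(snps, region_sets):
    """Count, for each SNP, how many region sets contain an overlapping region.

    snps: list of (snp_id, location), as in the 2-column SNP file
    region_sets: dict mapping a label (e.g. "exons") to a list of
                 (name, start, stop) tuples, as in the 3-column region files
    Returns {snp_id: (score, [labels of overlapped region sets])}.
    """
    scores = {}
    for snp_id, loc in snps:
        # Assumption: inclusive coordinates, one point per overlapped region set
        hits = [label for label, regions in region_sets.items()
                if any(start <= loc <= stop for _name, start, stop in regions)]
        scores[snp_id] = (len(hits), hits)
    return scores


# Hypothetical demo data
snps = [("rs1", 150), ("rs2", 900), ("rs3", 420)]
region_sets = {
    "exons": [("exon1", 100, 200), ("exon2", 400, 500)],
    "CpG": [("cpg1", 120, 180)],
}
scores = score_snps(snps, region_sets)
for snp_id, (score, hits) in scores.items():
    print(snp_id, score, ",".join(hits) or "-")
```

The linear scan over every region is fine for a single gene; for whole chromosomes, sorted intervals or an interval tree would be the usual choice.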









cross_ref_scored.pl example output:

Can then be merged with HapMap / Perlegen to retrieve MAF data for SNPs


Merge cross_ref_scored data with HapMap/Perlegen data using merge_per_hap.pl


Type:


perl merge_per_hap.pl perlegen.txt hapmap.txt overlapped_region_scored.txt


Where:

hapmap.txt = 3-column file (format: rsid, ref_allele, ref_allele_freq),

perlegen.txt = 3-column file (format: rsid, ref_allele, ref_allele_freq)
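A merge of this kind is essentially a join on rsid. The Python sketch below illustrates the idea only and is not merge_per_hap.pl itself. It assumes tab-delimited files without headers, and it assumes a two-column layout (rsid, score) for the scored file, which the slides do not specify. The helpers `read_table` and `maf` are hypothetical names; `maf` takes the minor allele frequency of a biallelic SNP as the smaller of the reference allele frequency and its complement.

```python
import csv
import io

FREQ_COLUMNS = ["rsid", "ref_allele", "ref_allele_freq"]


def read_table(handle, columns):
    """Read a headerless tab-delimited table into {rsid: row_dict}."""
    reader = csv.DictReader(handle, fieldnames=columns, delimiter="\t")
    return {row[columns[0]]: row for row in reader}


def maf(ref_allele_freq):
    """Minor allele frequency of a biallelic SNP, or None if not available."""
    if ref_allele_freq is None:
        return None
    freq = float(ref_allele_freq)
    return min(freq, 1.0 - freq)


# Inline stand-ins for perlegen.txt, hapmap.txt and the scored file
perlegen = read_table(io.StringIO("rs1\tA\t0.12\nrs3\tG\t0.40\n"), FREQ_COLUMNS)
hapmap = read_table(io.StringIO("rs1\tA\t0.10\nrs2\tC\t0.33\n"), FREQ_COLUMNS)
scored = read_table(io.StringIO("rs1\t2\nrs2\t0\nrs3\t1\n"), ["rsid", "score"])

merged = []
for rsid, row in scored.items():
    merged.append({
        "rsid": rsid,
        "score": int(row["score"]),
        # SNPs missing from a panel get None rather than being dropped
        "hapmap_maf": maf(hapmap.get(rsid, {}).get("ref_allele_freq")),
        "perlegen_maf": maf(perlegen.get(rsid, {}).get("ref_allele_freq")),
    })

for row in merged:
    print(row)
```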


cross_ref.pl applied to WGA data


cross_ref.pl: Scoring SNPs throughout genome


Data analysed on coding/non-coding basis


(coding)


perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.coding.txt 22
WTCCC_T2D_chr22_without_inferred.forCrossRef
WGA_databases/coding_non_synon_SNPs_UCSC.clean=3
WGA_databases/coding_synon_SNPs_UCSC.clean=2
WGA_databases/RefSeq_Genes_UCSC.byExon.uniqid=1
WGA_databases/Triplexes_may2006.bed=2
WGA_databases/splice_site_SNPs_UCSC.clean=2 >
Overlapped_regions_scored.WTCCC.chr22.coding.log &


(input-dependent, coding/non-coding dependent, arbitrary)


(noncoding)


perl cross_ref.pl Overlapped_regions_scored.WTCCC.chr22.NONcoding.txt 22
WTCCC_T2D_chr22_without_inferred.forCrossRef WGA_databases/TFBS.chr22=1
WGA_databases/CpG_islands_UCSC.uniqid=1
WGA_databases/Most_conserved_phastConsElements17way_UCSC.clean=1
WGA_databases/promoters_knowngene_hg18.txt=1 WGA_databases/sno_or_miRNA_UCSC.uniqid=1 >
Overlapped_regions_scored.WTCCC.chr22.NONcoding.log &
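The `file=N` arguments in the commands above suggest a weighted score. The Python sketch below assumes that the number after `=` is the score added when a SNP overlaps any region in that file; the slides do not state this explicitly, and the real cross_ref.pl may combine scores differently. The names `parse_weighted_args` and `weighted_score` are hypothetical.

```python
def parse_weighted_args(args):
    """Split 'path=weight' command-line arguments into (path, weight) pairs."""
    pairs = []
    for arg in args:
        # rpartition splits on the last '=', so paths containing '=' still work
        path, _sep, weight = arg.rpartition("=")
        if not path:
            raise ValueError(f"expected path=weight, got {arg!r}")
        pairs.append((path, int(weight)))
    return pairs


def weighted_score(location, weighted_regions):
    """Sum the weights of every region set with a region covering location.

    weighted_regions: list of (regions, weight), regions = [(start, stop), ...]
    """
    return sum(weight for regions, weight in weighted_regions
               if any(start <= location <= stop for start, stop in regions))


pairs = parse_weighted_args([
    "WGA_databases/TFBS.chr22=1",
    "WGA_databases/CpG_islands_UCSC.uniqid=1",
])
print(pairs)

# Hypothetical coordinates: a non-synonymous SNP site (weight 3) inside an exon (weight 1)
demo_regions = [([(100, 100)], 3), ([(50, 200)], 1)]
print(weighted_score(100, demo_regions), weighted_score(150, demo_regions))
```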


cross_ref.pl


cross_ref.pl output:






Load into STATA. If SNPs have e.g. association p-values, calculate adjusted p-value (R. Anney) as

-log10[P] + [cross_ref_score]
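As a worked illustration of the formula above, here is a minimal Python sketch (the function name `adjusted_score` is hypothetical; the slides do this step in STATA). A SNP with P = 0.0001 and a cross_ref score of 3 gets 4 + 3 = 7.

```python
import math


def adjusted_score(p_value, cross_ref_score):
    """Return -log10(P) + cross_ref_score for one SNP."""
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in (0, 1]")
    return -math.log10(p_value) + cross_ref_score


example = adjusted_score(1e-4, 3)
print(round(example, 6))
```

Note that the result is on a -log10 scale, so larger values rank a SNP higher; it is a prioritisation score rather than a probability.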



GeneViewer.pl


GeneViewer.pl: Visualise overlapping features (e.g. exons, SNPs etc.) along e.g. your gene of interest (html output)




ld_expander.pl


Find proxies (SNPs in LD) for a list of SNPs


User specifies the r² and “LD window”


Currently configured to obtain proxies from HM CEU


Result is a list of additional proxy SNPs that have been obtained by LD expansion


DEMO


Note: don’t LD expand >150000 SNPs, or HapMap will ban you! CO’D has an alternative version that uses local pre-computed pairwise LD SNP files
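LD expansion against a local pairwise LD table can be sketched as below. This is an illustrative Python sketch, not ld_expander.pl or CO’D's alternative version: the function name `ld_expand`, the tuple layout of the pairwise records, and the default thresholds (r² of at least 0.8, window of 250 kb) are all assumptions made for the example.

```python
def ld_expand(query_snps, pairwise_ld, min_r2=0.8, max_distance=250_000):
    """Return {proxy_snp: set of query SNPs it tags}.

    pairwise_ld: iterable of (snp_a, pos_a, snp_b, pos_b, r2) records,
                 e.g. parsed from a pre-computed pairwise LD file.
    """
    query = set(query_snps)
    proxies = {}
    for snp_a, pos_a, snp_b, pos_b, r2 in pairwise_ld:
        # Apply the user-specified r2 threshold and LD window
        if r2 < min_r2 or abs(pos_a - pos_b) > max_distance:
            continue
        # A pair can tag a query SNP from either side
        for source, proxy in ((snp_a, snp_b), (snp_b, snp_a)):
            if source in query and proxy not in query:
                proxies.setdefault(proxy, set()).add(source)
    return proxies


# Hypothetical pairwise LD records
pairwise = [
    ("rs1", 1000, "rs2", 5000, 0.95),    # kept: strong LD, close together
    ("rs1", 1000, "rs3", 900000, 0.90),  # dropped: outside the LD window
    ("rs4", 2000, "rs1", 1000, 0.50),    # dropped: r2 too low
    ("rs5", 3000, "rs6", 4000, 1.00),    # ignored: neither SNP is a query SNP
]
proxies = ld_expand(["rs1"], pairwise)
print(proxies)
```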


STATA


Extremely powerful and flexible


>65k rows handled (shock horror!)


Can write scripts to automate tasks, e.g. read in file, do analysis, save results


When you use the GUI to run commands, the commands are shown in the command window, so you can save them in a do file


CO’D, EK and R. Anney strongly advocate this as a platform for both file manipulation and statistical analysis


STATA example using WTCCC data

http://www.wtccc.org.uk/

Bipolar Disorder, Coronary Artery Disease, Crohn's Disease, Hypertension, Rheumatoid Arthritis, Type 1 Diabetes, Type 2 Diabetes



DATA FORMAT


3 folders:


Basic


Each case collection against the pooled control groups 58C and UKBS


Combined cases


Combining phenotypically relevant case collections (e.g. RA/T1D, autoimmune)


Combined controls


Combining other case collections as controls


Data are split by chromosome


Questions


How do I get all of the chromosome data for my gene of interest into one file?



How do I easily search all of the SNP information for my gene(s) of interest?



Create a “.do” file for all manipulations that you want to carry out on the data


DEMO



Good starting resource: http://www.ats.ucla.edu/stat/stata/
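The logic behind the two questions above (pull the SNPs for each gene out of the per-chromosome files and stack them in one table) can be sketched outside STATA as well. The Python below is only an illustration of that logic, not the do file shown in the demo. It assumes tab-delimited per-chromosome files with a header row containing `rsid`, `pos` and `p`, which is a made-up layout rather than the actual WTCCC format; the gene names and coordinates are hypothetical too.

```python
import csv
import io


def collect_gene_snps(chrom_files, genes):
    """Gather the SNP rows inside each gene's interval into one list.

    chrom_files: {chromosome: open tab-delimited file with header rsid, pos, p}
    genes: {gene_name: (chromosome, start, stop)}
    """
    # Read each chromosome file once, even if several genes share it
    tables = {chrom: list(csv.DictReader(handle, delimiter="\t"))
              for chrom, handle in chrom_files.items()}
    combined = []
    for gene, (chrom, start, stop) in genes.items():
        for row in tables.get(chrom, []):
            if start <= int(row["pos"]) <= stop:
                combined.append({"gene": gene, "chrom": chrom, **row})
    return combined


# Inline stand-ins for two per-chromosome result files
chrom_files = {
    "1": io.StringIO("rsid\tpos\tp\nrs10\t60\t0.5\nrs11\t900\t0.01\n"),
    "22": io.StringIO("rsid\tpos\tp\nrs20\t150\t0.0004\nrs21\t450\t0.2\nrs22\t700\t0.9\n"),
}
genes = {"GENE_A": ("22", 100, 500), "GENE_B": ("1", 50, 80)}
combined = collect_gene_snps(chrom_files, genes)
for row in combined:
    print(row)
```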


VIM


“Vi Improved”. Mainly UNIX but cross-platform text editor (available for Windows).


Full list of commands is outside the scope of this demonstration


Very fast and efficient, esp. with search and replace functions on large datasets


Regular expression pattern matching


DEMO


Integrates with Cygwin (www.cygwin.com, a very useful UNIX emulator for Windows)


Group website


Some useful stuff up there!


Please send information about current projects etc. Good for our image as a group and minimal effort required on your part


DEMO


Conclusions


Small summary of some things you can do


Slides and video demonstrations will be online at:
http://www.medicine.tcd.ie/psychiatry/research/neuropsychiatry/Protocols/



CO’D & EK available for advice (Fridays 9-9.02am)


These things will help you in your work!!