R packages

arghtalent, Data Management

31 Jan 2013

“I think you should be more explicit here in step two.”

Figure omitted for copyright reasons

A printed version can be found at:

Leung YF, Lam DSC, Pang CP. The miracle of microarray data analysis.
Genome Biol. 2001 Aug 29; 2: 4021.1-4021.2.


~ Normal science consists largely of mopping-up
operations. Experimentalists carry out
modified versions of experiments that have
been carried out many times before ~


Thomas S. Kuhn


The biologist's FAQ:

What is the best microarray
analysis software?

Different kinds of microarray
software


Image analysis software


Data mining software


Statistics software


R packages for microarray analysis


SNPs analysis software


Database/ LIMS software


Public Expression Database


Primer design


Software for further data mining: annotation,
promoter analysis & pathway reconstruction

Software not discussed today



Hardware control software


Arrayer controlling


ArrayMaker


Scanner controlling/ Image acquisition

Statistics on current microarray software

Category              28 Feb 2002   Jan 2001
Image analysis             17           17
Data mining                39
R packages                 14
SNP analysis                1
Database/ LIMS             14            4
Public Database            16            8
Accessory                   8            -
Further data mining         9            -
Total                     116           29

* Extracted from http://ihome.cuhk.edu.hk/~b400559/arraysoft.html

Image analysis software


Spot recognition


Segmentation


Foreground calculation


Background calculation


Spot quality measures


Major image analysis software



AIDA array


ArrayPro



ArrayVision


Dapple


F-scan


GenePix Pro 3.0.5


ImaGene 4.0


Iconoclust


Iplab


Lucidea Automated
Spotfinder


Phoretix Array3


P-scan


QuantArray 3.0


ScanAlyze 2


Spot


TIGR Spotfinder


UCSF Spot


Examples of commonly used image
analysis software


ScanAlyze 2 (Mike Eisen, LBNL)


GenePix Pro 3.0.5 (Axon Instruments)


QuantArray 3.0 (Packard Instrument)


ImaGene 4.0 (Biodiscovery)

Spot recognition


ArrayPro from Media Cybernetics


Automated, fast grid, subgrid and spot-finding
algorithms

Segmentation


Purpose: classify pixels as foreground
or background


Fixed circle


Adaptive circle


Adaptive shape


Histogram method
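
As a rough illustration of the simplest method listed above, the fixed circle, here is a minimal Python sketch. The toy image, the function names and the choice of mean foreground minus median background are assumptions made for illustration; they are not taken from any of the packages named in these slides.

```python
from statistics import mean, median

def fixed_circle_segment(image, center_row, center_col, radius):
    """Split pixels into foreground (inside the circle) and background (outside)."""
    foreground, background = [], []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            inside = (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
            (foreground if inside else background).append(value)
    return foreground, background

def spot_intensity(image, center_row, center_col, radius):
    """Background-corrected spot signal: mean foreground minus median background."""
    fg, bg = fixed_circle_segment(image, center_row, center_col, radius)
    return mean(fg) - median(bg)

# Hypothetical 5x5 image with one bright spot in the middle.
image = [
    [10, 10, 10, 10, 10],
    [10, 10, 90, 10, 10],
    [10, 90, 100, 90, 10],
    [10, 10, 90, 10, 10],
    [10, 10, 10, 10, 10],
]
fg, bg = fixed_circle_segment(image, 2, 2, 1)
signal = spot_intensity(image, 2, 2, 1)
```

A fixed circle assumes every spot has the same size and shape, which is why the adaptive and histogram methods exist.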

Segmentation



Using an extra dye


DAPI; avoids assumptions about
spot morphology


UCSF Spot

Spot quality measure


E.g. QuantArray 3.0


Diameter


Spot Area


Footprint


Circularity


Spot Signal/Noise


Spot Uniformity


Background Uniformity


Replicate Uniformity


Problem: spot quality lacks a rigorous definition
and experimental verification
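
To make two of the measures above concrete, here is a small Python sketch using common textbook conventions: circularity as 4*pi*area/perimeter^2 and signal/noise as the foreground-background difference divided by the background standard deviation. These formulas are assumptions for illustration; the slides do not say how QuantArray defines its measures.

```python
import math
from statistics import mean, stdev

def circularity(area, perimeter):
    """Equals 1.0 for a perfect circle and is smaller for irregular shapes."""
    return 4 * math.pi * area / perimeter ** 2

def signal_to_noise(foreground, background):
    """Mean foreground above mean background, in units of background SD."""
    return (mean(foreground) - mean(background)) / stdev(background)

# Hypothetical shapes and pixel values.
circle_score = circularity(math.pi * 3 ** 2, 2 * math.pi * 3)  # circle, radius 3
square_score = circularity(4.0, 8.0)                           # square, side 2
snr = signal_to_noise([100, 110, 90], [10, 12, 8, 10])
```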

Future Image analysis software



Rigorous definitions of quality measures


Extra dye for better segmentation


Automated analysis


Data mining software


Main purposes

1. Filtering and normalization

2. Statistical inference of differentially
expressed genes

3. Identification of biologically meaningful
patterns, i.e. expression profile; expression
fingerprint/ signature

4. Visualization

5. Other analyses such as pathway reconstruction,
etc.
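
A minimal Python sketch of purpose 1, filtering and normalization, assuming two-channel data. The intensity cutoff, the gene names and the use of global median centring are illustrative assumptions, not a method prescribed by the slides.

```python
import math
from statistics import median

def log_ratios(spots, min_intensity=100.0):
    """Filter out weak spots, then return log2(red/green) per gene.

    spots: list of (gene, red, green) tuples.
    """
    kept = {}
    for gene, red, green in spots:
        if red >= min_intensity and green >= min_intensity:
            kept[gene] = math.log2(red / green)
    return kept

def median_normalize(ratios):
    """Global normalization: shift log ratios so their median is zero."""
    centre = median(ratios.values())
    return {gene: m - centre for gene, m in ratios.items()}

# Hypothetical spots; g4 falls below the intensity cutoff and is dropped.
spots = [
    ("g1", 800.0, 400.0),
    ("g2", 400.0, 400.0),
    ("g3", 1600.0, 400.0),
    ("g4", 50.0, 400.0),
]
normalized = median_normalize(log_ratios(spots))
```

Median centring rests on the assumption that most genes do not change between the two samples.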

Different categories



Turnkey system


Comprehensive software


Specific analysis software


Extension/ accessory of other software


Major data mining software


AIDA Array


AMADA


ANOVA program for microarray
data


ArrayMiner


arraySCOUT


ArrayStat


BRB ArrayTools


CHIPSpace


Cleaver


CIT


CLUSFAVOR


Cluster


Cyber T


DNA-arrays analysis tools


dchip


Expression Profiler


Expressionist


Freeview & FreeOView


Gene Cluster



GeneLinker Gold


GeneMaths


GeneSight


GeneSpring


Genesis


Genetraffic


J-Express


MAExplorer


Partek


R cluster


Rosetta Resolver


SAM


SpotFire Decision Site


SNOMAD


TIGR ArrayViewer


TIGR Multiple Experiment Viewer


TreeView


Xcluster


Xpression NTI




Turnkey system


Definition: A computer system that has been
customized for a particular application. The term
derives from the idea that the end user can just
turn a key and the system is ready to go.


For microarrays, this includes everything from the OS,
server software, database and client software to
statistics software and even hardware


Examples


Genetraffic (Iobion)


Uses open-source software: Linux, the R statistical
language, PostgreSQL and the Apache web server


Rosetta Resolver (Rosetta Biosoftware)


Sun Fire server and drive array, Oracle 8i, Rosetta server and
client side software

Turnkey system




Advantages


performance


Security


Support multiple users


Incorporate the experiment and data standards in design


Disadvantages


Expensive


Not suitable for small labs


Require dedicated supporting staff


Closed system

Comprehensive software


Definition: software that incorporates many
different analyses for different stages in a
single package.


Examples


Cluster (Mike Eisen, LBNL)


GeneMaths (Applied Maths)


GeneSight (Biodiscovery)


GeneSpring (Silicon Genetics)

Comprehensive software


Cluster


Filter data


Adjust data: normalization, log
transform, etc.


Clustering


Self-Organizing Maps
(SOMs)


Principal Component
Analysis (PCA)


GeneSpring


& Promoter analysis


Gene annotation with
public database
information


Scripting tools


Access Open DataBase
Connectivity (ODBC)
databases
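
The clustering step listed for Cluster above can be sketched in a few lines of Python. This is a generic average-linkage hierarchical clustering with a Pearson-correlation distance, written for illustration only; the gene names and profiles are hypothetical, and nothing here reproduces the code of any package named in these slides.

```python
from math import sqrt

def pearson_distance(x, y):
    """1 - Pearson correlation; 0 for identical shapes, 2 for opposite ones.
    Profiles with zero variance are not handled (division by zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

def average_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [pearson_distance(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)  # average pairwise distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical expression profiles over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
merges = average_linkage(profiles)
```

The merge history is what a dendrogram viewer would draw: geneA and geneB join first, and geneC joins last at a distance close to 2.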

Comprehensive software


GeneMaths


& Bootstrap analysis
for clustering


Fast clustering
algorithms


Access Open DataBase
Connectivity (ODBC)
databases


GeneSight


& confidence analysis
for replicated data


statistical analysis for
significant genes


Graphical data set
builder


Comprehensive software


Advantages


Standardized operation


Generates various analyses easily


Shorter learning curve for biologists


Script language for automated process control


Some brilliant ideas or analysis within
particular software


“False” Sense of security?

Comprehensive software


Disadvantages


Slow to incorporate the latest analysis developments


Generates various analyses too easily


Data analysis/ statistics background and
definitions are left implicit


Proprietary script language


Data compatibility with other software


Necessity to design and maintain your own database


Commercial software can be expensive!


Analyses added for marketing
purposes mean extra spending on unnecessary functions


Sometimes only available on a few computing platforms


Specific analysis software


Definition: software performing one or a few
specific analyses


Examples


GeneCluster (Whitehead Institute Center
for Genome Research)


INCLUSive: INtegrated CLustering, Upstream
Sequence retrieval and motif Sampler
(Katholieke Universiteit Leuven)


SAM: Significance Analysis of Microarrays
(Stanford University)


Specific analysis software


GeneCluster


performs normalization,
filtering and SOM

Specific analysis software


INCLUSive: INtegrated CLustering,
Upstream Sequence retrieval and motif
Sampler


SAM: finds statistically significant
differentially expressed genes
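
To give a flavour of how a significance analysis of this kind works, here is a simplified per-gene Python sketch: a t-like statistic with a small constant s0 added to the denominator, assessed by exhaustively permuting sample labels. It is only loosely modelled on the idea behind SAM; the real program works across all genes and estimates a false discovery rate. The value of s0 and the data are arbitrary assumptions.

```python
from itertools import combinations
from math import sqrt

def d_statistic(group1, group2, s0=0.1):
    """Difference of means divided by (pooled standard error + s0)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    ss = sum((x - m1) ** 2 for x in group1) + sum((x - m2) ** 2 for x in group2)
    s = sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def permutation_p_value(group1, group2, s0=0.1):
    """Fraction of label assignments whose |d| is at least the observed |d|."""
    values = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(d_statistic(group1, group2, s0))
    hits = total = 0
    for idx in combinations(range(len(values)), n1):
        g1 = [values[i] for i in idx]
        g2 = [values[i] for i in range(len(values)) if i not in idx]
        total += 1
        if abs(d_statistic(g1, g2, s0)) >= observed - 1e-9:
            hits += 1
    return hits / total

# Hypothetical log expression values for one gene, three replicates per group.
treated = [5.1, 5.3, 5.2]
control = [3.0, 3.2, 3.1]
d = d_statistic(treated, control)
p = permutation_p_value(treated, control)
```

With three replicates per group there are only 20 label assignments, so the smallest attainable two-sided p-value is 2/20; this is one reason replicate number matters.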

Specific analysis software


Advantages


Better statistical background reference, usually
with literature support


Disadvantages


Non-standardized environment:
Java, web, Excel, etc.


Data compatibility problem


Data preprocessing problem

Extension/ accessory of other
software



Definition: extension of other software’s
capability


Examples:


Freeview: Visualization and Optimization of
Gene Clustering Dendrograms for Cluster


ArrayMiner: extension of GeneSpring

Statistics software


Excel


MATLAB


Octave


SAS


SPSS


S-PLUS


Statistica


R

Statistics software


Advantages


Highly flexible


High level, multivariate analyses are either
standard or easily programmable


Disadvantages


Usually command-line driven, impossible to
learn intuitively (a disadvantage?)


Requires a much better understanding of
statistical data analysis to follow the steps (a
disadvantage?)



R packages


A language and environment for statistical
computing and graphics.


Highly compatible with S/ S-PLUS


Open source under the GNU General Public
License


Runs on many UNIX, Linux, Windows
and MacOS platforms


A growing number of microarray
analysis packages are written in R

R packages


Dedicated to
microarray analysis


affy


Bioconductor


SMA extension


Cyber T


GeneSOM


Permax


OOMAL (S-PLUS)


SMA


YASMA


General packages


cclust


cluster


mclust


multiv


mva


…etc!

R packages


SMA: Statistical Microarray Analysis
(Terry Speed, UC Berkeley)


Bioconductor

R packages


SMA


Performs intensity- and spatially-dependent
normalization


Replicated array data analysis by an empirical
Bayes approach
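
Intensity-dependent normalization can be illustrated with a deliberately crude Python sketch: spots are binned by average log intensity A, and each bin's median log ratio M is subtracted. Binning is a stand-in assumed here for the smooth curve fitting such methods normally use; spatial normalization is not covered, and none of this is the SMA package's code.

```python
import math
from statistics import median

def ma_values(red, green):
    """M = log ratio, A = average log intensity of a two-channel spot."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a

def binned_intensity_normalize(spots, n_bins=2):
    """Subtract the median M within each intensity (A) bin.

    spots: list of (red, green). Returns normalized M values in input order.
    """
    ma = [ma_values(r, g) for r, g in spots]
    order = sorted(range(len(ma)), key=lambda i: ma[i][1])  # sort by A
    normalized = [0.0] * len(ma)
    bin_size = math.ceil(len(order) / n_bins)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        centre = median(ma[i][0] for i in members)
        for i in members:
            normalized[i] = ma[i][0] - centre
    return normalized

# Hypothetical data: weak spots carry a dye bias of M = 1, strong spots none.
spots = [
    (200.0, 100.0), (400.0, 200.0), (800.0, 200.0),           # low intensity
    (4000.0, 4000.0), (8000.0, 8000.0), (16000.0, 4000.0),    # high intensity
]
normalized = binned_intensity_normalize(spots)
```

A single global shift could not remove this bias, because it differs between weak and strong spots.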

R packages


Result for replicated data: output as a B vs. M plot

R packages


Bioconductor


open source software project to provide infrastructure
in terms of design and software to assist biologists and
statisticians for analysing genomic data, with primary
emphasis on inference using DNA microarrays


Most software produced by the Bioconductor project
will be in the form of R libraries


Variation 1: provide basic infrastructure support that will help
other developers produce high quality software


Variation 2: provide innovative methodology for analyzing
genomic data


Provide some form of graphical user interface
for selected libraries


A mechanism for linking together different
groups with common goals

Future Data mining software


Standardized, open-source (free) platform?


EMBOSS: European Molecular Biology Open
Software Suite.


More supervised analysis and
pathway prediction packages?


Plugin modules


J-Express


GeneSpring


Mutation analysis software


Chip based SNP or chromosomal aberration
analysis (arrayCGH)


Various forms of protocols, e.g. primer
extension, ligase chain reaction, MALDI-TOF-MS,
hybridization, etc.


Result is in the form of base calling or
allelic imbalance


Example


genorama




Definition: large collection of data organized
especially for rapid search and retrieval


Two categories


Within laboratory/ institute database; LIMS


Public expression database


Standardized definition of data


Minimum Information About a Microarray Experiment
(MIAME)


Experimental design


Array design


Samples


Hybridizations


Measurements


Normalization controls
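
As a loose illustration of how the six MIAME sections listed above could be tracked inside a lab database, here is a small Python sketch. The class, field names and free-text values are hypothetical; this is not the official MIAME checklist format or any exchange schema.

```python
from dataclasses import dataclass, fields

@dataclass
class MiameRecord:
    """One free-text field per MIAME section named on the slide."""
    experimental_design: str = ""
    array_design: str = ""
    samples: str = ""
    hybridizations: str = ""
    measurements: str = ""
    normalization_controls: str = ""

def missing_sections(record):
    """Names of the sections that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]

# Hypothetical, partially annotated experiment.
record = MiameRecord(
    experimental_design="time course, 3 replicates",
    samples="yeast, heat shock",
)
missing = missing_sections(record)
```

Checking completeness at data entry is cheaper than reconstructing missing annotation when the data are submitted to a public repository.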


Database

Database/ LIMS software


The database within your lab/ institute


The quality of in house data management
will affect the quality of final public data
repository


Database structure may be relatively
simple



Major Database/ LIMS software


AMAD


ARGUS


ArrayDB


ArrayInformatics


Clonetracker


GeNet


Genetraffic


GeneX


MAD


Maxd


NOMAD


Partisan Array LIMS


Phoretix Array2 Database


Rosetta Resolver


SMD

Public Expression Database


Necessities


Provide raw data to validate published array
results and develop new analysis tools


Further understanding of your data


Compare among different groups, meta-data
mining


Source for specialty array design



Different categories


Generic


Species specific


Disease specific


The importance of data standardization


Major public gene expression
databases


3D-GeneExpression
Database


ArrayExpress


BodyMap


ChipDB


ExpressDB


Gene Expression Omnibus
(GEO)


Gene Expression Database
(GXD)


Gene Resource Locator


GeneX


Human Gene Expression
Index (HuGE Index)


RIKEN cDNA Expression
Array Database (READ)


RNA Abundance
Database (RAD)


Saccharomyces Genome
Database (SGD)


Stanford Microarray
Database (SMD)


TissueInfo


yeast Microarray Global
Viewer (yMGV)

Primer/ probe design


Array designer


GAP (Genome-wide Automated Primer
finder servers)


OligoArray


Primer3


ProbeWiz Server



Other useful software for further
data mining


Data annotation


DRAGON


Gene Ontology


PubGene


Resourcerer


Promoter analysis


AlignACE


INCLUSive


MEME


Sequence Logo


Pathway reconstruction


GenMAPP


PathFinder


Data annotation


Link GI to a particular name


Literature mining to infer network


Network reconstruction


Cluster + promoter analysis


statistical inference from experimental data

Some suggestions for biologists who
are serious about microarray studies


Communicate or even collaborate with
Statisticians, Mathematicians and
bioinformaticians


Learn a high level statistical language, e.g. R


Learn programming, e.g. C


Learn database, e.g. SQL


Learn Linux


Revise your statistics, probability and maybe even
calculus


Lucky…?!

Picture omitted for copyright reasons

Conclusion


the future


A unified open environment for standard
analysis and development


The best microarray analysis software?



~ Exploratory data analysis can never be the
whole story, but nothing else can serve as
the foundation stone -- as the first step. ~

John W. Tukey