Bioinformatics for Stem Cell


2 Oct 2013


Lecture 1

Debashis Sahoo, PhD

Outline


Introduction


History of Bioinformatics


Introduction to computing


Data collection


Experiment design


Data analysis

Bioinformatics Definition


Biological Data


Representation


Storage


Access


Processing


bi∙o∙in∙for∙mat∙ics [bahy-oh-in-fer-mat-iks]

noun (used with a singular verb) the retrieval and analysis of biochemical and biological data using mathematics and computer science, as in the study of genomes.


http://www.merriam-webster.com/dictionary/bioinformatics

http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html


“It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at peoples' fingertips, that it won't be pretty much working on refinements of well-explored things. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level.”

Professor Donald E. Knuth

The "father" of the analysis of algorithms

He is the author of the seminal multi-volume work The Art of Computer Programming.

HISTORICAL PERSPECTIVE

History of Bioinformatics


Gregor Mendel (1866, Verhandlungen des naturforschenden Vereins Brünn)


1951: structure for the alpha-helix and beta-sheet


Pauling and Corey (PNAS, 1951)


1953: double helix model for DNA


Watson and Crick (Nature, 171: 737-738, 1953)


1955: protein sequence of bovine insulin


F. Sanger.

History of Bioinformatics


1958-1990: Revolution in Computer Science and Engineering


Computer, email, network, internet


1990: BLAST


Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.


1995: The Haemophilus influenzae genome (1.8 Mb) is sequenced.


1993-2013: Microarrays


2005-2013: High-throughput sequencing

INTRODUCTION TO COMPUTING

What is a computer?

Turing Machine (1936)

[Figure: a controller drives a read/write head over a tape holding the symbols 1 0 0 1 0 1 1 0]

Alan Turing, "On computable numbers, with an application to the Entscheidungsproblem", Proceedings of the London Mathematical Society, Series 2, 42 (1937), pp. 230-265.
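
The controller, read/write head, and tape can be made concrete with a small simulation. The following is a minimal sketch, not a formal definition: the function name `run_turing_machine`, the rule-table layout, the blank symbol `_`, and the bit-flipping rule table `FLIP_BITS` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def run_turing_machine(rules, tape, state="start", halt="halt", max_steps=1000):
    """Simulate a one-tape Turing machine.

    rules maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is -1 (left), 0 (stay) or +1 (right).
    """
    # The tape is unbounded: any cell not yet written reads as the blank "_".
    cells = defaultdict(lambda: "_", enumerate(tape))
    head = 0
    for _ in range(max_steps):
        if state == halt:
            break
        # The controller looks up what to do for the current state and symbol.
        write, move, state = rules[(state, cells[head])]
        cells[head] = write
        head += move
    lo, hi = min(cells), max(cells)
    return "".join(cells[i] for i in range(lo, hi + 1)).strip("_")


# Hypothetical rule table: flip every bit, then halt at the first blank cell.
FLIP_BITS = {
    ("start", "0"): ("1", +1, "start"),
    ("start", "1"): ("0", +1, "start"),
    ("start", "_"): ("_", 0, "halt"),
}

result = run_turing_machine(FLIP_BITS, "10010110")
print(result)  # 01101001
```

The input tape is the one shown in the figure; the machine never needs more than a lookup table and one cell of the tape at a time.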

Modern Computer

Processor

Main Memory

Disk Drives

IO controller

Display

Keyboard

Mouse

What is a Computer Program?

Executable file

Load to Memory

Run the program

C Program

Assembly Program

DATA COLLECTION

Public Databases


Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/


ArrayExpress: http://www.ebi.ac.uk/arrayexpress


National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/pubmed


UCSC Genome Browser: http://genome.ucsc.edu


The Human Protein Atlas: http://www.proteinatlas.org/


Catalogue of Somatic Mutations in Cancer (COSMIC): http://www.sanger.ac.uk/genetics/CGP/cosmic/


The Cancer Genome Atlas (TCGA): https://tcga-data.nci.nih.gov/tcga/

EXPERIMENT DESIGN

To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.

- R. A. Fisher

http://graphpad.com/guides/prism/5/user-guide/prism5help.html

Independent Samples


Statistical tests are based on the assumption that each subject was sampled independently.


Provides maximum amount of information.


Provides better estimation of the mean.

The Gaussian Approximation

Everybody believes in the normal approximation, the experimenters because they think it is a mathematical theorem, the mathematicians because they think it is an experimental fact.

G. Lippmann (1845-1921)

Sample Size Estimation

DATA ANALYSIS

Correlation

Hypothesis Testing


Randomly select samples from the population


State the null hypothesis


The distributions of values in the two populations are the same


Perform the statistical test


t test, F test, chi-square test


Get the P-value


Set a threshold (usually < 0.05) for significance
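
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only: instead of looking up a t distribution, it estimates the P-value by permutation (shuffling the group labels), using Welch's t statistic as the test statistic. The function names, the sample values, the seed, and the number of permutations are all assumptions made up for the example.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided P-value under the null that both groups share one distribution.

    Counts how often randomly relabelled data give |t| at least as large
    as the observed |t|.
    """
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up measurements for two groups (hypothetical data).
control = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8]
treated = [6.2, 6.8, 5.9, 6.5, 7.0, 6.1]

p = permutation_p_value(control, treated)
significant = p < 0.05  # threshold chosen before looking at the data
print(p, significant)
```

The sketch assumes the values within each relabelled group are not all identical; otherwise the standard error is zero and the statistic is undefined.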

Multiple Comparisons


The Bonferroni correction


P < 0.05/N (N = number of comparisons)


False Discovery Rate (FDR)


Q value


What fraction of all the discoveries are false?


Q = 10%, N = 100, smallest p-value < Q/N = 0.001


http://genomics.princeton.edu/storeylab/qvalue/


Permutation-based approaches
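
To make the two corrections concrete, here is a minimal sketch that applies both to the same list of P-values. The FDR part uses the Benjamini-Hochberg step-up procedure, which is an assumption on my part: it is a simpler relative of the Q-value method linked above, not the same algorithm. Its first step (rank 1) is the "smallest p-value < Q/N" check from the slide. The function names and the P-values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests with p < alpha / N, where N is the number of comparisons."""
    n = len(p_values)
    return [p < alpha / n for p in p_values]


def benjamini_hochberg(p_values, q=0.10):
    """Flag discoveries while controlling the false discovery rate at level q."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p-value is <= k * q / n.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / n:
            cutoff = rank
    # Everything ranked at or below the cutoff counts as a discovery.
    keep = set(order[:cutoff])
    return [i in keep for i in range(n)]


# Hypothetical P-values from N = 10 comparisons.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]

n_bonferroni = sum(bonferroni(p_values))
n_fdr = sum(benjamini_hochberg(p_values))
print(n_bonferroni, n_fdr)  # 1 5
```

On these numbers Bonferroni keeps only the smallest P-value, while the FDR procedure keeps five, which illustrates why FDR control is usually preferred when thousands of genes are tested at once.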