Course Syllabus
Brandeis University
Division of Graduate Professional Studies
Rabb School of Continuing Studies
I. Course Information
1. Biological Data Mining and Modeling
2. 133RBIF-112-1DL
3. Distance Learning Course
Week 1 starts on Wednesday, Sep. 18, 2012. The course week runs Wednesday through Tuesday. The last week ends November 26, 2012.
4. Instructor’s Contact Information
Madhu Natarajan, PhD
nataraja@brandeis.edu
Please use email to arrange appointments.
5. Document Overview
This syllabus contains all relevant information about the course: its objectives and outcomes, the grading criteria, the texts and other materials of instruction, and the weekly topics, outcomes, assignments, and due dates. Consider this your roadmap for the course. Please read through the syllabus carefully and feel free to share any questions that you may have. Please print a copy of this syllabus for reference.
6. Course Description
The development of new bioinformatics tools typically involves some form of data modeling, prediction or optimization. This course introduces various modeling and prediction techniques including linear and nonlinear regression, principal component analysis, support vector machines, self-organizing maps, neural networks, set enrichment, Bayesian networks, and model-based analysis.
This course is not intended to explore the intricacies of analysis methods and/or algorithm development, but to explore how to use different approaches to analyze biological data and extract some insight into biology.
The didactic part of this course is designed to introduce you to (a) various analysis techniques and methods, (b) tool kits for implementing these methods, and (c) some of the biological/experimental methods that provide the data for analysis using (a) and (b).
Students will be introduced to examples of analysis from the scientific literature, and are actively encouraged to identify new examples/sources and bring these to the class for discussion. Students are also expected to contribute to weekly discussions, especially around the pros and cons of methods, identifying when methods fail, and how these translate into real-life expectations of the practicing bioinformatician.
It is important to realize that distance learning does not imply learning in isolation: communication is crucial to success in a DL course and provides opportunities for self-exploration, collaboration with peers, and learning from your own "aha" moments when you learn by asking probing questions. I look forward to our discussions during these ten weeks.
Prerequisites: Probability & Statistics; Proficiency in R programming, RBIF 111.
7. Materials of Instruction
a. Required Texts
Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Eds. R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Springer 1st Edition, 2005. ISBN 0-387-25146-4
b. Required Software
R version 3.0.1 (see http://www.r-project.org/)
c. Optional Text(s) / Journals
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome H. Friedman, Springer-Verlag, New York, 2009.
This book is highly recommended. It makes for very dense reading, so I am not using it as a textbook, but it is a very useful reference manual for this course and for all practicing bioinformaticians.
Handbook of Parametric and Nonparametric Statistical Procedures, D. J. Sheskin, Chapman & Hall/CRC 3rd Edition, 2003. ISBN-10 1584884401; ISBN-13 978-1584884408
Pattern Classification, R.O. Duda, P.E. Hart, and D.G. Stork, Wiley-Interscience 2nd Edition, 2000. ISBN 0-471-05669-3
A Handbook of Statistical Analyses Using R, B.S. Everitt and T. Hothorn, Chapman and Hall/CRC 1st Edition, 2006. ISBN 1-584-88539-4
d. Online Course Content
This is a Distance Learning (DL) course, which will be hosted at Brandeis’ LATTE site, available at http://latte.brandeis.edu. The site contains the course syllabus, weekly topic notes, assignments, and discussion forums through which we will communicate during this course.
8. Overall Course Objectives
This course is intended to provide students with an understanding of:
What methods are commonly used for data analysis
How analysis results are interpreted in the context of drug discovery and development
How specific software tools are applied to data modeling
9. Overall Course Outcomes
At the end of the course, students will be able to:
1. Find the appropriate method for common data analysis problems
2. Have a sense of where to look if these methods are insufficient
3. Be familiar with the application of commonly used software tools
4. Be able to compose a meaningful report of the analysis
10. Course Grading Criteria
Percentages earned per assignment:
Percent    Component
N/A        Course questionnaire
50%        Homework problem sets (5 weeks)
30%        Discussion and online class participation (10 weeks)
20%        Final project
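As an illustration only, the weighting in the table can be written out as a short calculation. The function name, argument names, and sample scores are hypothetical, and the sketch assumes each component is scored on a 0-100 scale, which this syllabus does not state explicitly.

```python
# Weights taken from the grading table; the course questionnaire is ungraded (N/A).
WEIGHTS = {"homework": 0.50, "participation": 0.30, "final_project": 0.20}


def course_grade(component_scores):
    """Return the weighted course percentage.

    component_scores maps each component name to a 0-100 score
    (an assumed scale, used here only for illustration).
    """
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Hypothetical scores, not real student data.
example = course_grade({"homework": 90, "participation": 80, "final_project": 85})
print(round(example, 1))
```

Because the three weights sum to 100%, a student who earns full marks on every graded component reaches 100 without the questionnaire contributing.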
b. Grading Criteria for Discussions/Online Participation (100 raw points total per week, translating to 3% of the total course grade per week)
Per GPS guidelines you must post on three different calendar days of the course week. Failure to do so will result in a 10-point deduction from your total participation points for the week.
There will be two discussion topics posted each week. The 100 points for discussion each week are divided into two 35-point responses to instructor posts and one 30-point response to a peer post.
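The weekly point arithmetic can be sketched as follows. This is an illustrative reading of the rules above, not an official grade calculator: the function and parameter names are hypothetical, and flooring the weekly total at zero is an assumption.

```python
def weekly_participation(instructor_scores, peer_score, posting_days):
    """Raw participation points (out of 100) for one course week.

    instructor_scores: two scores, each out of 35, for responses to instructor posts.
    peer_score: one score out of 30 for the response to a peer post.
    posting_days: number of distinct calendar days with a post that week.
    """
    if len(instructor_scores) != 2:
        raise ValueError("expected two instructor-post responses")
    if any(not 0 <= s <= 35 for s in instructor_scores) or not 0 <= peer_score <= 30:
        raise ValueError("score outside its allowed range")
    total = sum(instructor_scores) + peer_score
    if posting_days < 3:
        total -= 10  # deduction for posting on fewer than three calendar days
    return max(total, 0)  # flooring at zero is an assumption


def share_of_course_grade(raw_points):
    """Convert a week's raw points to course-grade percentage points (100 raw = 3%)."""
    return raw_points * 3 / 100
```

For example, scores of 35 and 30 on the instructor posts and 25 on the peer response, posted on only two calendar days, give 90 - 10 = 80 raw points, or 2.4 percentage points of the course grade for that week.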
Exceptional posts are those that (for example):
o Provide/include original analysis of the course material,
o Provide/include analysis of the same methods on novel data sets,
o Provide/include appended code that runs without errors,
o Provide/include extrapolation and analysis of where methods successfully worked and where they did not,
o Provide appropriate citation of references,
o Are well-written (grammar/spelling).
Responses to peer posts will be graded on the same criteria, with the additional requirements that a response must clearly identify the original author/message to which it responds and provide novel insight beyond a simple “I agree”. In layman's terms, the response must clearly go beyond being the equivalent of a +1 or a “Like” post.
Any disagreements in discussion about analysis, interpretation or results MUST be polite and constructive. This is a critical and absolute requirement.
11. Academic Integrity
http://www.brandeis.edu/studentaffairs/srcs/ai/index.html
All students are expected to read and understand the guidelines on the Academic Honesty and Student Integrity website linked above. If any part of this is not clear, please contact your instructor immediately.
II. Course Information
Week 1 (Sep 18-24)
Introduction to biological data mining and modeling
On the differences between data mining and modeling
An introduction to regression
An introduction to model building
Understanding the predictive power of modeling
o When models go wrong
Overview of the field of biological data mining: applications, challenges, future directions.
Week 2 (Sep 25-Oct 1)
Uncertainty in Biology – Causes, concerns, approaches to deal with uncertainty.
Introduction to data visualization
Introduction to normalization
Introduction to high throughput biology
High throughput technologies
What can we reliably measure and what can it tell us about the cell?
a. Target-based compound screening,
b. Cell-based screening,
c. High content screens,
d. Large scale RNAi screens.
Statistical analysis of screens, Z and Z’ factor, data visualization and integration.
Week 3 (Oct 2-8)
Unsupervised methods – Part I
Hierarchical clustering
Principal component analysis
Independent component analysis
Introduction to transcription data
RNA profiling technologies,
Experimental design,
Data normalization,
Application of clustering and dimension reduction methods
Week 4 (Oct 9-15)
Unsupervised methods – Part II
Unsupervised: hierarchical clustering, principal component analysis, independent component analysis
Set enrichment methods,
Meta analysis of microarray data to build on identified patterns
Week 5 (Oct 16-22)
Supervised methods, model assessment and selection
Linear methods for regression and classification:
Linear discriminant analysis,
Logistic regression,
Naïve Bayes classifier,
Nearest-neighbor method
Week 6 (Oct 23-29)
Supervised methods, model assessment and selection – Part II
Regression and classification trees,
Neural networks,
Support vector machines
Model assessment and selection: AIC, BIC, cross-validation
Week 7 (Oct 30-Nov 05)
Meta methods
Boosting trees,
Model averaging and bagging,
Random forest
Week 8 (Nov 06-12)
Integration and meta-analysis of high throughput datasets
Biological databases, set enrichment methods, text-mining
Proteomics
Review of proteomics technologies: 2D gels, mass spectrometry, protein arrays, 2-hybrid methods, post-translational modification detection
Analysis of protein networks, network properties.
Protein pathways and their interaction
Protein pathway compendia
Pathway comparison metrics and applications
Data integration examples: relevance networks, machine learning.
Week 9 (Nov 13-19)
Principles of biological networks.
Reconstruction of networks - graphical models:
a. Boolean networks,
b. Co-regulation networks,
c. Bayesian networks.
Dynamic network inference
Week 10 (Nov 20-26)
Mechanistic modeling of biological systems.
Principles of mechanistic modeling: mass balance, chemical reaction systems, flux balance analysis
Deterministic and probabilistic modeling approaches to specific common biological problems.
Written final project due in LATTE.
III. Course Policies and Procedures
I. Late Policies
Discussion responses will be accepted late with a 5 (raw) point deduction per day.
Homework assignments will be accepted late with a 5 (raw) point deduction per day after the deadline.
Substantive responses to discussion posts will not be accepted after the deadlines.
II. Work Expectations
Expect to spend about 2-4 hours per week reading the course material and anywhere from 4-8 hours doing homework, responding to discussion posts, etc. Plan ahead to make sure your tasks are completed in a timely manner.
The final project will be an amalgamation of tasks accomplished throughout the course and will take an additional 4-24 hours of work.
I will post the deadlines for each week's expectations weekly. A cumulative list of all deadlines will also be posted in Week 1.
III. Feedback
Homework and class participation grades will typically be posted within a week of
completion of tasks.
IV. Confidentiality
In the course of the class, some of you may want to post examples of real data from your day jobs or other sources. Please remember that you must not share any information that is confidential, proprietary or in any way embargoed from public disclosure.
Please refrain from discussing your peers’ work or your interactions with peers outside the classroom.