Course Syllabus: Brandeis University, Division of Graduate Professional Studies, Rabb School of Continuing Studies



I. Course Information


1. Biological Data Mining and Modeling


2. 133RBIF-112-1DL


3. Distance Learning Course

Week 1 starting on Wednesday, Sep. 18, 2012.

The course week runs Wednesdays through to Tuesdays.

The last week ends November 26, 2012.

4. Instructor's Contact Information

Madhu Natarajan, PhD

nataraja@brandeis.edu

Please use email to arrange appointments.

5. Document Overview


This syllabus contains all relevant information about the course: its objectives and
outcomes, the grading criteria, the texts and other materials of instruction, and the
weekly topics, outcomes, assignments, and due dates.

Consider this your roadmap for the course. Please read through the syllabus carefully
and feel free to share any questions that you may have. Please print a copy of this
syllabus for reference.

6. Course Description




The development of new bioinformatics tools typically involves some form of data
modeling, prediction or optimization. This course introduces various modeling and
prediction techniques including linear and nonlinear regression, principal component
analysis, support vector machines, self-organizing maps, neural networks, set
enrichment, Bayesian networks, and model-based analysis.




This course is not intended to explore the intricacies of analysis methods or algorithm
development, but to explore how to use different approaches to analyze biological data
and extract some insight into biology.




The didactic part of this course is designed to introduce you to (a) various analysis
techniques and methods, (b) tool kits for implementing these methods, and (c) some
biological/experimental methods that provide the data for analysis using (a) and (b).
Students will be introduced to examples of analysis from the scientific literature, and
are actively encouraged to identify new examples/sources and bring these to the class
for discussion.

Students are also expected to contribute to weekly discussions, especially around the
pros and cons of methods, identifying when methods fail, and how these translate into
real-life expectations of the practicing bioinformatician. It is important to realize
that distance learning does not imply learning in isolation: communication is crucial
to success in a DL course and provides opportunities for self-exploration,
collaboration with peers, and learning from your own A-Ha moments when you learn by
asking probing questions.

I look forward to our discussions during these ten weeks.




Prerequisites: Probability & Statistics; Proficiency in R programming, RBIF 111.


7. Materials of Instruction


a. Required Texts

Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Eds. R.
Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Springer 1st Edition, 2005.
ISBN 0-387-25146-4

b. Required Software

R version 3.0.1 (see http://www.r-project.org/)

c. Optional Text(s) / Journals

The elements of statistical learning: data mining, inference, and prediction, Trevor
Hastie, Robert Tibshirani, Jerome H. Friedman, Springer-Verlag, New York, 2009.
This book is highly recommended. It makes for very dense reading, so I am not using it
as a textbook, but it is a very useful reference manual for this course and for all
practicing bioinformaticians.

Handbook of Parametric and Nonparametric Statistical Procedures, D. J. Sheskin,
Chapman & Hall/CRC 3rd Edition, 2003. ISBN-10 1584884401; ISBN-13 978-1584884408

Pattern Classification, R.O. Duda, P.E. Hart, and D.G. Stork, Wiley-Interscience 2nd
Edition, 2000. ISBN 0-471-05669-3

A Handbook of Statistical Analyses Using R, B.S. Everitt and T. Hothorn, Chapman and
Hall/CRC 1st Edition, 2006. ISBN 1-584-88539-4

d. Online Course Content




This is a Distance Learning (DL) course, which will be hosted at Brandeis’ LATTE site,
available at http://latte.brandeis.edu. The site contains the course syllabus, weekly topic
notes, assignments, and discussion forums through which we will communicate during
this course.


8. Overall Course Objectives

This course is intended to provide students with an understanding of:

What methods are commonly used for data analysis

How analysis results are interpreted in the context of drug discovery and development

How specific software tools are applied to data modeling

9. Overall Course Outcomes

At the end of the course, students will be able to:

1. Find the appropriate method for common data analysis problems

2. Have a sense of where to look if these methods are insufficient

3. Be familiar with the application of commonly used software tools

4. Compose a meaningful report of the analysis


10. Course Grading Criteria

Percentages earned per assignment:

Percent   Component
N/A       Course questionnaire
50%       Homework problem sets (5 weeks)
30%       Discussion and online class participation (10 weeks)
20%       Final project

b. Grading Criteria for Discussions/Online Participation (100 raw points total per week,
translating to 3% of total course per week)



Per GPS guidelines you must post on three different calendar days of the course week.
Failure to do so will result in a 10 point deduction from your total participation
points for the week.

There will be two discussion topics posted each week.

The 100 points for discussion each week are divided into two 35 point discussion
responses to instructor posts, and one 30 point response to a peer post.



Exceptional posts are those that (for example)

o Provide/include original analysis of the course material,
o Provide/include analysis of the same methods on novel data sets,
o Provide/include appended code that runs without errors,
o Provide/include extrapolation and analysis of where methods successfully worked and
  where they did not,
o Provide appropriate citation of references,
o Are well-written (grammar/spelling).



Responses to peer posts will be graded on the same criteria above, with the additional
requirements that they clearly identify the original author/message to which the post
responds, and provide novel insight beyond a simple “I agree”.

In layman's terms, the response must clearly go beyond being the equivalent of a +1 or
a “Like” post.

Any disagreements in discussion over analysis, interpretation or results MUST be
polite and constructive. This is a critical and absolute requirement.
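
As a worked illustration of the weekly participation arithmetic described above, here is a minimal Python sketch. The function and constant names are hypothetical, and two points are assumptions rather than stated policy: that the 10 point deduction is applied once per week, and that a weekly total cannot fall below zero.

```python
# Point values taken from the participation criteria above.
INSTRUCTOR_MAX = 35        # each of the two responses to instructor posts
PEER_MAX = 30              # the single response to a peer post
REQUIRED_DAYS = 3          # distinct calendar days with at least one post
DAY_PENALTY = 10           # deduction if the posting-day requirement is missed
WEEKLY_COURSE_PERCENT = 3.0  # 100 raw points correspond to 3% of the course


def weekly_participation(instructor_scores, peer_score, posting_days):
    """Return (raw_points, course_percent) for one week of discussion.

    instructor_scores: two scores, each between 0 and 35
    peer_score:        one score between 0 and 30
    posting_days:      iterable of day labels on which the student posted
    """
    if len(instructor_scores) != 2:
        raise ValueError("expected exactly two instructor-post scores")
    if any(not 0 <= s <= INSTRUCTOR_MAX for s in instructor_scores):
        raise ValueError("instructor-post scores must be between 0 and 35")
    if not 0 <= peer_score <= PEER_MAX:
        raise ValueError("peer-post score must be between 0 and 30")

    raw = sum(instructor_scores) + peer_score
    # Posting on fewer than three distinct calendar days costs 10 points.
    if len(set(posting_days)) < REQUIRED_DAYS:
        raw -= DAY_PENALTY
    raw = max(raw, 0)  # assumption: the weekly total is floored at zero
    return raw, raw / 100 * WEEKLY_COURSE_PERCENT
```

For example, `weekly_participation([35, 35], 30, ["Wed", "Fri", "Mon"])` gives the full 100 raw points, which is 3.0 percent of the course grade.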



11. Academic Integrity

http://www.brandeis.edu/studentaffairs/srcs/ai/index.html

All students are expected to read and understand the guidelines posted on the Academic
Honesty and Student Integrity website listed above. If any part of this is not clear,
please contact your instructor immediately.

II. Course Information


Week 1 (Sep 18-24)

Introduction to biological data mining and modeling

On the differences between data mining and modeling

An introduction to regression

An introduction to model building

Understanding the predictive power of modeling

o When models go wrong

Overview of the field of biological data mining - applications, challenges, future
directions.


Week 2 (Sep 25 - Oct 1)

Uncertainty in Biology

Causes, concerns, approaches to deal with uncertainty.

Introduction to data visualization

Introduction to normalization

Introduction to high throughput biology

High throughput technologies

What can we reliably measure and what can it tell us about the cell?

a. Target-based compound screening,
b. Cell-based screening,
c. High content screens,
d. Large scale RNAi screens.

Statistical analysis of screens, Z and Z’ factor, data visualization and integration.
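
For orientation on the Z’ factor listed above: it is commonly defined as Z’ = 1 - 3(sd_pos + sd_neg) / |mean_pos - mean_neg|, computed from the positive and negative control wells of a screen. The Python sketch below is an illustration only, not course material; the function name and the control readings are made up for the example.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor of an assay from its positive and negative control readings.

    Uses the sample standard deviation of each control group.
    """
    mu_pos, mu_neg = mean(positive_controls), mean(negative_controls)
    if mu_pos == mu_neg:
        raise ValueError("control means are identical; Z' is undefined")
    spread = stdev(positive_controls) + stdev(negative_controls)
    return 1.0 - 3.0 * spread / abs(mu_pos - mu_neg)


# Hypothetical plate-reader values, invented for illustration.
pos = [100, 102, 98, 101, 99]
neg = [10, 12, 8, 11, 9]
print(round(z_prime(pos, neg), 3))
```

By a widely used rule of thumb, values between 0.5 and 1 indicate a well-separated assay, while values near or below 0 mean the control distributions overlap too much for reliable hit calling.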


Week 3 (Oct 2-8)

Unsupervised methods - Part I

Hierarchical clustering

Principal component analysis

Independent component analysis;

Introduction to transcription data

RNA profiling technologies, experimental design,

Data normalization,

Application of clustering and dimension reduction methods;


Week 4 (Oct 9-15)

Unsupervised methods - Part II

Unsupervised: hierarchical clustering, principal component analysis, independent
component analysis

Set enrichment methods,

Meta analysis of microarray data to build on identified patterns


Week 5 (Oct 16-22)

Supervised methods, model assessment and selection

Linear methods for regression and classification:

Linear discriminant analysis,

Logistic regression;

Naïve Bayes classifier;

Nearest-neighbor method


Week 6 (Oct 23-29)

Supervised methods, model assessment and selection - Part II

Regression and classification trees,

Neural networks,

Support vector machines

Model assessment and selection: AIC, BIC, cross-validation;


Week 7 (Oct 30 - Nov 05)

Meta methods

Boosting trees,

Model averaging and bagging,

Random forest


Week 8 (Nov 06-12)

Integration and meta-analysis of high throughput datasets

Biological database, set enrichment methods, text-mining

Proteomics

Review of proteomics technologies: 2D gels, mass spectrometry, protein arrays,
2-hybrid methods, post-translational modification detection,

Analysis of protein networks, network properties.

Protein pathways and their interaction

Protein pathway compendia

Pathway comparison metrics and applications

Data integration examples: relevance networks, machine learning.


Week 9 (Nov 13-19)

Principles of biological networks.

Reconstruction of networks - Graphical models:

a. Boolean networks,
b. Co-regulation networks,
c. Bayesian networks.

Dynamic network inference


Week 10 (Nov 20-26)

Mechanistic modeling of biological systems.

Principles of mechanistic modeling: mass balance, chemical reaction systems,
flux balance analysis

Deterministic and probabilistic modeling approaches to specific common
biological problems.

Written final project due in LATTE.

III. Course Policies and Procedures


I. Late Policies



Discussion responses will be accepted late with a 5 (raw) point deduction per day.



Homework assignments will be accepted late with a 5 (raw) point deduction per
day after the deadline.



Substantive responses to discussion posts will not be accepted after the
deadlines.


II. Work Expectations



Expect to spend about 2-4 hours per week reading the course material and
anywhere from 4-8 hours doing homework, responding to discussion posts, etc.

Plan ahead to make sure your tasks are completed in a timely manner.

The final project will be an amalgamation of tasks accomplished throughout the
course and will take an additional 4-24 hours of work.

I will post weekly deadlines for expectations for the week.



A cumulative list of all expectation deadlines will also be posted on Week 1.


III. Feedback



Homework and class participation grades will typically be posted within a week of
completion of tasks.


IV. Confidentiality



In the course of the class, some of you may want to post examples of real data
from your day jobs or other sources. Please remember that you must not share
any information that is confidential, proprietary or in any way embargoed from
public disclosure.

Please refrain from discussing your peers' work or your interactions with peers
outside the classroom.