FW4089: Bioinformatics (3 credits)
FW5089: Tools of Bioinformatics (4 credits)
Time: Every Tuesday and Thursday, 9:35 am to 10:55 am (3 hours)
Place: Forestry, Room No. 139
Note: Presentation of the class paper will be arranged sometime in early April 2006.
The final exam will be held sometime in the week of April 24 to 28, 2006.
Instructor: Shekhar Joshi (C. P. Joshi),
Associate Professor of Plant Molecular Genetics, SFRES
Room 168, Forestry, Phone: 487-3480 (cpjoshi@mtu.edu)
Office hours: 9 am to 6 pm except when I teach this class!
Teaching assistants: Shiv T. and Frank Xu (FMGB graduate students)
Course Description
The main purpose of this course is to provide extensive hands-on experience in using a
variety of bioinformatics tools, so that in the future you can extend that knowledge to
other fields of biology such as genomics, molecular phylogenetics, and biotechnology.
You will not write bioinformatics programs but will use the available ones for extensive
sequence analysis.
Why was this course proposed?
A number of sequence analysis packages and databases are currently available from
commercial sources as well as public web sites. In our day-to-day molecular biology
research, we use some of these programs and databases to analyze the significance of the
new genetic information that we obtain. But it is not always easy to choose the correct
approach or the appropriate tool. Databases are growing at a very fast pace and new
questions are constantly popping up. Moreover, genomics is a new and exciting field of
biotechnology that has recently witnessed many conceptual and technical advances. The
ability to make sense of this information explosion will make our students more
competitive in the current job markets in academia and industry. There is no doubt that
this knowledge will be extremely valuable for living in this century.
FW4089/5089
Tools of Bioinformatics
GENERAL TEXTBOOKS (Optional Reading material)
1) Genes VII, Benjamin Lewin, 2000, Oxford University Press
2) Molecular Biology, Robert F. Weaver, 1999, McGraw-Hill Press
3) Bioinformatics, David W. Mount, 2001, CSH Press
All these books will provide only supplemental material for the course and may be
available at the MTU Book Store or in the library.
Reading materials for the topics being covered in the class will be provided.
Although there is no specific prerequisite for this class, it is advisable to have taken at
least one of the following courses and to have some background in genomics and
bioinformatics:
BL4030: Molecular Biology
FW4087/5087: Plant Molecular Genetics
THIS COURSE WILL NOT TEACH YOU HOW TO WRITE PROGRAMS.
Bioinformatics Reference Books available in the MTU Library
Guide to Human Genome Computing (Second Edition) by M. J. Bishop
Call No. QH445.3 .G85 1998
Bioinformatics: The Machine Learning Approach by P. Baldi and S. Brunak
Call No. QH506 .B35 1998
Sequence Analysis in Molecular Biology by G. von Heijne
Call No. QP551 .H43 1987
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
by R. Durbin, S. Eddy, A. Krogh, G. Mitchison
Call No. QP620 .B576 1998
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
by Dan Gusfield
Call No. QA76.9 .A43 G87 1997
Introduction to Computational Biology by Michael S. Waterman
Call No. QH438.4 .M33W38 1995
Calculating the Secrets of Life by Eric Lander and Michael Waterman
Call No. QH438.4 .M3 C35 1995
Some internet addresses where Bioinformatics information is available:
National Center for Biotechnology Information (GenBank): http://www.ncbi.nlm.nih.gov/
Genetics Computer Group: http://www.GCG.com
Protein analysis: http://www.expasy.ch
Celera Genomics: http://www.celera.com
GRADING SYSTEM
Grade Scale
100-95   = A    Excellent
94-90    = AB   Very Good
89-85    = B    Good
84-80    = BC   Above Average
79-75    = C    Average
74-70    = CD   Below Average
69-60    = D    Inferior
Below 60 = F    Failure
Course Points
Homework, quizzes, etc. = 30%
Mid-term Exam 1 = 30%
Final Exam = 30%
Class Participation = 10%
Exams: The midterm and cumulative final exams will be worth 100 points.
Class Paper = One Credit for FW5089
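As a rough illustration of how the Course Points weights combine, here is a minimal shell sketch. The component scores are made-up numbers, each assumed to be expressed out of 100. Truncating to a whole number and treating a score of exactly 60 as a D are also assumptions, since the handout does not say how fractions or the 60 boundary are handled.

```shell
# Hypothetical component scores, each out of 100 (illustrative numbers only).
homework=88
midterm=76
final_exam=91
participation=100

# Weights from the Course Points list: 30% + 30% + 30% + 10%.
# Shell arithmetic is integer-only, so the result is truncated (an assumption).
total=$(( (30 * homework + 30 * midterm + 30 * final_exam + 10 * participation) / 100 ))

# Map a whole-number score onto the Grade Scale.
# Assumption: a score of exactly 60 counts as D.
letter_grade() {
  if   [ "$1" -ge 95 ]; then echo "A"
  elif [ "$1" -ge 90 ]; then echo "AB"
  elif [ "$1" -ge 85 ]; then echo "B"
  elif [ "$1" -ge 80 ]; then echo "BC"
  elif [ "$1" -ge 75 ]; then echo "C"
  elif [ "$1" -ge 70 ]; then echo "CD"
  elif [ "$1" -ge 60 ]; then echo "D"
  else echo "F"
  fi
}

grade=$(letter_grade "$total")
echo "Total: $total  Grade: $grade"
```

With these made-up scores the weighted total is 86, which falls in the 89-85 band (B).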
Jobs! Jobs! Jobs!
Current job trends: http://www.sloan.org/programs/scitech_page1.html
Jobs in genomics: http://www.genomejobs.com
See also Science and Nature for job ads.
Bioinformatics is a young science, but the information explosion has created demand for
more people in academia and industry. It is easy to find either a molecular biologist or a
computer scientist, but the job of a bioinformatician needs both. A biologist who can
compute and a computer scientist who can make sense of biological data are hot
commodities.
Supply and demand!
This is what I heard but do not quote me anywhere!
MS in Bioinformatics: 60-100K
Ph.D. in Bioinformatics: 80-100K or higher
Not all CS people find that money attractive! But those who are interested in the topic
do very well in this field. Biologists face new challenges and questions every day, and
CS is providing the answers.
True collaboration!
Having this course listed in your CV will help your job prospects.
http://www.bio.mtu.edu/campbell/bl4820/intro/plagiarism.htm
Plagiarism - What It Is and How to Avoid It!
Adapted from notes prepared by Ron Gratz
Scientists do not work in isolation from each other. Attendance at scientific meetings
exposes us to the work of our colleagues and allows for the free exchange of ideas.
Reading the published literature in our fields is vital for all scientists, who must keep
themselves current with what is being done in other laboratories. Scientists continually
refer to the work of their colleagues and most scientific research is based at least in part
on ideas derived from others. Review articles and textbooks are often wholly based on
already published work. It is thus necessary for you as developing scientists to learn how
to properly use previously reported knowledge.
While a free flow of ideas and information is vital to scientific progress, it also presents
avenues for fraud, particularly plagiarism. Plagiarism can be defined as "Taking the ideas
from another and passing them off as one's own" (Webster's New World Dictionary) and
is unacceptable under any circumstances. Despite this universal disapproval, it is one of
the more common faults with student papers. In some cases, it is downright dishonesty
brought on by laziness, but more often it is a lack of experience in how to properly use
material taken from another source.
To avoid plagiarism you must not only properly attribute the ideas of another but must
also either paraphrase what the original author said or wrote or you must enclose that
person's exact words in quotation marks. To use another's exact words with attribution
but without quotation marks implies that the ideas belong to the original source but that
the words are your own. Besides being dishonest, copying another’s work defeats the
purpose of your education. Writing about the subject you are studying is a great way to
learn. Ideas become more firmly implanted in your memory if you have to think about
them and then write a coherent statement using them. Copying another’s work prevents
you from learning, which is the whole purpose of your education.
Whenever the words or ideas of another individual are used, proper attribution must be
given. In other words, you must give credit for those ideas and words to their originator.
Not to do so is a clear case of plagiarism. Plagiarism in classwork may result in a failing
grade or even expulsion from the university. Plagiarism in professional work may result
in dismissal from an academic position, being barred from publishing in a particular
journal or from receiving funds from a particular granting agency, or even a lawsuit and
criminal prosecution.
In a review article, the author attempts to summarize all of the pertinent work done in a
particular field of study. The goal is generally twofold: (1) to report what has been done
and what has been learned; and (2) to use this knowledge to generate general conclusions
based on these previous works. The author of a review article must be able to present the
cited work accurately and be able to synthesize new ideas from this work. In order to
accurately represent the work of others and at the same time avoid plagiarism, the author
of a review will often paraphrase the statements made in the cited work.
The problem for many students, and some professional scientists, is that they do not know
how to properly paraphrase another's words. Several general rules for paraphrasing that
are relevant for students learning to master this skill are:
1. You should change both the sentence structure and the non-technical terms in order to
avoid plagiarism.
2. You can also avoid plagiarism by altering the sequence of subject matter within and
between sentences.
3. Don't paraphrase technical terms unless you are certain of their exact meaning and can
provide an exact equivalent.
4. Accredit the original author within the group of sentences using his/her work.
FW4089 and FW5089: Bioinformatics questionnaire
Your name:
ID number:
Department:
Graduate student/Undergraduate:
Name of Advisor if Graduate student:
Motivation for taking this course:
Previous experience with Unix, GCG or other sequence analysis packages:
What do you expect to get out of this course?
Have you understood the problems of plagiarism? Yes No
Do you know what my office hours are? Yes No
Are you clear about grading policy? Yes No
First QUIZ of Plant Bioinformatics
Date: January 10, 2006
Write one-line answers to as many questions as possible in the next 45 minutes. Feel free
to refer to books, the web, etc. This will not be counted towards your grade. I just want
to know where you stand with your molecular biology background:
1. DNA stands for
2. RNA stands for
3. DNA is made up of
4. RNA is made up of
5. What is the difference between deoxyribose sugar and ribose sugar?
6. What are the different types of nitrogen bases in DNA?
7. What are the different types of nitrogen bases in RNA?
8. What is the difference between purines and pyrimidines?
9. Name two purines and three pyrimidines.
10. Which purine pairs with which pyrimidine? State the number of H bonds between each pair.
11. What are the differences between DNA and RNA?
12. What are transcription and translation?
13. What is the central dogma of molecular biology?
14. What is reverse transcription?
15. What is a prokaryote?
16. What is a eukaryote?
17. What are the differences between prokaryotes and eukaryotes?
18. What is a genome?
19. What is genomics?
20. How many genomes are present in viruses, prokaryotes, plants and animals? Where?
21. What is bioinformatics?
22. What is the biological name for humans (binomial)?
23. How big is the human genome?
24. How many chromosomes are there in a human diploid and haploid cell?
25. How are human genes arranged in the genome?
26. How many human genes are there?
27. What proportion of the human genome is made up of genes?
28. What is a gene?
29. Why are eukaryotic genes said to be split?
30. How does DNA replicate? Conservatively or semi-conservatively? What is the difference?
31. How does DNA make RNA?
32. How many types of RNA are produced in a cell?
33. How many of these RNAs are said to be protein coding?
34. What is pre-mRNA? Is it present in bacteria?
35. What are the main three steps in pre-mRNA processing?
36. What are the 5’ leader and 3’ trailer sequences in pre-mRNA?
37. What is the difference between exons and introns?
38. How are introns spliced out?
39. Why are introns there?
40. How is the transcription process regulated in prokaryotes?
41. How is the transcription process regulated in eukaryotes?
42. What are the TATA box and the AATAAA box?
43. What is a transcription factor?
44. Why is TFIID said to be a commitment factor?
45. What is a transcription start site?
46. What is polyadenylation? Why is it an important biological process? Is it present in bacteria?
47. Describe the process of polyadenylation.
48. Define “protein”. In what alternative forms are proteins present in a cell?
49. How many types of amino acids are typically present? Name five amino acids. What are their 3-letter and 1-letter codes?
50. How is a code present in DNA used to make proteins?
51. Do you believe that the genome is life’s instruction book? Why?
52. If you have a disease gene (what does that mean?), do you always get the disease?
53. What is a mutation? Name a few types of mutations.
54. What are the translation start and stop sites?
55. What is tRNA?
56. What is rRNA?
57. What is a ribosome?
58. What is the genetic code? Who discovered it? (Bonus)
59. Is the genetic code universal? What does it tell us about our evolution?
60. Why is the code said to be made up of triplets?
61. What is codon bias?
62. What is the wobble hypothesis?
63. Who discovered the structure of DNA?
64. What is reverse transcription? Who discovered it?
65. Do you believe that viruses are the most evolved organisms? If yes, why? If not, why not?
66. What are mitosis and meiosis?
67. What are the main steps in mitosis? How many cells are produced at the end of one cycle of mitosis?
68. What are the main steps in meiosis? How many cells are produced at the end of one cycle of meiosis?
69. What is recombination?
70. Do bacteria recombine?
71. What is DNA sequencing? Who discovered it?
72. What are dideoxynucleotides? Why are they important in sequencing?
73. How can you sequence a gene?
74. Why is a DNA sequence written on only one line when DNA is double stranded?
75. Which DNA strand is always denoted when writing a gene sequence?
76. How can you derive which protein a gene encodes by just looking at a gene sequence? (BONUS).
Bioinformatics and The Human Genome
The human genome is the biggest gift of science to humanity.
In 2001 we achieved something new that we had only dreamed of for many years.
The human genome is just the beginning of our exciting and sometimes fearful journey.
Fear of the unknown lurks, but the promise of tomorrow is also bright and vivid.
Sequenced organisms (from Science 291, Feb 2001, p. 1178)
Organism                       Genome size   Year completed   No. of genes
H. influenzae                  1.8 Mb        1995             1,740
S. cerevisiae (yeast)          12.1 Mb       1996             6,034
C. elegans (worm)              97 Mb         1998             19,099
A. thaliana (thale cress)      100 Mb        2000             25,000
D. melanogaster (fruit fly)    180 Mb        2000             13,061
H. sapiens (human)             3000 Mb       2001             35,000-45,000
Rice… poplar… mouse… more than 200 genomes have been sequenced and the list is
ever-increasing.
The human genome was a dream for which thousands of scientists worked for over 15 years.
Celera and HGP provided two books for the price of one. Celera achieved it in 3 years but
depended heavily on public data. How did we do what we set out to do? That is what is now
written in the Science and Nature articles.
What it means is still unknown.
They say that pages equivalent to 200 New York telephone books would be needed to print
the 3 billion bp of genome in each cell. But the Internet allows this easily.
Humans were supposed to have 100,000 genes, but it seems that only about 32,000 are likely.
Does that make humans less powerful or inadequate in any way?
No. “The purpose of science is to find meaningful simplicity in the midst of complexity”,
Herbert Simon (Nature 409, 771, 2001). DNA structure and PCR are the best examples.
One gene works harder, at many places and many times. So less is better in that
crammed nuclear space.
Alternative splicing.
Human proteins have the same domains as worms, but the way these domains come together
is unique.
We will know one day what makes up a human.
We are all unique! In sexually reproducing organisms, the entire ensemble of genes
comes together in one organism only once. One genotype occurs only once.
There are also some surprises in the human genome!
SNPs accumulate with a specific pattern
Regulatory CpG islands occur more in gene-rich regions than in gene-poor regions
TEs occur in gene-poor regions
Only 1.1-1.5% of the genome is coding, not even the 3% widely estimated earlier
Parts of chromosome 12 in men and chromosome 16 in women are recombination prone.
Repetitive DNA is only 40-45%
Humans share 223 genes with bacteria that are absent from the worm, fly and yeast genomes.
Did the genome duplicate early on, as in plants?
We will know how humans develop from a zygote: ontogeny
We will know our phylogeny by looking at ontogeny: molecular archeology
One day we will be able to trace our evolution using the genome information.
Genealogy of the human race!
CLASS PAPER (1 credit worth of extra work)
Each of you will select a different gene family from the human genome and write an essay on
How to build a better human?
You will also present your research findings to the class. You may select either a human
disease or a trait that you are interested in studying further. Collect all necessary
background information and the genes associated with your topic. Find the counterparts
of your gene of interest in other organisms and develop a phylogenetic tree.
You are expected to use as many of the bioinformatics programs that you learned in this
class as possible to create a comprehensive database of the genes that you have selected.
Important: Provide me with a list of all reference works (printed materials and web site
addresses) that you used. Write in your own words. I plan to put your essays and
databases on the web, so make sure that you cannot be accused of plagiarism. See the
handout for more information on plagiarism.
For FW5089: You have to do one more extra project to earn the fourth credit. I will
discuss this separately with you all.
FW4089: How to use GCG in the GIS lab?
Sit at any computer and shake the mouse to wake the computer up.
Press Control-Alt-Delete and then enter your username and password (first initial of
your first name and first 7 numbers of your ID).
Your user IDs may be the MTU ones.
You will follow this procedure every time you come to class (unless things change in the
next few days due to the new arrival of GCG on the Mango server):
Go to telnet and connect to oak by typing
telnet oak.ffr.mtu.edu
You will get a login window: type your login name and enter your password. You should see
oak%
Type
source /gcg/gcgstartup
then hit return.
Then type
gcg
You should see the GCG logo!
Start using GCG programs!
For GCG manuals go to:
http://forestry.mtu.edu/manuals/gcg/index.htm
Tutorial on using Unix:
Useful Unix Commands: GCG is unfriendly!! It is not Mac or PC based.
Not for distribution. For personal use only.
Login: connect (telnet) to oak, the server where GCG is loaded!
Type the password correctly and press Enter.
You should see
oak%
Logout: Do not forget to log out at the end of the session. Nothing saved will be lost.
Important note: Do not give your username or password to anyone. If someone wants to
use it for GCG, ask him or her to contact his or her supervisor and then me. Any
unauthorized use will cost you your GCG privileges.
UNIX Commands
UNIX commands are entered at the prompt> and delivered to the system with the
<RETURN> key.
UNIX commands have a syntax, just like any language; there is a correct order for the
words in a command, and MANY incorrect orders. Mix up the order, and UNIX is
unlikely to be clever enough to understand what you want it to do!
It is a dumb computer!
The most general form of UNIX command syntax is
prompt> command -flag(s) argument(s)
prompt = oak%
The command is WHAT you want to do, the -flags help refine the command, saying
HOW you want it done, and the arguments tell the OBJECT of the command - the things
to be acted upon.
UNIX expects all of its commands to be lower-case, though flags and arguments may be a
mixture of cases. Remember, UNIX is case-sensitive!
As a trivial example, suppose you wanted to translate the following English request
"Would you please quickly shovel the snow in the driveway today?"
into UNIX. The translation might look something like
prompt> shovel -quickly -today snow
In fact, given the absence of vowels and longer words from most UNIX commands and
flags, the actual command is more likely to be
prompt> sw -f -n snow
where sw is short for shovel, -f is short for fast (=quickly), and -n is short for now
(=today).
For a genuine example of a UNIX command, consider
mango% ls -la Dirname
Here, ls is short for list, -l is short for long (=all details), and -a is short for all (=all files,
even the hidden ones). Dirname is the name of the directory of files for which you want
the listing.
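To see the command / flag / argument pattern in action without touching your own files, here is a small sketch that can be typed at the prompt. It assumes the mktemp utility is available (it is on most current UNIX and Linux systems, but that is an assumption about your server), and the file names are made up for illustration.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
# Assumption: the mktemp utility is installed.
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden.txt"

# Command + argument only: names of the ordinary (visible) files.
plain_listing=$(ls "$demo_dir")

# Command + one flag + argument: -a also shows hidden files (names starting with a dot).
all_listing=$(ls -a "$demo_dir")

# Command + two flags + argument, the same shape as "ls -la Dirname".
ls -la "$demo_dir"

echo "$plain_listing"
echo "$all_listing"
```

Comparing the two saved listings shows what the -a flag adds: the hidden file appears only in the second one.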
Finally, when using GCG commands in UNIX, there is one important "feature" for
the arguments; the case you use for the names of database entries is unimportant,
but all filenames must be in lower case and typed or copied and pasted correctly.
Text files
Data on computers (text, programmes, sequences etc.) is held in blocks of information
called 'files'.
Different files have different names and/or different locations
-
and there is a convention
that filenames end with a three
-
letter extensi
on that indicates the type of data
held in the file, e.g., .txt for text, .seq for sequences, .pep for peptides, .dat for generic
data, etc.
Files can be created, deleted, altered, overwritten, moved around, copied, renamed,
printed out to a screen or a
printer, searched, compared, sorted, counted and transferred
over the network to computers on other sites.
Some UNIX commands for file management:
touch filename - create a file [ holding no information! ]
pico filename - edit the file using the pico editor [ use <CTRL> X to exit ]
cp filename newfilename - copy a file to a new file [ retains the old file ]
mv filename newfilename - move (rename) a file to a new file [ deletes the old file ]
cat filename - concatenate (print) a file's contents to the screen
more filename - print a file's contents to the screen, one page at a time [ use
<SPACE> to see the next page ]
cat filename1 filename2 > filename3 - concatenate (print) the contents of the first two
files into the third
rm filename - remove (delete) the file [ dangerous to use with the wildcard * ]
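The file commands above can be tried safely in a scratch directory. The sketch below is only an illustration: it assumes the mktemp utility is available, uses echo with ">" in place of an editor so that it runs without interaction, and all file names are made up.

```shell
# Work in a scratch directory (assumption: the mktemp utility is available).
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

touch notes.txt                      # create an empty file
echo "UNIX is EASY!" > notes.txt     # add one line of text without opening an editor
cp notes.txt copy.txt                # copy: notes.txt is still there
mv notes.txt renamed.txt             # rename: notes.txt is gone, renamed.txt replaces it
cat renamed.txt copy.txt > both.txt  # join the two files into a third
cat both.txt                         # print the result to the screen

# Wildcards: check what * matches with ls BEFORE handing the same pattern to rm.
touch gene1.seq gene2.seq
matched=$(ls *.seq)
echo "$matched"
rm *.seq                             # deletes exactly the files just listed

remaining=$(ls)
echo "$remaining"
```

Listing with the same pattern first is a cheap habit that shows exactly which files rm is about to delete.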
Exercise DNA Analysis - UNIX 1: create and manage files
Create a file named easyunix.txt
prompt> touch easyunix.txt
(NB: you may use any UNIX text editor you like - pico is probably the simplest, but we
will use vi today)
prompt> vi easyunix.txt
Edit the file: press i to start inserting text and enter "UNIX is EASY!". Press <ESC>, then
exit by typing :x (lower case), which saves the changes.
Print easyunix.txt to the screen.
prompt> more easyunix.txt
Copy easyunix.txt to the file opinion.txt (How would you do this with cat? Hint!)
prompt> cp easyunix.txt opinion.txt
Rename easyunix.txt to unixcmds.txt
prompt> mv easyunix.txt unixcmds.txt
Edit the file unixcmds.txt with the vi editor. Move down the screen with the arrow cursor
keys and type what you now know about UNIX. Exit and save the new changes.
prompt> vi unixcmds.txt
Print unixcmds.txt to the screen to see how clever you have become.
prompt> more unixcmds.txt
Delete opinion.txt.
prompt> rm opinion.txt
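For the cat hint in the exercise above, one possible approach is to redirect cat's output into a new file. This is a sketch of that idea, not necessarily the answer the exercise has in mind. It assumes the mktemp and cmp utilities are available and creates its own easyunix.txt in a scratch directory, so your real files are not touched.

```shell
# Scratch directory with a stand-in for the exercise file (assumption: mktemp exists).
scratch=$(mktemp -d)
cd "$scratch" || exit 1
echo "UNIX is EASY!" > easyunix.txt

# cat prints the file; ">" sends that output into a new file instead of the screen.
cat easyunix.txt > opinion.txt

# cmp -s prints nothing and succeeds only when the two files are identical.
if cmp -s easyunix.txt opinion.txt; then
  copy_status="identical"
else
  copy_status="different"
fi
echo "$copy_status"
```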
Directories
A directory is a group of files or other directories. A directory within another is often
called a sub-directory, to reflect this hierarchical organization.
Directories can be created, copied, deleted, renamed, searched and transferred over the
network to computers on other sites. Files can be moved between or copied among
specified directories.
You work in one directory at a time. This is known as the present working directory. The
directory you begin with when you login is your home directory.
pwd: print working directory
You can easily return to your home directory from any other directory by giving the
UNIX command "cd" with no argument.
Some UNIX commands for directory management:
cd dirname - change to the directory named dirname
cd .. - change to the directory above the present one [ ".." = up ]
cd - change to your home directory [ the default argument for cd is your home
directory ]
ls - list the files in the present working directory
ls -l - a file list that is longer, more detailed
mkdir subdirname - make (create) a new sub-directory in the present directory
rmdir subdirname - remove (delete) a sub-directory in the present directory [ the
sub-directory must be empty ]
mv filename dirname - move a file into a sub-directory
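The directory commands above can be strung together as in the following sketch. It runs in a scratch directory created with mktemp (assumed to be available), and it uses basename, which is not covered in this handout, only to shorten the path that pwd prints.

```shell
# Scratch area so your home directory is left alone (assumption: mktemp exists).
base_dir=$(mktemp -d)
cd "$base_dir" || exit 1

mkdir Unixinfo                 # make a sub-directory
touch unixcmds.txt             # a file to move around
mv unixcmds.txt Unixinfo       # move the file into the sub-directory

cd Unixinfo || exit 1          # step down into it
here=$(basename "$(pwd)")      # pwd prints the full path; basename keeps the last part
echo "Now in: $here"
ls -l                          # long listing of the present directory

cd ..                          # step back up one level
rm Unixinfo/unixcmds.txt       # empty the sub-directory first...
rmdir Unixinfo                 # ...because rmdir only removes empty directories
```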
Exercise: create and manage directories
Create a sub-directory named Unixinfo
prompt> mkdir Unixinfo
Switch your present working directory to the new sub-directory
prompt> cd Unixinfo
Check to see you are there
prompt> pwd
Move a file from the directory above into your new present working directory (".." is a
short form for the directory above, and "." is a short form for the present directory)
prompt> mv ../unixcmds.txt .
Has the file moved? It should occur only in the second list (";" separates the two list
commands)
prompt> ls -l .. ; ls -l
Get back to your home directory
prompt> cd