
Bio/CS-251 Laboratory 2 Jan 31, 2007

Examining A Single DNA Sequence

Christine Frielle

January 31, 2007



For this first lab using the bioinformatics tools that are found on the web we will follow the last part of Chapter 5 of Bioinformatics for Dummies, henceforth abbreviated as BFD. The first part of the chapter deals with “cleaning up” a sequence of DNA that a microbiologist may have collected in the lab and also with designing PCR primers. We will discuss this latter topic at a future date as we approach our “wet lab” exercise. For now, we will “borrow” a known data sequence from the NCBI web page:


http://www.ncbi.nlm.nih.gov/


The gene that we choose is the mutS/hMSH2 DNA repair gene. In addition to following the readings and guided steps on pages 151-159, we will ask you to answer some questions related to your findings.


First we give some background on this gene


mutS is the name given to a prokaryotic (bacterial) defender of the genome. (“mut” is an abbreviation to reflect the increased rate at which DNA mutations accumulate in cells that do not have this critical gene).


This gene is universal in that it is found in virtually every organism, both prokaryotic and
eukaryotic.


MSH2 is the name given to the eukaryotic (algae and fungi, plants and animals) version of this gene (“MSH” is an abbreviation that means “MutS Homolog”). The term “homolog” means that MSH2 looks and acts like the mutS gene, i.e., its structure is similar to mutS and it plays a similar role in preventing mutations from occurring.


hMSH2: the prefix h in front of the gene name indicates that it is the human version of the gene.


Before we begin the lab, read Analyzing DNA Composition on page 151 of BFD and answer these questions.


Q1: We are analyzing a single sequence of DNA that represents the entire sample of DNA that was obtained in the hypothetical lab. This sequence obviously represents only one strand of the DNA that was extracted in the lab. The single strand of DNA is denoted as cDNA (complementary DNA). How is cDNA created? HINT: read the preface to Chapter 5 or GOOGLE cDNA.




cDNA stands for complementary DNA. It is a DNA copy of messenger RNA, synthesized by the enzyme reverse transcriptase, and is single stranded.




Q2: Why is the pairing between guanosine and cytosine nucleotides more stable than the pairing between adenosine and thymidine?



The G-C pairings are more stable because they are connected by three hydrogen bonds. The A-T pairings are only connected by two hydrogen bonds.


Q3: If we know the G+C count, can we find the frequency of all of the bases in the sample of DNA that was obtained in the lab? How is this done?



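Pairings can be either G-C or A-T. If the frequency of G-C pairings is known, the remainder of the sample must be composed of A-T pairings. In double-stranded DNA, G and C each make up half of the G+C fraction, and A and T each make up half of the remainder. Computer programs such as EMBOSS can also be used.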
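The reasoning in this answer can be written out as a short calculation. The sketch below assumes double-stranded DNA, where Chargaff's rules give G = C and A = T. The function name `base_frequencies_from_gc` and the 40% input are illustrative assumptions, not part of the lab.

```python
def base_frequencies_from_gc(gc_fraction):
    """Estimate per-base frequencies from the G+C fraction.

    Assumes double-stranded DNA, so that G = C and A = T (Chargaff's rules).
    """
    if not 0.0 <= gc_fraction <= 1.0:
        raise ValueError("gc_fraction must be between 0 and 1")
    g = c = gc_fraction / 2
    a = t = (1.0 - gc_fraction) / 2
    return {"A": a, "C": c, "G": g, "T": t}


# Illustrative input: a sample whose G+C content is 40%
freqs = base_frequencies_from_gc(0.40)
for base in "ACGT":
    print(base, round(freqs[base], 3))
```

This shortcut does not apply to a single strand, where the counts of G and C (or A and T) need not match.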


OK, now on to the lab procedures


Procedure: Collect your sequence from NCBI


Go to the NCBI web site for GenBank given in the URL at the top of this page.

a. From the “Search” pull down menu, choose “Gene”

b. In the “For” window type “hMSH2” and click “Go”

c. Several references to the human versions of this gene are listed. Choose the second entry, MSH2. Click on this entry.

d. You will be taken to a page that contains a variety of information about research that has been done on this gene. Peruse this page.


Q4: What is the complete name of this entry?


MSH2 mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli) from Homo sapiens


Q5: How many papers have been written about this particular entry? HINT: You will need to go to PubMed for this information. Follow the Links!


168 papers have been written about this entry.



Q6: As you scroll down the page you will come to a link to the GenBank page that contains the DNA sequence itself. How many base pairs long is the sequence for this entry?


80098 base pairs



e. Scroll back up to the top of the GenBank page and from the Display pulldown menu choose FASTA.

f. A new page will appear that contains the name of the entry and the listing of the nucleotides in sequential order, but in a different format from the one at the bottom of the previous page. Copy all of this information into a Word document that you will save in your workspace as MSH2.doc. You are now ready to begin your analysis.
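For readers who would rather handle the FASTA text with a script, the sketch below shows one way to separate the header line from the nucleotide lines. It assumes the text has been saved or pasted as plain text (not as a Word document); the function name `read_fasta` and the sample record are made up for illustration.

```python
def read_fasta(text):
    """Return (header, sequence) from the text of a single-record FASTA file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    header = lines[0][1:]
    # Join the remaining lines and normalize to upper case
    sequence = "".join(lines[1:]).upper()
    return header, sequence


# A made-up two-line record, not the real MSH2 sequence
example = """>example_record hypothetical entry
acgtacgtac
GGTTAA
"""
header, seq = read_fasta(example)
print(header)
print(seq, len(seq))
```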


Procedure: follow pages 152 in BFD

Counting Words in DNA Sequences


Purpose: to find the count for each of the nucleotides found in this sequence and also to find the count for each of four significant triplets found in the sequence.
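The counting task described in this purpose statement can be sketched in a few lines. This is not the web tool used in BFD; it is a minimal illustration that assumes triplets are counted at every position (overlaps allowed, all reading frames), which may or may not match how that tool counts. The helper names and the toy sequence are hypothetical.

```python
from collections import Counter


def count_bases(seq):
    """Count each of A, C, G, T in the sequence."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def count_word(seq, word):
    """Count occurrences of `word` at every position, allowing overlaps."""
    seq, word = seq.upper(), word.upper()
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


# Made-up toy sequence, not the real MSH2 data
toy = "ATGAAATGATAGTAA"
print(count_bases(toy))
for triplet in ("ATG", "TAA", "TAG", "TGA"):
    print(triplet, count_word(toy, triplet))
```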


g. After you obtain your result that will be formatted like Figure 5-4, copy and paste it on a new page of your MSH2.doc.


Screen shot attached at end.



Q7: What is the total G+C count for this sequence? Why are the percentages of G and C that are shown so different? Is this a violation of Chargaff’s rules?


The G-C content is 41.60%. The percentage for G is 21.33% and the percentage for C is 20.27%. This is not a violation of Chargaff’s rules because the sequence is for a single stranded cDNA sequence. If the sequence were for a double stranded DNA segment, it would be expected that the percentages would be equal.



Q8: Give the total count for each of the nucleotides in the strand of DNA represented by this sequence.


A: 20890

C: 16235

G: 17088

T: 25885
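As a quick arithmetic check, the four counts above can be turned back into the sequence length and the percentages quoted in the Q7 answer. The snippet below is only a sketch of that calculation using the numbers reported in this worksheet.

```python
# Nucleotide counts as reported in the Q8 answer
counts = {"A": 20890, "C": 16235, "G": 17088, "T": 25885}

total = sum(counts.values())
gc_percent = 100 * (counts["G"] + counts["C"]) / total

print("total bases:", total)
print("G+C %:", round(gc_percent, 2))
for base in "GC":
    print(base, "%:", round(100 * counts[base] / total, 2))
```

The total agrees with the 80098 base pairs recorded in Q6, and the percentages round to the 41.60%, 21.33% and 20.27% given in Q7.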










Q9: As you will learn, the triplets ATG, TAA, TAG, and TGA can have a special significance in DNA sequences. What is the frequency of each of these triplets in the sequence that you just processed?


ATG: 1344

TAA: 1551

TAG: 1173

TGA: 1474


Procedure: Follow the instructions on pages 153-154 of BFD


Purpose: To search our sequence for the occurrence of any highly unusual repeat of a long word (> 3 nucleotides in length)


The people who did the statistical analysis for the program BLAST (which we will begin using next week) said that the chance of any sequence of length 11 being repeated solely by random assignment of the four letters (A, C, G, or T) falls below any reasonable level of statistical significance. Therefore, we may conclude that the repeat of an 11-letter word is a significant finding in our sequence. We will look for a repeated sequence, but not push it as far as 11. We will go with 5.
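A back-of-the-envelope calculation helps show why word length matters here. The sketch below assumes a uniform random model in which each of the four letters is equally likely and independent at every position. That is a simplification (real DNA is not uniform) and it is not the statistical analysis behind BLAST. The function name is hypothetical; the sequence length is the one recorded in Q6.

```python
def expected_word_count(seq_length, word_length, alphabet_size=4):
    """Expected occurrences of one specific word in a uniform random sequence."""
    windows = seq_length - word_length + 1  # number of overlapping positions
    if windows <= 0:
        return 0.0
    return windows / alphabet_size ** word_length


n = 80098  # length of the MSH2 entry recorded in Q6
for k in (5, 11):
    print(k, 4 ** k, expected_word_count(n, k))
```

Under this model a specific 5-letter word would be expected about 78 times in a sequence of this length, while a specific 11-letter word would be expected about 0.02 times, so even a single repeat of an 11-letter word stands out.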


h. Follow the instructions on pages 153-154 using a word length of 5. You will have to recopy the sequence for MSH2 that you saved in your Word document. NOTE: In instruction 3 there is no link at “Codon usage, composition”. Just find that section on the web page and go to instruction 4 on page 153.
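The same kind of search can be sketched in code: slide a window of the chosen word length along the sequence, tally every word, and keep those at or above a threshold. This is an illustration only, not the web tool referenced in BFD; the function name `frequent_words` and the toy sequence are made up, and overlapping occurrences are counted.

```python
from collections import Counter


def frequent_words(seq, k, min_count):
    """Return (word, count) pairs for k-letter words seen at least min_count times."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    hits = [(w, c) for w, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically for a stable listing
    return sorted(hits, key=lambda wc: (-wc[1], wc[0]))


toy = "TTTTTTTACGTTTTT"  # made-up toy sequence
print(frequent_words(toy, 5, 2))
```

For the lab itself the call would use a word length of 5 and a threshold of 200 on the full MSH2 sequence.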


Q10: How many 5-letter words are repeated 200 times or more in the sequence for MSH2?


There are 32 5-letter words that are repeated 200 times or more in the sequence.















Q12: List (Copy and Paste) these sequence(s).


TTTTT  1270
AAAAA  589
ATTTT  482
TTTTA  415
TTTTG  346
TATTT  343
TTTGT  289
TTTAA  283
TTTCT  277
TTTAT  273
AATTT  270
TGTTT  269
TTATT  266
CTTTT  266
GTTTT  255
GCTGG  245
AAAAT  245
TTTTC  242
TCTTT  242
TTCTT  240
CCTCC  235
TTGTT  235
TAATT  234
TTAAA  226
TAAAA  226
CTGGG  217
AAATT  214
GCCTC  212
TGGGA  205
TATAT  205
TTTGA  202
TCCCA  201



Procedure: Using a Dot-Plot to spot long words in a sequence.


Purpose: To provide a streamlined visual method to perform the task of the previous procedure.
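To make the idea of a dot plot concrete, here is a minimal text-based sketch. It compares a sequence against itself and marks every pair of positions where words of a given length match. It is not the graphical tool used in BFD; the function name `dot_plot`, the word length of 3, and the toy sequence are assumptions chosen for illustration.

```python
def dot_plot(seq_x, seq_y, word=3):
    """Return rows of a text dot plot: '*' where words of length `word` match."""
    seq_x, seq_y = seq_x.upper(), seq_y.upper()
    nx = len(seq_x) - word + 1
    ny = len(seq_y) - word + 1
    rows = []
    for j in range(ny):
        wy = seq_y[j:j + word]
        rows.append(
            "".join("*" if seq_x[i:i + word] == wy else "." for i in range(nx))
        )
    return rows


toy = "ACGTACGT"  # made-up toy sequence compared against itself
for row in dot_plot(toy, toy, word=3):
    print(row)
```

The main diagonal always appears when a sequence is compared with itself; marks off the main diagonal are the repeats of interest.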


i. Follow the instructions on pages 155 and 156 of BFD. The web page will not download with your graph so scroll up so that the entire graph appears on your screen. Then press ALT and Print Scrn at the same time. This will copy the window that displays the graph. Paste (Ctrl and V) this on a new page of your WORD document. Save this document in a folder called Lab 1 on your H drive. You should also save this completed Lab worksheet in that folder.



Sheet views are attached with different window sizes.


Q13: Does this dot plot show any repeated word of significant length? Think carefully before you answer this question.


The dark areas on the dot plot show words that are repeated a significant number of times. The darker the area, the more times the word is repeated. This makes sense because, earlier, 32 five-letter words were found to be repeated 200 times or more.


An example of a repeated sequence with tragic consequences


Procedure: Using OMIM (Online Mendelian Inheritance in Man) to examine a genetic disease caused by repeat sequences


Purpose: Learn how to navigate OMIM



j. Go to http://www.ncbi.nlm.nih.gov/. Under “Search”, choose “Gene”, and type “HD” into the search box. Open Link #2, “HD”. Read the Summary, and then scroll to the bottom of the page. Under NCBI Reference Sequences (RefSeq), open the link to the mRNA sequence (NM_002111), then under “Display”, choose FASTA.


k. Examine the first six lines of the mRNA, and in the space below, record a triplet sequence that is repeated in tandem more than 10 times:



Q14: Record your triplet repeat here:


The triplet CAG is repeated, in tandem, more than 10 times.



Q15: How many times is the triplet repeated (how many copies of the triplet?)


The triplet is repeated 21 times.
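Counting tandem copies by eye is error-prone, so a small script can serve as a cross-check. The sketch below finds the longest run of back-to-back copies of a repeat unit using a regular expression. The function name `longest_tandem_run` and the toy sequence are hypothetical; the real input would be the FASTA sequence for NM_002111.

```python
import re


def longest_tandem_run(seq, unit):
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    pattern = re.compile("(?:" + re.escape(unit.upper()) + ")+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)


toy = "GGCAGCAGCAGTTCAGCAGA"  # made-up toy sequence, not NM_002111
print(longest_tandem_run(toy, "CAG"))
```

In the toy sequence there are two runs of CAG, of three and two copies, so the function reports the longer one.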


l. Return to the NCBI Entrez Gene page for the HD gene. Under “Additional Links”, select MIM:143100, and open this link to the OMIM database for the HD gene. You will find that this is a long and detailed summary of everything that is known about the HD gene and its pathology. Answer each of the following questions briefly.





Q16: What disease is caused by alterations in the HD gene? What organ system is affected by this disease? (You may wish to view the “Clinical Synopsis” from the Table of Contents along the left border of the page)


Huntington Disease

Affects the central nervous system

Also with behavioral and psychiatric manifestations




Q17: From the Table of Contents, select “Allelic Variants”, read this section, and answer the following question: What is the molecular genetic basis for the disease? Explain how repeat sequence variation is responsible for this disease.


The nucleotide sequence CAG is located in the coding region of the gene for Huntington disease. The sequence is repeated between 9 and 37 times in normal individuals, but between 37 and 86 times in affected individuals.

Because the sequence is repeated in a coding section of the gene, the protein will contain extra amino acids that would not normally be present. These extra amino acids affect the protein and its functions in the cells.





I affirm that I have upheld the highest principles of honesty and integrity in
my academic work and have not witnessed a violation of the Honor Code.






Christine Frielle