1 Oct 2013


Sequence Complexity: Masking Low Quality Regions and Measuring
Complexity Across Species and GenBank Contributors

By Gabriel Proulx

Abstract


Sequencing genomes still requires considerable equipment and resources. While fully sequencing all known species is the end goal, ESTs can suffice in the meantime as an easier and less resource-intensive method of mapping out a species’ expressed genes. ESTs, however, have their disadvantages. They are limited in both length and quality. Often, due to reading errors, polymerase slippage, or contamination, ESTs include low-quality segments. While there are several methods to remedy the low quality of EST reads, they often focus on finding repeated segments rather than general low complexity. We feel that through a simple algorithm measuring complexity (the number of changes from one nucleotide to another as one iterates through a sequence) we have improved upon the existing programs and will hopefully allow for better alignment of ESTs. It could also be of biological significance to analyze the complexity distributions of several model organisms’ ESTs stored on GenBank.

Introduction


While working with and analyzing ESTs from Gossypium raimondii, a species of cotton, BYU’s Dr. Joshua Udall noticed that many of the DNA reads appeared to display low complexity. This is characterized by long contiguous strings of the same DNA base rather than the more random-looking sequence that would be expected. It is known that ESTs are often ridden with various problems from “sequencing artifacts, contamination, low quality sequence and genomic repeats” (Malde, Schneeberger, Coward, & Jonassen, RBR: library-less repeat detection for ESTs 2232). Regardless of the cause, sequences of low complexity hinder the ability to align EST sequences together. Because of the inexpensive and useful nature of EST sequencing and mapping, and the prohibitive cost of full genomic sequencing, it is desirable to improve EST clustering by throwing out sequences of low complexity.


Removing sequences of low complexity necessitates standards for what is considered complex. Rather than setting an arbitrary threshold of complexity, a more statistically sound method was used in this project. We gathered a set of gene sequences from either the species in question or a similar species, and analyzed their complexity. Statistical methods were then used to identify the mean, the variance, and ultimately a threshold of complexity. With this statistical threshold known, if a certain EST fails to meet it, the researcher can throw out the low-complexity sequence with a statistical level of confidence that it does not conform to the data already online in GenBank.


Furthermore, it is known that “species differ vastly in their genomic composition, from 74% G+C (Micrococcus luteus) down to only 25% G+C (Mycoplasma capricolum)” (Knight, Freeland, & Landweber 55). But do these differences in genomic composition also relate to variance in sequence complexity? By keeping the program created to identify a complexity threshold open-ended, it can also be used to analyze and compare the genetic complexity across several species to determine whether there is a significant difference from one species to the next, or whether all species share a common range of complexity.

Materials and Methods


This project took a significant amount of time to plan as well as to execute. In the end, the methods differed from how they were first envisioned, but they yielded the needed results. Our goal was to develop a program that would estimate the average complexity and variance for any species required. After those values were gathered, they would be used to create a threshold that would serve as a quality cutoff when analyzing a newly sequenced batch of ESTs. If the complexity of an EST fell below the threshold, the researcher could throw out the EST from the dataset.


To do this we had to prove a few assumptions. First, we had to prove that the complexity of a species’ ESTs was indeed normally distributed. Second, we had to show that the average complexity differed significantly from one species to another, to justify creating a program that would compute the complexity for a specific species. Third, we had to demonstrate that by selecting and analyzing a subset of the full set of ESTs on GenBank we could come up with an effective method to estimate the population average and standard deviation. If any of these assumptions failed, there would be little purpose in creating a program that would estimate a species’ average complexity and variance.

To prove these assumptions, we decided to analyze several model organisms: Oryza sativa (rice), Bos taurus (cow), Rattus norvegicus (rat), Homo sapiens (human), Drosophila melanogaster (fruit fly), Saccharomyces cerevisiae (yeast), Chlamydomonas reinhardtii, Caenorhabditis elegans, and Arabidopsis thaliana. Because of Dr. Udall’s work with cotton, we also analyzed Gossypium hirsutum and Gossypium raimondii. Knowing that we needed access to GenBank’s information, specifically its EST databases, we first downloaded FASTA files representing the full collection of ESTs for every species on GenBank. These files turned out to be over 40 GB and would have required lengthy parsing and indexing to provide the access we needed. We abandoned this tactic and researched other methods of accessing GenBank’s collection of ESTs.

Our solution came in the form of Entrez Utilities, a collection of utilities that can be used from Perl to run Entrez queries against GenBank’s EST database. We made a few scripts to download the model sequences. This proved to be more difficult than anticipated. We used esearch to query the EST database for the organism name, and then used the LWP::Simple::get() method and efetch to retrieve the results. Although there is no limit to the size that can be retrieved at one time, we ran into problems downloading an organism’s full collection of ESTs. While Gossypium raimondii has only a little over 63,000 ESTs online and could be retrieved in one sweep, Homo sapiens has more than 8,000,000 ESTs and kept timing out. Even after modifying the program to download sequences in sets of 50,000, which were concatenated later, it still took many hours to download the full collection. In fact, there were 95,174 ESTs from Homo sapiens that were never downloaded because of these problems, despite days of effort and multiple attempts.

After finally downloading the ESTs, we created a simple program to analyze the complexity of the sequences. The program uses BioPerl to load each sequence, then iterates through each base and increments a counter each time the iterator sees a new base. This complexity count is then normalized to the sequence size by dividing by the sequence length. Every time a sequence is analyzed, our program writes the sequence ID and the complexity (a number usually around 0.70) to a CSV file. After analyzing a whole FASTA file of sequences, our program closes the CSV file and outputs the average complexity of the species and the standard deviation of the complexities.
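The measure described above can be sketched in a few lines. This is an illustrative Python version, not the authors' BioPerl code; the function name `complexity` is hypothetical, and it assumes that "a new base" means a base that differs from the one immediately before it, with the count divided by the full sequence length.

```python
def complexity(seq: str) -> float:
    """Fraction of positions whose base differs from the preceding base.

    Assumption: the count of base-to-base changes is divided by the
    sequence length, as the text describes; an empty sequence scores 0.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    # Compare each base with its predecessor and count the changes.
    changes = sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)
    return changes / len(seq)


if __name__ == "__main__":
    print(complexity("AAAAAAAAAA"))  # homopolymer run: 0.0
    print(complexity("ACGTACGTAC"))  # 9 changes over 10 bases: 0.9
```

Under this definition a uniformly random sequence changes base about three times out of four, which is in line with the typical values near 0.70 reported here.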

We soon discovered that Excel cannot open CSVs with 8,000,000 columns. Again, we created a Perl script that iterated through each column of the CSV, created bins in 0.01 increments, and produced histogram data for each species. Finally this data could be viewed in Excel and used to create histograms that visually represent the complexity distribution of each of our species, which were in fact normally distributed.
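The binning step can be illustrated as follows. This is a minimal Python sketch rather than the Perl script used in the project; the function name `histogram` and the choice to label each bin by its lower edge are assumptions made for the example.

```python
from collections import Counter


def histogram(values, width=0.01):
    """Count values per bin of the given width, keyed by the bin's lower edge."""
    counts = Counter()
    for v in values:
        # Rounding before truncating guards against float error,
        # e.g. 0.29 / 0.01 evaluating to 28.999999999999996.
        counts[int(round(v / width, 9))] += 1
    return {round(i * width, 10): counts[i] for i in sorted(counts)}


if __name__ == "__main__":
    for edge, count in histogram([0.701, 0.709, 0.29, 0.175]).items():
        print(f"{edge:.2f},{count}")
```

Because only one counter per bin is kept in memory, a file with millions of complexity values reduces to at most about a hundred rows, which a spreadsheet can open without trouble.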

In Excel we created a table that provided hypothesis testing to determine whether the average complexities of the model organisms were significantly different. After seeing that they were, we had proved two of our three assumptions: that the complexities were normally distributed, and that the differences between average complexities were statistically significant from one species to another.
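The text does not say which test the Excel table used. Since the appendix lists sq(stdev)/n for each species, one plausible reading is a two-sample z-test on the means, sketched below in Python. The function name `two_sample_z` is hypothetical, and the choice of a two-sided z-test is an assumption; the input numbers are the C. elegans and D. melanogaster values from Figure A.

```python
import math


def two_sample_z(mean1, var_over_n1, mean2, var_over_n2):
    """Two-sided z-test for a difference in means.

    var_over_n1 and var_over_n2 are the sq(stdev)/n values of each sample.
    Returns the z statistic and the two-sided p-value.
    """
    standard_error = math.sqrt(var_over_n1 + var_over_n2)
    z = (mean1 - mean2) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for a standard normal
    return z, p


if __name__ == "__main__":
    # C. elegans vs. D. melanogaster, values taken from Figure A.
    z, p = two_sample_z(0.720108214, 5.26328e-09, 0.719327599, 4.2726e-09)
    print(f"z = {z:.3f}, p = {p:.3g}")
```

With sample sizes in the hundreds of thousands, even a difference in the third decimal place of the mean gives a very small p-value, which is worth keeping in mind when reading the results.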

Our final assumption to prove was that we could get an accurate estimate of the average complexity and standard deviation by analyzing only a sample of the full population of ESTs. We wanted to prove this so that the researcher would have a tool to estimate a species’ genetic complexity relatively quickly, without having to download the whole collection of ESTs. Again, we wrote a few scripts that analyzed 1000, 10000, and 100000 relatively random ESTs and estimated the species’ genetic complexity and standard deviation from those three sample sizes. Because the ESTs are stored in chronological order and we wanted to block against biases related to chronology, the source of the ESTs, and other factors that could affect our sample, we tried to get a random selection. Of course this is rather difficult to do, but we finally approached it in the following way. If we were downloading X ESTs from a full collection of Y ESTs, we would break both our sample and population size up into 100ths and then pull a 100th of the sample at each 100th increment of the population of ESTs. So if we were downloading 100 ESTs from a species that had 1000 ESTs in total and used ten blocks, we would download sequences 0-10, 100-110, 200-210, etc. This provided a relatively simple way to get an approximation of a random sample.
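The block-sampling scheme can be sketched as below. This is an illustrative Python version, not the project's Perl script; the function name `block_ranges` is hypothetical, and the ranges are half-open (start included, stop excluded), which is an assumption about how the 0-10, 100-110 ranges in the text were meant.

```python
def block_ranges(sample_size, population_size, blocks=100):
    """Return (start, stop) index ranges spread evenly across the population.

    Each block contributes sample_size // blocks records, and blocks begin
    every population_size // blocks records.
    """
    if blocks <= 0 or sample_size < blocks or population_size < sample_size:
        raise ValueError("need 0 < blocks <= sample_size <= population_size")
    per_block = sample_size // blocks
    stride = population_size // blocks
    return [(i * stride, i * stride + per_block) for i in range(blocks)]


if __name__ == "__main__":
    # The worked example from the text, using ten blocks.
    for start, stop in block_ranges(100, 1000, blocks=10)[:3]:
        print(start, stop)
```

Spreading the blocks across the whole collection is what guards against chronological bias: a single contiguous download would only cover ESTs submitted in one period.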

The three sample sizes of 1000, 10000, and 100000 were downloaded for each species, and the estimates for the average complexity and standard deviation were compared to the actual values. To compare the estimates with the actual values, we created thresholds using both. We took the actual average complexity and subtracted 1.64 standard deviations from that value to find a threshold above which approximately 95% of GenBank’s ESTs fall. We then did the same with the estimated values and measured the difference in thresholds. We also quantified what percentage error this represented, given the actual average and standard deviation and assuming a normal distribution. For example, if our estimated threshold was only accurate enough to give a value above which 90% of GenBank’s ESTs fall, the percentage error of this estimate would be 5%, because it represents an error of 5% of the population. Similarly, an estimated threshold above which 98% of GenBank’s ESTs fall would represent an error of 3%. We created graphs to indicate that our estimates provided accuracy better than 5% for samples of 1000, better than 2.5% for samples of 10000, and better than 1.6% for samples of 100000.
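The error measure described above can be worked through numerically. The sketch below is an illustrative Python version under the same normality assumption; the function names `threshold` and `threshold_error` are hypothetical, and the inputs are the H. sapiens population and 1000-sample values from Figure E.

```python
from statistics import NormalDist


def threshold(mean, stdev, k=1.64):
    """Complexity cutoff k standard deviations below the mean."""
    return mean - k * stdev


def threshold_error(pop_mean, pop_sd, est_mean, est_sd, k=1.64):
    """Difference between the share of the population above the estimated
    threshold and the share above the true threshold, assuming normality."""
    population = NormalDist(pop_mean, pop_sd)
    target = 1 - population.cdf(threshold(pop_mean, pop_sd, k))
    actual = 1 - population.cdf(threshold(est_mean, est_sd, k))
    return abs(actual - target)


if __name__ == "__main__":
    # H. sapiens: full population vs. the 1000-sequence sample (Figure E).
    err = threshold_error(0.704307195, 0.049948167, 0.703982785, 0.044982798)
    print(f"{err:.1%}")
```

For these inputs the result comes out at roughly 1.8%, in line with the figure quoted for the 1000-sequence H. sapiens sample in the Results.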

With all of our assumptions proven, we created our functional programs. Our program getcomplex.pl prompts the user for a genus, species, sample size, and number of blocks to break the sample into. It then uses the previously outlined methods to download the sample to a FASTA file, analyze the complexities, write a CSV file with the sequence IDs and complexities, analyze that CSV to create another CSV of histogram data with bins in 0.01 increments, and output the estimated average complexity and standard deviation. This allows a researcher to analyze any documented species, keep a sample of its ESTs in a FASTA file, look at the complexity distribution of that species in histogram form, and get an estimate for the average complexity and standard deviation of complexities.

With that data, the researcher can use sort.pl to weed out low-complexity sequences from a FASTA file of ESTs. This program prompts the user for a FASTA file of sequences, an average complexity, a standard deviation value, and the number of standard deviations to use in calculating a threshold (1.64 is recommended to provide a 95% confidence level). Then sort.pl takes the average, subtracts 1.64 (or whatever the user provided) times the standard deviation, and calculates a complexity threshold. The program then iterates through every sequence in the FASTA file and stores the sequences with complexities above the threshold in one file and the low-complexity sequences in another file. If the default value is used, all sequences in the low-complexity file have a complexity below that of 95% of GenBank’s ESTs. This design keeps the program versatile and lets it accommodate the researcher’s needs.
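The sorting step can be sketched as follows. This is an illustrative Python version, not the code of sort.pl; the function names are hypothetical, the sequences are made-up examples, records are passed in as (ID, sequence) pairs instead of being parsed from a FASTA file, and treating a complexity exactly equal to the cutoff as high complexity is an assumption.

```python
def complexity(seq):
    """Fraction of positions whose base differs from the preceding base."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b) / len(seq)


def split_by_complexity(records, mean, stdev, k=1.64):
    """Partition (id, sequence) pairs around the cutoff mean - k * stdev."""
    cutoff = mean - k * stdev
    high, low = [], []
    for seq_id, seq in records:
        # Assumption: a sequence exactly at the cutoff counts as high complexity.
        target = high if complexity(seq) >= cutoff else low
        target.append((seq_id, seq))
    return high, low


if __name__ == "__main__":
    records = [
        ("est1", "ACGTTGCAAGCTAGCTTACG"),  # mixed sequence
        ("est2", "AAAAAAAAAAAAAAAAAAAC"),  # almost a pure homopolymer run
    ]
    high, low = split_by_complexity(records, mean=0.70, stdev=0.05)
    print([seq_id for seq_id, _ in high], [seq_id for seq_id, _ in low])
```

Keeping the number of standard deviations as a parameter mirrors the versatility described above: a stricter or looser cutoff only changes `k`.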

Results


After downloading the datasets for each model organism, the complexities were calculated and examined in histogram form. The histograms are shown in Figure A. These graphs show what we suspected: overall, the complexities of a species’ ESTs are normally distributed. The average complexity was around 0.70. Most species followed this normal distribution fairly well, but there were a few exceptions. Most notably, we started this project because of low-complexity sequences from Gossypium raimondii. This species’ graph shows a slightly bimodal character, as seen in Figure B. On top of having a mode at the usual value of around 0.70, G. raimondii has another, much smaller mode around 0.175. Not only is this odd and uncharacteristic of our other model organisms, but it also drastically affects our estimation of the average and standard deviation. While most other species had an average around 0.70 and a standard deviation of about 0.05, G. raimondii had an average of 0.66 and a standard deviation of 0.12. Our program getcomplex.pl allows the researcher to analyze the histogram data for a species to guard against abnormal behavior such as this, which results in poor estimates.


We tried to locate the source of or reason behind this abnormality in G. raimondii. We downloaded all ESTs in the complexity range of 0.15 to 0.25 and compared these sequences to those with a complexity of 0.70. Parsing the records for indications of why such a large number of sequences have low complexity left us with few clues. The sequences from both modes seemed to be on the same plates, and in the same rows and columns. They all seemed to be submitted in the same batch. We also looked at the sources. All of the low-complexity sequences were submitted by the University of Arizona. We felt this might be a clue, and it still might be, but we also discovered that, with the exception of a few records from the University of Georgia, all of the sequences with 0.70 complexity were also from the University of Arizona. Because a 0.20 complexity represents a very simple sequence, we have a hard time believing that these sequences are biologically significant and are actually being transcribed in G. raimondii. This position is also supported by the absence of bimodal behavior to this extent in any of the other model species we analyzed, most importantly the close relative G. hirsutum. It is also important to remember that G. raimondii’s EST collection is among the smallest of our model organisms, with 63577 records, and that a majority of those records seem to be from one contributor. This leaves G. raimondii prone to scientifically insignificant aberrations.


Although G. raimondii was the only species to display radical abnormalities, some other species displayed skewed distributions or slight bimodal behavior, though less pronounced. A. thaliana, B. taurus, and R. norvegicus displayed a slight skew to the left. D. melanogaster also displayed a distribution skewed to the left, with a slight mode on the tail, as shown in Figure C. A lack of time and of an effective method to analyze these abnormalities precluded our further investigation of these behaviors.


After computing the graphs, we used hypothesis testing to determine whether these species’ differences in mean complexity were statistically significant. Every pair of species was compared, and the results are displayed in Figure D. Although the figure is difficult to read, the tan cells are p-values that represent the probability that the two species being compared have the same mean and that the differences in these samples were due to random chance. As seen, 3.35E-15 is the largest probability value; therefore we can be very confident that these differences between species’ complexity averages are statistically significant. Whether these differences are biological or due to the quality of sequencing and vector trimming, however, is not known at this time.



To justify writing a program to take a sample of a species’ ESTs and estimate the average complexity and standard deviation, we had to prove that this would give reliable results. We tested sample sizes of 1000, 10000, and 100000 ESTs and compared the values these samples gave us to the actual values from the full datasets in Figure E. The values were quite good. As could be expected, accuracy generally improved with a larger sample size. We used these values to estimate a threshold that would include the top 95% of sequences in the high-complexity group and then compared these estimated thresholds to our actual values. Assuming a normal distribution, our estimated values at 100000 samples resulted in an error of 0% to 1.6%. Since we graphed the absolute value of the percent error, this means, for example, that our estimated threshold with the 1.6% error would include either 96.6% or 93.4% of sequences in the high-complexity group of our sorting method, when we were aiming for a 95% rate. The accuracy of these sample sizes is graphed in Figure F. Even a sample size of 10000 gives fairly accurate results, with a 0% to 2.4% range of error. Our 1000-sequence samples resulted in a 0.1% to 5% range of error, which is very good considering that a 1000-sequence sample of H. sapiens, with its 8068722 ESTs, covers 0.012% of the full population and resulted in an estimate with only 1.8% error at the 95% threshold.


After analyzing the previous data, we programmed getcomplex.pl. This is a tool for researchers to analyze the complexity distribution of a user-supplied species and estimate the average complexity and standard deviation of complexity. This command line program takes the following parameters: <Number of ESTs to grab> <Number of blocks to break up sample size into> <Genus> <Species> <Whether or not to output histogram data (y/n)>. Figure G shows a sample use of this program. If these command line arguments are omitted, getcomplex.pl prompts the user for them. Samples of the various output files are included in Figure H.


Our second program, sort.pl, is used to sort a FASTA file of EST sequences into high-complexity and low-complexity files. It takes the following command line arguments: <Input FASTA file> <Average complexity> <Standard deviation> <Number of standard deviations below average to set threshold at. If omitted, use 95% confidence>. Figure I shows a sample use of this program. If these command line arguments are omitted, sort.pl will also prompt the user for them. The output is written to two FASTA files. Both programs are more fully documented in their source code and are available at http://psoda4.cs.byu.edu/~arkangel/sort.pl and http://psoda4.cs.byu.edu/~arkangel/getcomplex.pl. A sample EST from both the high and low output files is included in Figure J.


Conclusion


We were able to prove all of the assumptions that were necessary to justify creating getcomplex.pl and sort.pl. That is, we showed that the complexity of ESTs stored on GenBank follows a normal distribution, that there are statistically significant differences between the average complexities from species to species, and that a small sample can be used to give an accurate estimation of the full collection of a species’ ESTs.


Along the way we discovered some interesting trends. Several species show abnormal behavior in their distributions, specifically G. raimondii, which actually served as our motivation to undertake this analysis. The distribution of complexities for G. raimondii was slightly bimodal, which was also seen to a lesser extent in D. melanogaster. Other species showed abnormal tails. Investigation into G. raimondii left us with few clues as to why this behavior is observed. All of the low-complexity sequences we analyzed were from the same source, which would lead us to believe that there might be a correlation, although this discovery is made less significant by the relatively small number of total ESTs on GenBank and the high percentage of the total ESTs sequenced by the source in question.


Our tools getcomplex.pl and sort.pl are effective at what they were designed to do. Both tools have been programmed with versatility in mind to maximize their application. Although the difference in average complexity between species is statistically significant, it is still unclear whether there is enough variation to justify the use of getcomplex.pl for every new species considered, or whether a more general threshold could be used with similar results. However, getcomplex.pl was programmed with the ability to also download, store, and analyze the complexities for a species’ full collection of ESTs rather than just a sample, to make it useful in further studies analyzing either the quality of GenBank’s data for a species or the biological differences in complexity between species.


Considering our findings, there are several areas in which further study would be beneficial. The issue of bimodal distributions and abnormal distribution behavior for several of the model species could benefit from further research. Perhaps an investigation into the methods used to analyze the ESTs for G. raimondii at the University of Arizona could be launched to identify whether this abnormality was indeed a biological finding or whether it was due to errors in the laboratory processes. Also, the process of discarding low-complexity sequences using sort.pl requires justification, and further research is needed to determine the reasons behind low-complexity sequences. If a researcher is discarding large portions of low-complexity sequences, then rather than moving forward as planned, there should be investigations into the laboratory processes to make sure that whatever the problem may be is rectified, to preserve the integrity of the sequences considered high-complexity. Finally, some of the alleged high-complexity sequences showed marked bias in terms of base composition, e.g. “ATATATATTTATATATATATA”. Our tools could easily be modified to do a more general analysis of complexity that would also consider base composition. It would be interesting to see if such an analysis would highlight other statistically significant trends, although there seems to be much more work already devoted to this study.

Bibliography

Knight, R. D., Freeland, S. J., & Landweber, L. F. (2001). Rewiring the Keyboard: Evolvability of the Genetic Code. Nature Reviews: Genetics, 2, 49-58.

Malde, K., Schneeberger, K., Coward, E., & Jonassen, I. (2006). RBR: library-less repeat detection for ESTs. Bioinformatics, 22(18), 2232-2236.
















Appendix

Figure A.

[Histograms of EST complexity, one panel per species: Saccharomyces cerevisiae, Arabidopsis thaliana, Bos taurus, Caenorhabditis elegans, Chlamydomonas reinhardtii, Drosophila melanogaster, Gossypium hirsutum, Gossypium raimondii, Homo sapiens, Oryza sativa, Rattus norvegicus. Horizontal axis: complexity from 0 to 0.95; vertical axis: number of ESTs.]

Species            AVG            STDEV          n          sq(stdev)/n
S. cerevisiae      0.70354225     0.030872309    34915      2.72977E-08
A. thaliana        0.701177533    0.068968142    1526133    3.11677E-09
B. taurus          0.700498458    0.048474898    1517139    1.54885E-09
C. elegans         0.720108214    0.043045343    352043     5.26328E-09
C. reinhardtii     0.743186758    0.040043046    202044     7.93612E-09
D. melanogaster    0.719327599    0.049511605    573749     4.2726E-09
G. hirsutum        0.696788579    0.045301953    268775     7.63563E-09
G. raimondii       0.664671947    0.122276641    63577      2.35173E-07
H. sapiens         0.704307195    0.049948167    8068722    3.09196E-10
O. sativa          0.725641103    0.064738499    1220909    3.43275E-09
R. norvegicus      0.702131811    0.067169628    951375     4.74236E-09

Figure B.

This graph shows the bimodal character of G. raimondii’s ESTs on GenBank.

Figure C.


D. melanogaster also showed slight abnormal bimodal behavior in its distribution.

Figure D.


Hypothesis testing comparing each species’ values to every other species. The yellow square serves as a key for the center squares, while the upper right box is a key for the blue boxes. The highlighted tan numbers are probability values that the two species’ means are the same and that any variation is attributable to random chance. As seen, 3.35E-15 is the largest p-value and corresponds to over a 99.9999999999% probability that these populations have statistically different means.

Figure E.

Species (n)                  Statistic   Population     1000           10000          100000
A. thaliana (1526133)        Avg         0.701177533    0.704953844    0.701818485    0.70012193
                             StDev       0.068968142    0.060268948    0.067891779    0.068756037
B. taurus (1517139)          Avg         0.700498458    0.704789524    0.703444384    0.702401342
                             StDev       0.048474898    0.040374354    0.04451912     0.045552423
C. elegans (352043)          Avg         0.720108214    0.718904761    0.719863962    0.719264242
                             StDev       0.043045343    0.044135786    0.041741025    0.043294473
C. reinhardtii (202044)      Avg         0.743186758    0.745321006    0.742471104    0.742788144
                             StDev       0.040043046    0.038213564    0.042210503    0.040420094
D. melanogaster (573749)     Avg         0.719327599    0.718491077    0.720603059    0.718887618
                             StDev       0.049511605    0.050877726    0.048422207    0.050121207
G. hirsutum (268775)         Avg         0.696788579    0.695711555    0.696861365    0.697040056
                             StDev       0.045301953    0.04441335     0.045120819    0.044956093
G. raimondii (63577)         Avg         0.664671947    0.666439442    0.664654113    0.664630431
                             StDev       0.122276641    0.12192608     0.122039799    0.122340832
H. sapiens (8068722)         Avg         0.704307195    0.703982785    0.703725897    0.703115552
                             StDev       0.049948167    0.044982798    0.046844251    0.048067702
O. sativa (1220909)          Avg         0.725641103    0.727830913    0.724950541    0.725200084
                             StDev       0.064738499    0.061083872    0.061916799    0.067409556
R. norvegicus (951375)       Avg         0.702131811    0.700817867    0.70150445     0.70288693
                             StDev       0.067169628    0.068064136    0.068477837    0.065844582
S. cerevisiae (34915)        Avg         0.70354225     0.702401286    0.703579347    0.70354411
                             StDev       0.030872309    0.031368752    0.030343204    0.030864195

This chart compares how accurate the estimates for average complexity and standard deviation were. The sample sizes are 1000, 10000, and 100000, and they are compared against the actual values from the full population.

Figure F.


This chart shows that, assuming a normal distribution, estimating the threshold to include the top 95% of sequences by using 100000 samples results in at most about a 1.5% error and at best a negligible error.

[Chart: cumulative percentage difference from the threshold of the full data set (vertical axis, 0.0% to 6.0%) for 1000, 10000, and 100000 samples.]

Figure G.


First couple of lines of output from getcomplex.pl with and without command line arguments.

Figure H.


File containing results of complexity analysis.


Portion of the histogram CSV created to analyze the distribution of complexities.


Portion of the CSV created with IDs and complexity values. This will be useful in trying to identify
abnormal behavior in the distributions.

Figure I.


Output from sort.pl with and without command line arguments.

Figure J.



The top sequence is a high-complexity sequence sorted by sort.pl and the bottom is a low-complexity sequence from the output files. These are sequences from our test data set, based on an EST collection for G. raimondii provided by Dr. Udall.