Sequence Complexity: Masking Low Quality Regions and Measuring
Complexity Across Species and GenBank Contributors
By Gabriel Proulx
Abstract
Sequencing genomes still requires considerable equipment and resources. While a full sequencing of
all known species is obviously the end goal, expressed sequence tags (ESTs) can suffice in the
meantime as an easier and less resource-intensive method of mapping out a species’ expressed genes.
ESTs, however, have their disadvantages. They are limited in both length and quality. Often, due to
reading errors, polymerase slippage, or contamination, ESTs include low-quality segments. While there
are several methods to remedy the low quality of EST reads, they often focus on finding repeated
segments rather than general low complexity. We feel that through a simple algorithm measuring
complexity (the number of changes from one nucleotide to another as one iterates through a sequence)
we have improved upon the existing programs and will hopefully allow for a better alignment of ESTs.
It could also be of biological significance to analyze the complexity distributions of several model
organisms’ ESTs stored on GenBank.
Introduction
While working with and analyzing ESTs from Gossypium raimondii, a species of cotton, BYU’s Dr.
Joshua Udall noticed that many of the DNA reads appeared to display low complexity. This is
characterized by long contiguous strings of the same DNA base rather than the more random-looking
sequence that would be expected. It is known that ESTs are often ridden with various problems from
“sequencing artifacts, contamination, low quality sequence and genomic repeats” (Malde,
Schneeberger, Coward, & Jonassen, RBR: library-less repeat detection for ESTs 2232). Regardless of
the cause, sequences of low complexity hinder the ability to align EST sequences together. Because of
the inexpensive and useful nature of EST sequencing and mapping, and the prohibitive cost of a full
genomic sequencing, it is desirable to improve EST clustering by throwing out sequences of low
complexity.
Removing sequences of low complexity necessitates standards for what is considered complex. Rather
than setting an arbitrary threshold of complexity, a more statistically sound method was used in this
project. We gathered a set of gene sequences from either the species in question or a similar
species, and analyzed their complexity. Statistical methods were then used to identify the mean, the
variance, and ultimately a threshold of complexity. With a statistical threshold in hand, if a
certain EST fails to meet it, the researcher can throw out the low-complexity sequence with a
statistical level of confidence that it does not conform to the data already online in GenBank.
Furthermore, it is known that “species differ vastly in their genomic composition, from 74% G+C
(Micrococcus luteus) down to only 25% G+C (Mycoplasma capricolum)” (Knight, Freeland, & Landweber
55). But do these differences in genomic composition also relate to variance in sequence complexity?
By keeping the program created to identify a complexity threshold open-ended, it can also be used to
analyze and compare the genetic complexity across several species to determine whether there is a
significant difference from one species to the next, or whether all species share a common range of
complexity.
Materials and Methods
This project took a significant amount of time to plan as well as to execute. In the end, the methods
differed from how they were first envisioned, but they yielded the needed results. Our goal was to
develop a program that would estimate the average complexity and variance for any species required.
After those values were gathered, they would be used to create a threshold that would serve as a
quality cutoff when analyzing a newly sequenced batch of ESTs. If the complexity of an EST fell below
the threshold, the researcher could throw out the EST from the dataset.
To do this we had to prove a few assumptions. First, we had to prove that the complexities of a
species’ ESTs were indeed normally distributed. Second, we had to show that the average complexity
differed significantly from one species to another, to justify creating a program that would compute
the complexity for a specific species. Third, we had to demonstrate that by selecting and analyzing a
subset of the full ESTs on GenBank we could effectively estimate the population average and standard
deviation. If any of these assumptions failed, there would be little purpose in creating a program
that would estimate a species’ average complexity and variance.
To prove these assumptions, we decided to analyze several model organisms: Oryza sativa (rice), Bos
taurus (cow), Rattus norvegicus (rat), Homo sapiens (human), Drosophila melanogaster (fruit fly),
Saccharomyces cerevisiae (yeast), Chlamydomonas reinhardtii, Caenorhabditis elegans, and Arabidopsis
thaliana. Because of Dr. Udall’s work with cotton, we also analyzed Gossypium hirsutum and Gossypium
raimondii.
Knowing that we needed access to GenBank’s information, specifically its EST databases, we first
downloaded FASTA files representing the full collection of ESTs for every species on GenBank. These
files turned out to be over 40 GB and would have required lengthy parsing and indexing to provide the
access we needed. We abandoned this tactic and researched other methods of accessing GenBank’s
collection of ESTs.
Our solution came in the form of Entrez Utilities, a collection of utilities that can be used from
Perl to run Entrez queries against GenBank’s EST database. We wrote a few scripts to download the
model sequences. This proved to be more difficult than anticipated. We used esearch to query the EST
database for the organism name, and then used the LWP::Simple::get() method and efetch to retrieve
the results. Although there is no limit to the size that can be retrieved at one time, we ran into
problems downloading an organism’s full collection of ESTs. While Gossypium raimondii has only
63,000+ ESTs online and could be retrieved in one sweep, Homo sapiens has 8,000,000+ ESTs and kept
timing out. Even after modifying the program to download sequences in sets of 50,000, which were
concatenated later, it still took many hours to download the full collection. In fact, there were
95,174 ESTs from Homo sapiens that were never downloaded because of these problems, despite days of
effort and multiple attempts.
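The batching described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl and LWP::Simple code used in the project. It only builds the request URLs and makes no network calls. The helper names and the database name "nucest" are assumptions made for illustration, and NCBI may cap the number of records per request below the 50,000 used here.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(organism, db="nucest"):
    """Build an esearch URL that keeps the result set on NCBI's history server.

    The database name "nucest" is an assumption for illustration only.
    """
    params = {"db": db, "term": f"{organism}[Organism]", "usehistory": "y"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_batch_urls(web_env, query_key, total, batch_size=50000, db="nucest"):
    """Return one efetch URL per batch of at most batch_size FASTA records."""
    urls = []
    for start in range(0, total, batch_size):
        params = {
            "db": db,
            "WebEnv": web_env,        # taken from the esearch response
            "query_key": query_key,   # taken from the esearch response
            "retstart": start,        # index of the first record in this batch
            "retmax": min(batch_size, total - start),
            "rettype": "fasta",
            "retmode": "text",
        }
        urls.append(f"{EUTILS}/efetch.fcgi?{urlencode(params)}")
    return urls
```

Each URL would then be fetched in turn and the responses concatenated into one FASTA file. Keeping the batch size a parameter makes it easy to retry with smaller requests when a download times out.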
After finally downloading the ESTs, we created a simple program to analyze the complexity of the
sequences. The program uses BioPerl to load each sequence, iterates through each base, and increments
a counter each time the iterator sees a new base. This complexity count is then normalized to the
sequence size by dividing by the sequence length. Every time a sequence is analyzed, our program
writes the sequence ID and the complexity (a number usually around 0.70) to a CSV file. At the end of
analyzing a whole FASTA file of sequences, our program closes the CSV file and outputs the average
complexity of the species and the standard deviation of the complexities.
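A minimal sketch of this measure, written in Python rather than the BioPerl used in the project, is shown below. It assumes that "a new base" means a base that differs from the one immediately before it, so the first base is not counted, and it assumes the population form of the standard deviation. The function names are illustrative.

```python
import statistics


def complexity(seq):
    """Number of base-to-base changes divided by the sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # Compare every base with its predecessor and count the changes.
    changes = sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)
    return changes / len(seq)


def summarize(sequences):
    """Return (mean, standard deviation) of the complexities of the sequences."""
    values = [complexity(s) for s in sequences]
    # Population standard deviation is an assumption; the text does not
    # say which form the original program used.
    return statistics.mean(values), statistics.pstdev(values)
```

Under this definition a homopolymer run such as "AAAA" scores 0, while "ACGT" scores 0.75 because three of its four positions differ from the base before them.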
It was soon discovered that Excel cannot open CSVs with 8,000,000 rows. Again, we created a Perl
script that iterated through each row of the CSV, created bins in increments of 0.01, and produced
histogram data for each species. Finally, this data could be viewed in Excel and used to create
histograms that visually represent the complexity distribution of each of our species, which were in
fact approximately normally distributed.
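The binning step can be sketched as below. This is an illustrative Python version, not the Perl script itself, and it assumes that each bin covers the range from its lower edge up to, but not including, the next edge.

```python
def histogram(values, bin_width=0.01):
    """Count how many complexity values fall into each bin between 0 and 1."""
    n_bins = round(1 / bin_width)
    counts = [0] * n_bins
    for value in values:
        # Rounding before truncating guards against floating-point results
        # such as 0.29 / 0.01 evaluating to 28.999999999999996.
        index = int(round(value / bin_width, 9))
        # Clamp so that a value of exactly 1.0 lands in the last bin.
        index = min(max(index, 0), n_bins - 1)
        counts[index] += 1
    return counts
```

The entry at position i holds the number of sequences whose complexity lies in the bin starting at i times the bin width, which is small enough to chart in a spreadsheet regardless of how many sequences were analyzed.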
In Excel we created a table that provided hypothesis testing to determine whether the average
complexities of the model organisms were significantly different. After seeing that they were, we had
proved two of our three assumptions: that the complexities were normally distributed and that the
differences between average complexities were statistically significant from one species to another.
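The text does not name the test that was built in Excel. The sq(stdev)/n values tabulated in Figure A suggest a two-sample z-test on the means, so the sketch below assumes that test with a two-sided p-value. It is an illustration in Python, not the original spreadsheet, and the function name is hypothetical.

```python
import math


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample z-test for a difference in means (assumed test).

    Returns the z statistic and the two-sided p-value.
    """
    # Standard error of the difference: sqrt(sd1^2/n1 + sd2^2/n2).
    std_err = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With sample sizes in the hundreds of thousands the standard error becomes very small, so even differences in the third decimal place of the mean produce very small p-values.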
Our final assumption to prove was that we could get an accurate estimate of the average complexity
and standard deviation by analyzing only a sample of the full population of ESTs. We wanted to prove
this so that the researcher would be afforded a tool to estimate a species’ genetic complexity
relatively quickly, without having to download the whole collection of ESTs. Again, we wrote a few
scripts that analyzed 1000, 10000, and 100000 relatively random ESTs and estimated the species’
genetic complexity and standard deviation from those three sample sizes. Because the ESTs are stored
in chronological order and we wanted to block against biases related to chronology, the source of the
ESTs, and other factors that could affect our sample, we tried to get a random selection. Of course
this is rather difficult to do, but we finally approached it in the following way. If we were
downloading X ESTs from a full collection of Y ESTs, we would break both our sample and population
size up into 100ths and then pull a 100th of the sample at each 100th increment of the population of
ESTs. To illustrate with 10 blocks rather than 100: if we were downloading 100 ESTs from a species
that had 1000 ESTs in total, we would download sequences 0-10, 100-110, 200-210, etc. This provided a
relatively simple way to get an approximation of a random sample.
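The block scheme can be sketched as a small Python helper. The function name is hypothetical, and the sketch assumes that each block is described by a zero-based start index and a record count, and that any remainder left by the integer division is simply dropped.

```python
def block_ranges(sample_size, population_size, blocks=100):
    """Return (start, count) pairs that spread a sample evenly over a population."""
    if blocks <= 0 or sample_size > population_size:
        raise ValueError("need blocks > 0 and sample_size <= population_size")
    per_block = sample_size // blocks    # ESTs taken from each block
    stride = population_size // blocks   # distance between block starts
    return [(i * stride, per_block) for i in range(blocks)]
```

Calling block_ranges(100, 1000, blocks=10) gives ten blocks of 10 records starting at 0, 100, 200, and so on, which corresponds to the example in the text.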
The three sample sizes of 1000, 10000, and 100000 were downloaded for each species, and the estimates
for the average complexity and standard deviation were compared to the actual values. To compare the
estimates with the actual values, we created thresholds using both. We took the actual average
complexity and subtracted 1.64 standard deviations from that value to find a threshold above which
approximately 95% of GenBank’s ESTs had a higher complexity. We then did this with the estimated
values and measured the difference in thresholds. We also quantified what percentage error this
represented, considering the actual average and standard deviation and assuming a normal
distribution. For example, if our estimated threshold was only so accurate that it gave us a value
above which 90% of GenBank’s ESTs had a higher complexity, the percentage error of this estimate
would be 5%, because it represents an error of 5% of the population. Similarly, an estimate above
which 98% of GenBank’s ESTs had a higher complexity would represent an error of 3%. We created graphs
to indicate that our estimates provided accuracy better than 5% for samples of 1000, better than 2.5%
for samples of 10000, and better than 1.6% for samples of 100000.
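The error measure described above can be sketched in Python as follows. The function names are illustrative, and the sketch assumes the error is the absolute difference between 95% and the share of a normal population, with the actual mean and standard deviation, that lies above the estimated threshold.

```python
from statistics import NormalDist


def threshold(mean, sd, num_sd=1.64):
    """Complexity cutoff placed num_sd standard deviations below the mean."""
    return mean - num_sd * sd


def threshold_error(pop_mean, pop_sd, est_mean, est_sd, num_sd=1.64, target=0.95):
    """Absolute error, as a fraction of the population, of an estimated cutoff."""
    est_threshold = threshold(est_mean, est_sd, num_sd)
    # Share of the assumed-normal population lying above the estimated cutoff.
    fraction_above = 1.0 - NormalDist(pop_mean, pop_sd).cdf(est_threshold)
    return abs(fraction_above - target)
```

An estimated cutoff that keeps 90% of the population above it would therefore score 0.05, matching the 5% example in the text.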
With all of our assumptions proven, we created our functional programs. Our program getcomplex.pl
prompts the user for a genus, species, sample size, and number of blocks to break the sample size
into. It then uses the previously outlined methods to download the sample to a FASTA file, analyze
the complexities, write a CSV file with the sequence IDs and complexities, analyze that CSV to create
another CSV with histogram data in bins of 0.01, and output the estimated average complexity and
standard deviation. This allows a researcher to analyze any documented species, have a sample of the
ESTs in a FASTA file, look at the complexity distribution of that species in histogram form, and get
an estimate for the average complexity and standard deviation of complexities.
With that data, the researcher can use sort.pl to weed out low-complexity sequences from a FASTA file
of ESTs. This program prompts the user for a FASTA file of sequences, an average complexity, a
standard deviation value, and the number of standard deviations to use in calculating a threshold
(1.64 is recommended to provide for a 95% confidence level). Then sort.pl takes the average,
subtracts 1.64 (or whatever the user provided) times the standard deviation, and calculates a
complexity threshold. The program then iterates through every sequence in the FASTA file and stores
the sequences with complexities above the threshold in one file and the low-complexity sequences in
another file. If the default value is used, all sequences in the low-complexity file have a
complexity below that of approximately 95% of GenBank’s ESTs. This design keeps the program versatile
and accommodates the researcher’s needs.
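The sorting step can be sketched as below. This is an illustrative Python version that works on FASTA text in memory. It is not the sort.pl source, the function names are hypothetical, and it assumes that a sequence whose complexity exactly equals the threshold is kept in the high-complexity group.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def complexity(seq):
    """Number of base-to-base changes divided by the sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(a != b for a, b in zip(seq, seq[1:])) / len(seq)


def split_by_complexity(fasta_text, mean, sd, num_sd=1.64):
    """Split records into (high, low) lists around mean - num_sd * sd."""
    cutoff = mean - num_sd * sd
    high, low = [], []
    for header, seq in parse_fasta(fasta_text):
        # Records at or above the cutoff are kept; the rest are set aside.
        (high if complexity(seq) >= cutoff else low).append((header, seq))
    return high, low
```

With a mean of 0.70 and a standard deviation of 0.05, the default cutoff is 0.618, so a long run of a single base falls into the low group while a varied sequence stays in the high group.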
Results
After downloading the datasets for each model organism, the complexities were calculated and examined
in histogram form. The histograms are shown in Figure A. These graphs show what we suspected:
overall, the complexities of a species’ ESTs are normally distributed. The average complexity was
around 0.70. Most species followed this normal distribution fairly well, but there were a few
exceptions. Most notably, the low-complexity sequences from Gossypium raimondii were the reason we
started this project. This species’ graph shows a slightly bimodal character, as seen in Figure B. On
top of having a mode at the usual value of around 0.70, G. raimondii has another, much smaller mode
around 0.175. Not only is this odd and uncharacteristic of our other model organisms, but it also
drastically affects our estimation of the average and standard deviation. While most other species
had an average around 0.70 and a standard deviation of about 0.05, G. raimondii had an average of
0.66 and a standard deviation of 0.12. Our program getcomplex.pl allows the researcher to analyze the
histogram data for a species to guard against abnormal behavior such as this, which results in poor
estimates.
We tried to locate the source of or reason behind this abnormality in G. raimondii. We downloaded all
ESTs in the complexity range of 0.15 to 0.25 and compared these sequences to those with a complexity
of 0.70. Parsing the records for indications of why such a large number of sequences have low
complexity left us with few clues. The sequences from both modes seemed to be on the same plates, and
in the same rows and columns. They all seemed to be submitted in the same batch. We also looked at
the sources. All of the low-complexity sequences were submitted by the University of Arizona. We felt
this might be a clue, and it still might be, but we also discovered that, with the exception of a few
records from the University of Georgia, all of the sequences with 0.70 complexity were also from the
University of Arizona. Because a 0.20 complexity represents a very simple sequence, we have a hard
time believing that these sequences are biologically significant and are actually being transcribed
in G. raimondii. This position is also supported by the absence of bimodal behavior to this extent in
any of the other model species we analyzed, most importantly the close relative G. hirsutum. It is
also important to remember that G. raimondii’s EST collection is among the smallest of our model
organisms, with 63577 records, and that a majority of those records seem to be from one contributor.
This leaves G. raimondii prone to scientifically insignificant aberrations.
Although G. raimondii was the only species to display radical abnormalities, some other species
displayed skewed distributions or slight bimodal behavior, although it was not as pronounced. A.
thaliana, B. taurus, and R. norvegicus displayed a slight skew to the left. D. melanogaster also
displayed a distribution skewed to the left with a slight mode on the tail, as shown in Figure C. Lack
of time and of an effective method to analyze these abnormalities precluded further investigation of
these behaviors.
After computing the graphs, we used hypothesis testing to determine whether these species’
differences in mean complexity were statistically significant. Every combination of species was
compared, and the results are displayed in Figure D. Although the figure is difficult to read, the
tan cells are p-values: the probability of seeing differences this large through random chance alone
if the two species being compared had the same mean. As seen, 3.35E-15 is the largest probability
value; therefore, we can be very confident that these differences between species’ complexity
averages are statistically significant. Whether these differences are biological or due to the
quality of sequencing and vector trimming, however, is not known at this time.
To justify writing a program to take a sample of a species’ ESTs and estimate the average complexity
and standard deviation, we had to prove that this would give reliable results. We tested sample sizes
of 1000, 10000, and 100000 ESTs and compared the values these samples gave us to the actual values
from the full datasets in Figure E. The estimates were quite close. As could be expected, with a
larger sample size the accuracy generally improved. We used these values to estimate a threshold that
would include the top 95% of sequences in the high-complexity group and then compared these estimated
thresholds to our actual values. Assuming a normal distribution, our estimated values at 100000
samples resulted in a 0% to 1.6% error. Since we graphed the absolute value of the percent error,
this means, for example, that our estimated threshold with the 1.6% error would include either 96.6%
or 93.4% of sequences in the high-complexity group, when we were aiming for 95%. The accuracy of
these sample sizes is graphed in Figure F. Even a sample size of 10000 gives fairly accurate results,
with a 0% to 2.4% range of error. Our 1000-sequence samples resulted in a 0.1% to 5% range of error,
which is very good considering that a 1000-sequence sample of H. sapiens, with its 8068722 ESTs,
covers only 0.012% of the full population and resulted in an estimate with only 1.8% error at the 95%
threshold.
After analyzing the previous data, we programmed getcomplex.pl. This is a tool for researchers to
analyze the complexity distributions of a user-supplied species and estimate the average complexity
and standard deviation of complexity. This command line program takes the following parameters:
<Number of ESTs to grab> <Number of blocks to break up sample size into> <Genus> <Species> <Whether
or not to output histogram data (y/n)>. Figure G shows a sample use of this program. If these command
line arguments are omitted, getcomplex.pl prompts the user for them. Samples of the various output
files are included in Figure H.
Our second program, sort.pl, is used to sort a FASTA file of EST sequences into high-complexity and
low-complexity files. It takes the following command line arguments: <Input FASTA file> <Average
complexity> <Standard deviation> <Number of standard deviations below average to set threshold at. If
omitted, use 95% confidence>. Figure I shows a sample use of this program. If these command line
arguments are omitted, sort.pl will also prompt the user for them. The output is in two FASTA files.
Both programs are more fully documented in their source code and are available at
http://psoda4.cs.byu.edu/~arkangel/sort.pl and http://psoda4.cs.byu.edu/~arkangel/getcomplex.pl. A
sample EST from both the high and low output files is included in Figure J.
Conclusion
We were able to prove all of the assumptions that were necessary to justify creating getcomplex.pl
and sort.pl. That is, we showed that the complexity of ESTs stored on GenBank follows a normal
distribution, that there are statistically significant differences between the average complexities
from species to species, and that a small sample can be used to give an accurate estimate for the
full collection of a species’ ESTs.
Along the way we discovered some interesting trends. Several species show abnormal behavior in their
distributions, specifically G. raimondii, which actually served as our motivation to undertake this
analysis. The distribution of complexities for G. raimondii was slightly bimodal, which was also seen
to a lesser extent in D. melanogaster. Other species showed abnormal tails. Investigation into G.
raimondii left us with few clues as to why this behavior is observed. All of the low-complexity
sequences we analyzed were from the same source, which would lead us to believe that there might be a
correlation, although this discovery is made less significant considering the relatively small number
of total ESTs on GenBank for this species and the high percentage of the total ESTs sequenced by the
source in question.
Our tools getcomplex.pl and sort.pl are effective in what they were designed to do. Both tools have
been programmed with versatility in mind to maximize their application. Although the difference in
average complexity between species is statistically significant, it is still unclear whether there is
enough variation to justify the use of getcomplex.pl for every new species considered, or whether a
more general threshold could be used with similar results. However, getcomplex.pl was programmed with
the ability to also download, store, and analyze the complexities for a species’ full collection of
ESTs rather than just a sample, to make it useful in further studies analyzing either the quality of
GenBank’s data for a species or the biological differences in complexity between species.
Considering our findings, there are several areas in which further study would be beneficial. The
issue of bimodal distributions and abnormal distribution behavior for several of the model species
could benefit from further research. Perhaps an investigation into the methods used to analyze the
ESTs for G. raimondii at the University of Arizona could be launched to identify whether this
abnormality was indeed a biological finding, or whether it was due to errors in the laboratory
processes. Also, the process of discarding low-complexity sequences using sort.pl requires
justification, and further research is needed to determine the reasons behind low-complexity
sequences. If a researcher is discarding large portions of low-complexity sequences, rather than
moving forward as planned, there should be investigations into the laboratory processes to make sure
that whatever may be the problem is rectified, to preserve the integrity of the sequences considered
high-complexity. Finally, some of the alleged high-complexity sequences showed marked bias in terms
of base composition, e.g. “ATATATATTTATATATATATA”. Our tools could be simply modified to do a more
general analysis of complexity which would also consider base composition. It would be interesting to
see if such an analysis would highlight other statistically significant trends, although there seems
to be much more work already devoted to this study.
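One way such a composition-aware measure might look is sketched below in Python. Shannon entropy of the base frequencies is used here only as an example of a composition measure. It is an assumption on our part and not something the tools currently compute, and the function names are illustrative.

```python
import math
from collections import Counter


def transition_complexity(seq):
    """Number of base-to-base changes divided by the sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(a != b for a, b in zip(seq, seq[1:])) / len(seq)


def composition_entropy(seq):
    """Shannon entropy of the base frequencies, in bits (at most 2 for A, C, G, T)."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For the sequence quoted above, the transition-based complexity is high because almost every base differs from its neighbor, while the entropy stays below 1 bit because only two of the four bases appear. A combined filter could require both values to clear their own thresholds.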
Bibliography
Knight, R. D., Freeland, S. J., & Landweber, L. F. (2001). Rewiring the Keyboard: Evolvability of the
Genetic Code. Nature Reviews: Genetics, 2, 49–58.
Malde, K., Schneeberger, K., Coward, E., & Jonassen, I. (2006). RBR: library-less repeat detection
for ESTs. Bioinformatics, 22(18), 2232–2236.
Appendix
Figure A.
[Histograms of EST complexity, one panel per species, with complexity bins from 0 to 0.95 on the
horizontal axis and counts on the vertical axis. Panels: Saccharomyces cerevisiae, Arabidopsis
thaliana, Bos taurus, Caenorhabditis elegans, Chlamydomonas reinhardtii, Drosophila melanogaster,
Gossypium hirsutum, Gossypium raimondii, Homo sapiens, Oryza sativa, Rattus norvegicus.]
Species          AVG          STDEV        n        sq(stdev)/n
S. cerevisiae    0.70354225   0.030872309  34915    2.72977E-08
A. thaliana      0.701177533  0.068968142  1526133  3.11677E-09
B. taurus        0.700498458  0.048474898  1517139  1.54885E-09
C. elegans       0.720108214  0.043045343  352043   5.26328E-09
C. reinhardtii   0.743186758  0.040043046  202044   7.93612E-09
D. melanogaster  0.719327599  0.049511605  573749   4.2726E-09
G. hirsutum      0.696788579  0.045301953  268775   7.63563E-09
G. raimondii     0.664671947  0.122276641  63577    2.35173E-07
H. sapiens       0.704307195  0.049948167  8068722  3.09196E-10
O. sativa        0.725641103  0.064738499  1220909  3.43275E-09
R. norvegicus    0.702131811  0.067169628  951375   4.74236E-09
Figure B.
This graph shows the bimodal character of G. raimondii’s ESTs on GenBank.
Figure C.
D. melanogaster also showed slightly abnormal bimodal behavior in its distribution.
Figure D.
Hypothesis testing comparing each species’ values to every other species. The yellow square serves as
a key for the center squares, while the upper right box is a key for the blue boxes. The highlighted
tan numbers are probability values that the two species’ means were the same and that any variation
is attributable to random chance. As seen, 3.35E-15 is the largest p-value and corresponds to over a
99.9999999999% probability that these populations have statistically different means.
Figure E.
Species          n        Statistic  Population   1000         10000        100000
A. thaliana      1526133  Avg        0.701177533  0.704953844  0.701818485  0.70012193
                          StDev      0.068968142  0.060268948  0.067891779  0.068756037
B. taurus        1517139  Avg        0.700498458  0.704789524  0.703444384  0.702401342
                          StDev      0.048474898  0.040374354  0.04451912   0.045552423
C. elegans       352043   Avg        0.720108214  0.718904761  0.719863962  0.719264242
                          StDev      0.043045343  0.044135786  0.041741025  0.043294473
C. reinhardtii   202044   Avg        0.743186758  0.745321006  0.742471104  0.742788144
                          StDev      0.040043046  0.038213564  0.042210503  0.040420094
D. melanogaster  573749   Avg        0.719327599  0.718491077  0.720603059  0.718887618
                          StDev      0.049511605  0.050877726  0.048422207  0.050121207
G. hirsutum      268775   Avg        0.696788579  0.695711555  0.696861365  0.697040056
                          StDev      0.045301953  0.04441335   0.045120819  0.044956093
G. raimondii     63577    Avg        0.664671947  0.666439442  0.664654113  0.664630431
                          StDev      0.122276641  0.12192608   0.122039799  0.122340832
H. sapiens       8068722  Avg        0.704307195  0.703982785  0.703725897  0.703115552
                          StDev      0.049948167  0.044982798  0.046844251  0.048067702
O. sativa        1220909  Avg        0.725641103  0.727830913  0.724950541  0.725200084
                          StDev      0.064738499  0.061083872  0.061916799  0.067409556
R. norvegicus    951375   Avg        0.702131811  0.700817867  0.70150445   0.70288693
                          StDev      0.067169628  0.068064136  0.068477837  0.065844582
S. cerevisiae    34915    Avg        0.70354225   0.702401286  0.703579347  0.70354411
                          StDev      0.030872309  0.031368752  0.030343204  0.030864195
This chart is a comparison of how accurate the estimates for average complexity and standard
deviation were. The sample sizes are 1000, 10000, and 100000 and are compared against the actual
values from the full population.
Figure F.
This chart shows that, assuming a normal distribution, estimating the threshold to include the top
95% of sequences by using 100000 samples results in at most about a 1.5% error and at best a
negligible error.
[Chart: cumulative percentage difference from threshold of full data set, on an axis from 0.0% to
6.0%, with series for 1000 Samples, 10000 Samples, and 100000 Samples.]
Figure G.
First couple of lines of output from getcomplex.pl with and without command line arguments.
Figure H.
File containing results of complexity analysis.
Portion of the histogram CSV created to analyze the distribution of complexities.
Portion of the CSV created with IDs and complexity values. This will be useful in trying to identify
abnormal behavior in the distributions.
Figure I.
Output from sort.pl with and without command line arguments.
Figure J.
Top sequence is a high complexity sequence sorted by sort.pl and the bottom is a low complexity
sequence from the output files. These are sequences from our test data set, based on an EST
collection for G. raimondii provided by Dr. Udall.