CS5624 Course Project Proposal
Abstract: This is the proposal of Yating Wang, Yang Chen, and Yen-Cheng Lu for the course project of CS5624 “Introduction to Data Mining”. In this document, we briefly introduce the background and motivation and propose our plan for the project, including related work, the data set, algorithm implementation, development tools, visualization strategies, task assignment, and schedule.
Index Terms: Bioinformatics, data mining, gene expression, data reduction, visualization.
I. INTRODUCTION & MOTIVATION
BIOLOGICAL data mining brings us a great step closer to understanding diseases that are hard to cure with traditional medical knowledge. By uncovering patterns and relationships within and across categories of data, we can draw statistical and insightful conclusions that contribute to disease control and treatment. Conventional data mining also offers classic graphs for showing complex connections between data. However, the overwhelming number of data points and the large number of data dimensions can seldom be displayed in full, because the data overlap, are irregular, and/or go unnoticed. Consequently, other ways of graphing very large biological data sets are essential.
Visualization, on the other hand, is a friendlier and more directly understandable way to illustrate the facts. Vivid colors represent different entities; close-to-realistic figures help mimic their internal development; and flexible perspectives depict multiple layers of correlation. We therefore believe that visualization is well suited to graphing multiplex biological data, in particular omics-scale data, and that it will also benefit further data analysis in specific domains.
Nevertheless, visualizing a large data set does not mean skipping meticulous data analysis and dimensionality reduction for further relationship study. We still need to apply conventional data mining methods to the raw data set to discover internal connections at multiple scales. Therefore, we will use appropriate data mining approaches to study our selected data set and present the results through a refined visualization process in a specifically designed user interface.
II. RELATED WORK
Gene expression data from microarray experiments usually take the form of an N×M matrix, where N is the number of genes and M is the number of samples. In general N is very large, and to better capture the connection between a genome expression and a group of genes, the number of samples M can hardly be small. The main issue in mining this combination of genes and samples is the curse of dimensionality. Specifically, not all genes in the matrix are necessarily determinative for a given expression. Visualization of gene expression provides a fairly effective and readable graph that lets professionals efficiently analyze the most likely connections between a finite number of genes and a protein expression. The visualization process in this area can be generalized into two steps: first reducing the dimensionality to two or three, and then projecting the data onto a 2D or 3D graph.
Various works have been done in gene visualization, and we categorize them into three areas: plotting a set of 2D graphs [1, 2], casting a 3D graph [3-6], and evaluating the effectiveness of dimension reduction for visualization purposes [7-9].
A. 2D graph
Massanet et al. [1] analyze gene ontology interactions and then cast the protein relational data onto an MDS plane so that similar gene ontologies, defined by the semantic distance between two biological products, are clustered in the same group. The advantage of this plot is the simplicity of the visualization; however, the work does not explain in detail how the dimensionality is reduced.
New et al. also visualize the coexpression of gene data dynamically [2]. The main feature of their work is a real-time, interactive tool that lets professionals visualize gene data so that potentially relevant genes can be investigated from a subgraph. The obvious advantage of this method is that it works in real time and is therefore usable on data streams. From the perspective of the entire data set, however, we doubt its accuracy, since its analysis is based on subgraphs and fuzzy classification rather than on the entire data set and more exact logic. One output example from their work is shown in Figure 1.
Figure 1: Candidate gene related to other genes' expressions
Course Project Proposal of CS5624
Yating Wang, Yang Chen, and Yen-Cheng Lu, Student, Virginia Tech
B. 3D graph
Tominski et al. [3] visualized gene combinations, as shown in Figure 2. Their procedure first filters for relevant gene combinations and then visualizes them. The filtering algorithm comes from Drăghici's work [10], and the visualization panel extends the heatmap. Using several modified heatmaps, combinations of different numbers of genes are shown in separate panels, and the color of each grid cell represents the aggregated expression value. In particular, darker colors indicate unregulated or less regulated gene combinations, and lighter colors indicate the more likely combinations. The benefit of these sets of panels is improved correctness of gene interpretation: we can conveniently survey the entire gene domain and efficiently find the combinations with similar expressions. Nevertheless, given N genes and combinations of sizes i = 2, 3, 4, …, n, there are $\sum_{i=2}^{n} \binom{N}{i}$ available combinations. The closer n is to N, the more panels are presented, which is a heavy workload for a human viewer.
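To give a sense of scale for the combination count above, the following minimal sketch evaluates the sum of binomial coefficients for a 2000-gene data set such as the one used in this project. The function name `combination_count` and the chosen values of n are our own illustrative assumptions; they are not taken from Tominski et al.

```python
from math import comb

def combination_count(total_genes, max_size):
    """Number of gene combinations of size 2..max_size drawn from total_genes."""
    return sum(comb(total_genes, size) for size in range(2, max_size + 1))

# Illustrative values only: N = 2000 genes, combinations of up to 2, 3, and 4 genes.
for max_size in (2, 3, 4):
    print(max_size, combination_count(2000, max_size))
```

Pairs alone already give 1,999,000 combinations for N = 2000, which illustrates why inspecting every panel by eye quickly becomes impractical.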
Figure 2: ViGeCo (Visualization of Gene Combinations)
Weber et al. created PointCloudXplore (PCX) [4], a 3D visualization tool for gene expression that can show a physical overview based on the connections between cells and display the relationships among several genes. The tool works at a higher level and visualizes genes specifically for embryo cells. Its main feature is the 3D view, complemented by 2D views that support analysis of the 3D graphs. The resulting plot is highly readable and easy to understand. An example of this plot is shown in Figure 3.
Matte-Tailliez et al. [5] also implement 3D visualization for gene exploration, based on dynamic graph visualization. Their work has two key properties: it can visualize a huge data set, and it lets the user explore that data. Typically, the visualized data are stored in XML form and connected by binary relationships, which can be easily evaluated. A screenshot of their work is shown in Figure 4.
Figure 3: PointCloudXplore
Figure 4: Screenshot of a particular zone of gene groups
C. Evaluation
Saraiya et al. [7] evaluate five microarray visualization tools (Clusterview, TimeSearcher, HCE, Spotfire, and GeneSpring) on three data sets (a time-series data set, a viral data set, and a lupus data set), considering their user interfaces and the effectiveness of their visualizations. The study also accounts for three types of participants: domain experts, domain novices, and software developers. Finally, it compares the five tools in terms of domain value, average time to first insight, average total time, and average final amount.
Venna and Kaski compare visualization methods for gene expression data sets [8]. Their work studies the possibility of visualizing a gene bank and looks for suitable dimensionality reduction methods for gene data visualization. Their results show that curvilinear component analysis is more effective than the other dimensionality reduction algorithms compared.
APPENDIX
We have already found many helpful references and related works on gene data visualization and identified the most effective methods of dimensionality reduction for this application area. We also understand the general visualization process and methodologies.
III. TOOLS
Because gene expression profile data are large, a common problem in visualization is how to present the data so that different parts can be viewed easily in detail while a view from a distance still shows the overall characteristics. In the years since the invention of microarray technology, numerous software systems have been developed to address this problem in different ways, such as Tree View, GeneSpring, and DecisionSite. One commonly used information visualization technique is the display of genealogical trees. Combining genealogical tree visualization with 3D visualization of confocal microscopy involves a collective set of technologies from computer graphics, image analysis, and visualization.
One of the most important issues in biological visualization is the high dimensionality of the data sets. There are many approaches to the curse of dimensionality, both linear and non-linear. Among the linear algorithms, Principal Component Analysis (PCA) performs a covariance analysis between factors. PCA can reduce the data to two dimensions for visualization. As such, it is suitable for data sets with many dimensions, such as a large gene expression experiment. Among the non-linear methods, manifold learning assumes that the data of interest lie on a non-linear manifold embedded in the higher-dimensional space. If the manifold is of low enough dimension, the data can be visualized in the low-dimensional space. In our experiment, we prefer these two kinds of algorithms for dimensionality reduction.
Figure 5: Principal Component Analysis
In this project, we choose the Java programming language and Microsoft SQL Server 2005 as the database. Java is one of the most practical programming languages, and it can be easily combined with C and C++ applications. Because R tools support the majority of biological data sets, we choose R to extract the microarray data and pre-analyze them.
IV. DATA SET
In our project, we study the colon cancer microarray data from [11]. The data package contains 62 samples (40 tumor samples and 22 normal samples) from colon cancer patients, analyzed with an Affymetrix oligonucleotide Hum6000 array. The expression set has 2000 genes and 62 samples.
Based on our basic analysis of the original data set, we will analyze potential connections between specific genes or gene groups and tumors, and then evaluate our estimates. After applying appropriate dimensionality reduction, we will visualize the results in readable forms, displaying different scopes of the results to show internal patterns. After further refinement, we propose to present the results in a user-friendly interface.
V. DIMENSIONALITY REDUCTION
Before we visualize the huge data set, we need to address the curse of dimensionality. Otherwise, we would lose the majority of the information in the data when we display them in 2D or 3D figures.
There are many methods for dimensionality reduction of high-dimensional data, including linear and non-linear approaches.
A. Principal Component Analysis
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is theoretically the optimum linear transform for given data in least-squares terms.
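The definition above can be illustrated with a minimal sketch of basic PCA, written in Python with the standard library only (the project itself targets Java and R). It centers the data, builds the sample covariance matrix, and extracts the leading principal components by power iteration with deflation. The function names, the toy `samples` matrix (rows are samples, columns are features), and the iteration count are illustrative assumptions; this is not the improved PCA variant proposed in this section, and a pure-Python covariance matrix would be too slow for the full 2000-gene data set.

```python
import random

def center(rows):
    """Subtract the per-column mean so every feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def mat_vec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v

def pca(rows, k=2):
    """Return the k leading (variance, direction) pairs and the projected data."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(k):
        eigenvalue, vector = top_eigenpair(cov)
        components.append((eigenvalue, vector))
        # Deflation: remove the found direction so the next call finds the next one.
        size = len(cov)
        cov = [[cov[i][j] - eigenvalue * vector[i] * vector[j]
                for j in range(size)] for i in range(size)]
    projected = [[sum(a * b for a, b in zip(row, vec)) for _, vec in components]
                 for row in centered]
    return components, projected

# Toy data (made up for illustration): 5 samples, 3 features.
samples = [[2.0, 1.9, 0.5], [0.0, 0.1, 0.4], [-2.0, -2.1, 0.6],
           [1.0, 1.2, 0.5], [-1.0, -1.1, 0.5]]
components, projected = pca(samples, k=2)
```

Each row of `projected` is a 2D coordinate that could be passed to a scatter plot; the first coordinate carries the largest share of the variance.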
We will use the basic PCA method on the colon data set, but we also want to improve it. The basic technique relies only on the gene matrix and considers only the gene expression values; we want to add further information when applying PCA.
Figure 6: Connection of different genes
The nodes f_i (i = 1, 2, 3, …, n) represent the genes of the data set. By optimizing the objective function of PCA, we can obtain both an optimized PCA and the relationship network of the genes. We will take the relations among relevant genes into account and group related genes into the same cluster. This will help us find a more effective combination of dimensions for the low-dimensional representation.
B. Locally-Linear Embedding
Locally-Linear Embedding (LLE) is a fast optimization algorithm when implemented to take advantage of sparse matrix algorithms, and it gives better results on many problems. LLE begins by finding a set of nearest neighbors for each point. It then computes a set of weights for each point that best describe the point as a linear combination of its neighbors. Finally, it uses an eigenvector-based optimization technique to find the low-dimensional embedding of the points, such that each point is still described by the same linear combination of its neighbors. LLE tends to handle non-uniform sample densities poorly because there is no fixed unit to prevent the weights from drifting as various regions differ in sample density. LLE has no internal model.
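As a rough illustration of the first two LLE steps described above (neighbor search and reconstruction weights), the sketch below computes, for one point, the weights that best reconstruct it from its k nearest neighbors. It covers only those two steps and omits the final eigenvector-based embedding. The function names, the regularization constant, and the toy points are our own assumptions; the sketch also assumes Python 3.8+ (for `math.dist`) and no duplicate points.

```python
import math

def nearest_neighbors(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]

def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * solution[c] for c in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def reconstruction_weights(points, index, k, regularization=1e-3):
    """Weights (summing to 1) that best rebuild points[index] from its k neighbors."""
    neighbors = nearest_neighbors(points, index, k)
    diffs = [[a - b for a, b in zip(points[index], points[j])] for j in neighbors]
    # Local Gram matrix of the differences between the point and each neighbor.
    gram = [[sum(a * b for a, b in zip(di, dj)) for dj in diffs] for di in diffs]
    trace = sum(gram[i][i] for i in range(k))
    for i in range(k):
        gram[i][i] += regularization * trace  # keeps the system well conditioned
    weights = solve(gram, [1.0] * k)
    total = sum(weights)
    return neighbors, [w / total for w in weights]

# Toy example: the second point is the midpoint of the first and third.
points = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
neighbors, weights = reconstruction_weights(points, index=1, k=2)
```

In this toy case the midpoint is reconstructed from its two neighbors with equal weights, which is the behavior the weight step is meant to capture.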
C. Isomap
Isomap is a combination of the Floyd-Warshall algorithm with classic Multidimensional Scaling (MDS). Classic MDS takes a matrix of pairwise distances between all points and computes a position for each point. With non-linear dimensionality reduction (NLDR) algorithms like Isomap, however, the pairwise distances are known only between neighboring points. Isomap therefore uses the Floyd-Warshall algorithm to compute the pairwise distances between all of the other points. This effectively estimates the full matrix of pairwise geodesic distances between all of the points. Isomap then uses classic MDS to compute the reduced-dimensional positions of all the points.
Landmark-Isomap is a variant of this algorithm that uses landmarks to increase speed, at the cost of some accuracy.
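The geodesic-distance step described above can be sketched as follows: build a k-nearest-neighbor graph, then run Floyd-Warshall to fill in the distances between non-neighboring points. The classic MDS step that would follow is omitted here. The function names, the value of k, and the toy L-shaped point set are illustrative assumptions, and `math.dist` requires Python 3.8+.

```python
import math

def knn_graph(points, k):
    """Symmetric distance matrix with edges only between k-nearest neighbors."""
    n = len(points)
    graph = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            d = math.dist(points[i], points[j])
            graph[i][j] = d
            graph[j][i] = d
    return graph

def floyd_warshall(graph):
    """All-pairs shortest path lengths; estimates geodesic distances on the graph."""
    n = len(graph)
    dist = [row[:] for row in graph]
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                through = dist[i][mid] + dist[mid][j]
                if through < dist[i][j]:
                    dist[i][j] = through
    return dist

# Toy data: five points along an L-shaped path.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
geodesic = floyd_warshall(knn_graph(points, k=2))
print(geodesic[0][4], math.dist(points[0], points[4]))
```

For the two ends of the L, the graph distance is 4.0 while the straight-line distance is about 2.83, which shows how the geodesic estimate follows the shape of the data rather than cutting across it.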
Our dimensionality reduction model consists of data pre-processing, a kernel algorithm, and the inverse part of dimensionality reduction.
Figure 7: Model of dimensionality reduction (DR)
VI. VISUALIZATION
Our team will develop Windows-based software for visualizing the microarray experiment data. We have separated this part into two tasks.
A. Graphical User Interface Design
To keep the software user-friendly, we will design it in a friendly and simple style. As a microarray data viewer, the software will provide basic functions such as data loading, result exporting, and data viewing and analysis. Furthermore, it will be built on a multi-threaded framework, so users will be able to view previous results while another calculation is running. It will also allow users to load data from different sources, even though we are focusing on the colon cancer data set.
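The multi-threaded idea above can be sketched in a few lines: a long-running reduction is submitted to a worker thread while the main (user interface) thread stays free to show earlier results. The sketch uses Python's standard `concurrent.futures` module purely for illustration, whereas the project itself targets Java; `slow_reduction` and the sample values are hypothetical stand-ins, not part of the planned software.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_reduction(values):
    """Hypothetical stand-in for a long-running dimensionality reduction."""
    time.sleep(0.2)
    return [v * 0.5 for v in values]

previous_result = [1.0, 2.0, 3.0]

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(slow_reduction, [4.0, 6.0])
    # While the worker computes, the main thread can still serve the old result.
    shown_while_waiting = list(previous_result)
    # Block only at the point where the new result is actually needed.
    new_result = future.result()
```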
We have already designed a prototype user interface for displaying the data and visualizations. The software will allow users to import several different data sets into the same workspace. For each imported data set, users will be able to apply several different dimension reduction algorithms. Then, users will be able to choose among different visualization tools to visualize the preprocessed data.
Figure 8: Prototype software
We propose the following flow chart of our software structure:
Figure 9: Architecture flow chart of the software
We also propose the specific runtime data flow of the software, shown in Figure 10.
Figure 10: Runtime software data flow design
B. Microarray experiment data visualization
For data visualization, we will implement an appropriate visualization method designed specifically for the colon cancer microarray data set. It will combine scatter plots, heat maps, curve diagrams, or other better solutions, and it will be implemented in a 2D or 3D view.
VII. TASK ASSIGNMENT
In this project, we will follow the MVC (model-view-controller) pattern to develop the application. Yating will manage the data analysis tasks, while Yang will focus on data reduction and data mining algorithms. Yen-Cheng will take on the visualization and user interface tasks.
VIII. SCHEDULE
Here is our schedule:
Table 1: Schedule
Date        Objective
09/16/2010  Submit proposal
09/20/2010  Finish the overview and background knowledge. Start the design of the application
09/27/2010  Finish the design and describe it in detail
10/01/2010  Start coding according to the modules in the design
10/14/2010  Mid-semester report
10/31/2010  Make sure the major part of the application is finished
11/10/2010  Finish the development
11/11/2010  Progress report
11/12/2010  Design some advanced functions in the application
11/19/2010  Finish the whole project and start the summary
11/26/2010  Finish the demo and the project report
12/02/2010  Final report and in-class presentation
REFERENCES
[1] R. C. Massanet, P. and Perera, A., "Use of Gene Ontology semantic information in protein interaction data visualization," BioInformatics and BioEngineering, 2008. BIBE 2008. 8th IEEE International Conference on, pp. 1-5, 2008.
[2] J. New, Kendall, W. Huang, J. and Chesler E., "Dynamic Visualization of Coexpression in Systems Genetics Data," IEEE Transactions on Visualization and Computer Graphics, vol. 14, pp. 1081-1093, September/October 2008.
[3] C. Tominski, and Schumann, H., "Visualization of Gene Combinations," presented at the International Conference Information Visualisation, London, UK, 2008.
[4] G. R. Weber, O. Huang, M-Y. DePace, A. Fowlkes, C. Keränen, S. Hendriks, C. Hagen, H. Knowles, D. Malik J. Biggin, M. and Hamann B., "Visual Exploration of Three-Dimensional Gene Expression Using Physical Views and Linked Abstract Views," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, pp. 296-309, 2009.
[5] O. T.-N. Matte-Tailliez, C.; Ferey, N.; Kepes, F.; Ghebi, R.;, "Immersive Visualization for Genome Exploration and Analysis," presented at the Information and Communication Technologies, Damascus, Syria, 2006.
[6] N. I. Hosoyama, H.;, "3-D Visualization of a Gene Regulatory Network: Stochastic Search for Layouts," presented at the Evolutionary Computation 2003, CEC '03. The 2003 Congress on, Canberra, Australia, 2003.
[7] P. N. Saraiya, C. and Duca, K., "An Insight-Based Methodology for Evaluating Bioinformatics Visualizations," IEEE Transactions on Visualization and Computer Graphics, vol. 11, pp. 443-456, 2005.
[8] J. Venna, and Kaski, S, "Comparison of visualization methods for an atlas of gene expression data sets," Helsinki University of Technology, Helsinki, paper.
[9] J. L. Shi, Z,;, "Analysis and Visualization of Gene Expression Data via a Framework of Geometric Representation," in The 1st International Conference on Information Science and Engineering (ICISE2009), Chennai, India, 2009, pp. 3596-3599.
[10] Drăghici. (2003) Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC.
[11] S. Merk. (1999). exprSet for Alon et al. (1999) colon cancer data. Available: http://www.bioconductor.org/packages/2.6/data/experiment/html/colonCA.html
Appendix: Progress Table
Task | Comments | Status
Yen-Cheng Lu
Create project in NetBeans Java IDE. | | Completed.
Design the prototype bio visualization software. | | Completed.
Design the data flow structure of the software in the runtime environment. | | Completed.
Design and program the main user interface. | | Completed.
Program the visualization manager in Java. | | Ongoing.
Set up the connection from the main frame to the visualization manager and algorithm manager. | Set up the program to call functions from the algorithm set to process the data reduction and send the result to the visualization manager for display. | Ongoing.
Design and implement a novel visualization tool. | Studying bioinformatics papers to gain knowledge for improving visualization tools. | Ongoing.
Integrate the program. | | On schedule.
Test and debug the software. | | On schedule.
Package the whole software. | | On schedule.
Yang Chen
Study the PCA algorithm. | | Completed.
Install the environment for the project. | NetBeans, Java JDK, R, JRI | Completed.
Design the prototype for dimensionality reduction. | | Completed.
Program the algorithm part in Java. | | Ongoing.
Find the relationship between R and Java. | | Ongoing.
Improve the PCA algorithm. | | On schedule.
Use more algorithms for reduction. | LLE and Isomap, and so on | On schedule.
Test and debug that part. | | On schedule.
Yating Wang