Project Report CheckPoint II - Filebox - Virginia Tech


CS5624 Course Project Proposal



Abstract

This is the proposal of Yating Wang, Yang Chen and Yen-Cheng Lu for
the course project of CS5624 “Introduction to Data Mining”. In this
document, we briefly introduce the background and motivation and
propose our scheme for this project, including the related work,
applied dataset, algorithm implementation, development tools,
visualization strategies, task assignment and schedule.



Index Terms

Bio-information, data mining, gene expression, data reduction,
visualization.


I. INTRODUCTION & MOTIVATION


BIOLOGICAL data mining brings us a great step closer to
understanding diseases that traditional medical knowledge can hardly
cure. By digging out patterns and relationships between one or more
categories of data, we can draw statistical and insightful
conclusions that contribute to disease control and treatment. We also
inherit the classic graphs of conventional data mining to showcase
complex connections between data. However, the overwhelming number of
data points and the large number of data dimensions are seldom fully
enumerated, because they overlap, are irregular and/or go unnoticed.
Consequently, it is essential to apply other ways of graphing
tremendous biological data sets.

Visualization, on the other hand, is a friendlier and more directly
understandable approach to illustrating the facts. Vivid colors
represent different entities; closer-to-realistic figures help mimic
their internal development; and flexible perspectives depict multiple
layers of correlation. Consequently, we believe its properties are
well suited to graphing multiplex biological data, in particular
omics-scale data, and will also benefit further data analysis in
specific domains.

Nevertheless, visualizing a large data set does not mean skipping
meticulous data analysis and dimensionality reduction for further
relationship study. We still need to apply conventional data mining
methods to the raw data set, discovering internal connections at
multiple scales. Therefore, we will use suitable data mining
approaches to study our selected data set, and illustrate the results
through an optimal visualization process, with refinement, in a
specifically designed user interface.

II. RELATED WORK

Gene expression measured with microarray techniques is usually
represented as an N*M matrix, where N is the number of genes and M is
the number of samples. In general N is particularly large, and, in
order to better enumerate the connections between a genome expression
and a bundle of genes, the sample number M is hardly small either.
The issue for data mining on this combination of genes and samples is
the curse of dimensionality. Specifically, not all the genes
appearing in the matrix are necessarily determinative for a certain
expression. Visualization of gene expression provides a fairly
effective and readable graph that lets professionals efficiently
analyze the most probable connections between a finite number of
genes and a protein expression. We can generalize the visualization
process in this area as, first, reducing the dimension to two or
three, and second, projecting the data onto a 2D or 3D graph.

Various works have been done in gene visualization, and we can
categorize them into three areas: plotting a set of 2D graphs [1, 2],
casting a 3D graph [3-6], and evaluating the effectiveness of
dimension reduction for visualization purposes [7-9].

A. 2D graph

Massanet et al. [1] analyze gene ontology interactions and then cast
the protein relational data onto an MDS plane so as to cluster
similar gene ontologies, defined by the semantic distance between two
biological products, in the same group. The advantage of this plot is
the simplicity of the visualization; however, it does not give a
detailed explanation of how to reduce the dimension.

New et al. also visualize the coexpression of gene data dynamically
[2]. The main feature of their work is a real-time, interactive tool
with which professionals can visualize gene data, so that potentially
relevant genes can be investigated from a subgraph. The obvious
advantage of this method is its real-time applicability, which makes
it usable for data streams. However, from the perspective of the
whole data set, we doubt its accuracy, since its analysis is based on
subgraphs and fuzzy classification rather than on the entire data set
and more exact logic. One output example from their work is shown in
Figure 1.

B. 3D graph

Tominski et al. [3] visualized gene combinations as shown in Figure
2. Their visualization procedure first finds the relevant gene
combinations with a filter and then visualizes them. Their filtering
algorithm comes from Drăghici’s work [10], and the visualization
panel is an extension of the heatmap. Using several modified
heatmaps, combinations of distinct numbers of genes are laid out in
different panels, and the color of each grid cell represents the
aggregated expression value. In particular, a darker color marks gene
combinations that are not regulated or less regulated, and a lighter
color marks more likely combinations. The benefit of these sets of
panels is improved correctness of gene interpretation. We can
conveniently overview the entire gene domain and efficiently find the
combinations for similar expressions. Nevertheless, given N genes,
there are $\sum_{i=2}^{n} \binom{N}{i}$ available combinations. The
closer n is to N, the more panels are presented, which is a rather
heavy workload for human viewers.

Figure 1: Candidate gene related to other genes’ expressions

Course Project Proposal of CS5624
Yating Wang, Yang Chen, and Yen-Cheng Lu, Student, Virginia Tech
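To give a feel for how quickly the number of heatmap panels grows, the combination count discussed above can be evaluated directly. This is only a minimal sketch: the function name `panel_count` is hypothetical, and reading the count as a plain sum of binomial coefficients over sizes 2 to n is our interpretation of the formula.

```python
from math import comb


def panel_count(num_genes: int, max_size: int) -> int:
    """Count gene combinations of size 2..max_size drawn from num_genes genes.

    Implements sum_{i=2}^{n} C(N, i) with N = num_genes and n = max_size.
    """
    if not 2 <= max_size <= num_genes:
        raise ValueError("max_size must satisfy 2 <= max_size <= num_genes")
    return sum(comb(num_genes, i) for i in range(2, max_size + 1))
```

For a data set with 2000 genes, `panel_count(2000, 2)` already gives 1,999,000 pairs, which illustrates why filtering is needed before visualization.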



Figure 2: ViGeCo (Visualization of Gene Combinations)


Weber et al. created PointCloudXplore (PCX) [4], a 3D visualization
tool for gene expression, which can show a physical overview
according to the connections between cells and display the
relationships among several genes. The tool works at a higher level,
visualizing genes specifically in embryo cells. Its main feature is
the 3D view, while 2D views assist in analyzing the 3D graphs. The
resulting plot is highly readable and easy to understand. An example
of this plot is shown in Figure 3.

Matte-Tailliez et al. [5] also implement 3D visualization for genome
exploration, based on dynamic graph visualization. Their work has two
key properties: first, it can visualize a huge data set, and second,
it can explore that data. Typically the visualized data is stored in
XML form and connected by binary relationships, which can be easily
evaluated. A screenshot of their work is shown in Figure 4.





Figure 3: PointCloudXplore



Figure 4: Screenshot of a particular zone of gene groups


C. Evaluation

Saraiya et al. [7] evaluate five microarray visualization tools
(Clusterview, TimeSearcher, HCE, Spotfire, and GeneSpring) on the
basis of the applicable data sets (Time-Series Data Set, Viral Data
Set, Lupus Data Set), the user interfaces, and an evaluation of the
effectiveness of the visualizations, also taking into account the
type of participant (Domain Expert, Domain Novice, and Software
Developer). Finally, the paper compares the five tools in terms of
Domain Value, Average Time of First Insight, Average of Total Time,
and Average of Final Amount.

Venna and Kaski compare visualization methods for gene expression
data sets [8], studying the possibility of visualizing a gene bank
and the dimension reduction methods suitable for gene data
visualization. Their results show that curvilinear component analysis
is more effective than the other dimensionality reduction algorithms.


APPENDIX

We have already found many helpful references and related works in
gene data visualization and identified the most effective methods of
dimensionality reduction for specific application areas. We also
understand the general visualization process and methodologies.


III. TOOLS


Because of the large size of gene expression profile data, one common
problem in visualization is how to present the data in a way that
allows easy viewing of different parts in detail and, at the same
time, provides a view from a distance that displays the overall
characteristics. In the years since the invention of microarray
technology, numerous software systems have been developed to address
this problem in different ways, such as Tree View, GeneSpring and
DecisionSite. One commonly used information visualization technique
is the display of genealogical trees. Combining genealogical tree
information visualization with 3D visualization of confocal
microscopy involves a collective set of technologies related to
computer graphics, image analysis and visualization.

One of the most important topics in the visualization of biological
data is the high dimensionality of the data sets. There are many
approaches to the curse of dimensionality, both linear and
non-linear. Among the linear dimensionality reduction algorithms,
Principal Component Analysis (PCA) performs a covariance analysis
between factors. PCA can reduce the data to two dimensions for
visualization. As such, it is suitable for data sets with many
dimensions, such as a large gene expression experiment. Among the
non-linear methods, manifold learning assumes that the data of
interest lie on a non-linear manifold embedded within the
higher-dimensional space. If the manifold is of low enough dimension,
the data can be visualized in the low-dimensional space. In our
experiment, we prefer these two approaches for dimensionality
reduction.


Figure 5: Principal Component Analysis


In this project, we choose the Java programming language and
Microsoft SQL Server 2005 as the database. Java is one of the most
practical programming languages, and it can easily be combined with C
and C++ applications. Because R tools are used with the majority of
biological data sets, we choose R to extract the microarray data and
pre-analyze them.

IV. DATA SET


In our project, we study the Colon Cancer microarray data from [11].
In this data package, 62 samples (40 tumor samples, 22 normal
samples) from colon-cancer patients were analyzed with an Affymetrix
oligonucleotide Hum6000 array. The expression set contains 2000 genes
and 62 samples.


Based on our basic analysis of the original data set, we will analyze
potential connections between specific genes or gene groups and
tumors, and then evaluate our estimates. Applying suitable
dimensionality reduction, we will visualize the results in readable
ways, displaying different scopes of the results to reveal internal
patterns. Eventually, after further refinement, we propose to present
the results in a user-friendly interface.



V. DIMENSIONALITY REDUCTION


Before we visualize the huge data set, we need to address the curse
of dimensionality. Otherwise, we would lose the majority of the
information in the data when displaying them in 2D or 3D figures.

There are many methods for dimensionality reduction of
high-dimensional data, both linear and non-linear.


A. Principal Component Analysis.


PCA is mathematically defined as an orthogonal linear transformation
that transforms the data to a new coordinate system such that the
greatest variance by any projection of the data comes to lie on the
first coordinate (called the first principal component), the second
greatest variance on the second coordinate, and so on. PCA is
theoretically the optimum transform for given data in least-squares
terms.
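The definition above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: it assumes the samples are given as rows of a list of lists, uses power iteration with deflation as one simple way to obtain the leading eigenvectors of the covariance matrix, and all function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-column mean so every feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_eigenpair(matrix, iterations=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    d = len(matrix)
    # Deterministic, non-uniform start vector (an arbitrary choice for this sketch).
    vec = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance left in any direction
        vec = [x / norm for x in nxt]
    value = sum(vec[a] * sum(matrix[a][b] * vec[b] for b in range(d))
                for a in range(d))
    return value, vec


def pca(rows, n_components=2):
    """Return (projected rows, principal components, variances along them)."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components, variances = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        variances.append(value)
        # Deflation: remove the direction just found before searching for the next.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, components, variances
```

Calling `pca(rows, 2)` on a samples-by-genes matrix yields 2D coordinates that could be fed to a scatter plot. For a matrix with thousands of genes a numerical library would be far faster; the pure-Python form is only meant to make the steps explicit.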

We use the basic PCA method on the colon data set, but we also want
to improve it. The basic technique works only on the matrix of
different genes and considers only the gene expression values; we
want to incorporate additional information when executing the PCA
technique.


Figure 6: Connection of different genes



The nodes f_i (i = 1, 2, 3, …, n) represent the genes of the data
set. We can optimize the objective function of PCA and thereby obtain
both an optimized PCA and the relationship network of the different
genes. We will consider the relations among relevant genes and place
related genes in the same cluster. This will help us find a more
effective combination of dimensions when reducing to low dimensions.


B. Locally-Linear Embedding.


Locally-Linear Embedding (LLE) is fast to optimize when implemented
to take advantage of sparse matrix algorithms, and it gives better
results on many problems. LLE begins by finding a set of the nearest
neighbors of each point. It then computes a set of weights for each
point that best describe the point as a linear combination of its
neighbors. Finally, it uses an eigenvector-based optimization
technique to find the low-dimensional embedding of points, such that
each point is still described with the same linear combination of its
neighbors. LLE tends to handle non-uniform sample densities poorly
because there is no fixed unit to prevent the weights from drifting
as various regions differ in sample densities. LLE has no internal
model.
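The weight-computation step of LLE described above can be sketched for a single point as follows. This is an illustrative sketch only: it assumes the neighbors have already been found, solves the usual sum-to-one constrained least-squares problem through the local Gram matrix, and adds a small regularization term (an assumption of this sketch, commonly used when the Gram matrix is singular). The names `solve_linear` and `lle_weights` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lle_weights(point, neighbors, reg=1e-3):
    """Weights (summing to 1) that best rebuild `point` from its neighbors."""
    k = len(neighbors)
    diffs = [[p - q for p, q in zip(point, nb)] for nb in neighbors]
    # Local Gram matrix G[i][j] = (point - nb_i) . (point - nb_j)
    gram = [[sum(x * y for x, y in zip(diffs[i], diffs[j])) for j in range(k)]
            for i in range(k)]
    trace = sum(gram[i][i] for i in range(k))
    eps = reg * trace if trace > 0 else reg  # regularization (sketch assumption)
    for i in range(k):
        gram[i][i] += eps
    w = solve_linear(gram, [1.0] * k)
    total = sum(w)
    return [x / total for x in w]
```

For a point lying midway between two neighbors, the two weights come out equal. The full algorithm would repeat this for every point and then solve an eigenvector problem for the embedding, which is omitted here.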


C. Isomap.


Isomap is a combination of the Floyd-Warshall algorithm with classic
Multidimensional Scaling. Classic Multidimensional Scaling (MDS)
takes a matrix of pair-wise distances between all points and computes
a position for each point. With non-linear dimensionality reduction
(NLDR) algorithms like Isomap, however, the pair-wise distances are
only known between neighboring points. Isomap therefore uses the
Floyd-Warshall algorithm to compute the pair-wise distances between
all of the other points. This effectively estimates the full matrix
of pair-wise geodesic distances between all of the points. Isomap
then uses classic MDS to compute the reduced-dimensional positions of
all the points.

Landmark-Isomap is a variant of this algorithm that uses landmarks to
increase speed, at the cost of some accuracy.
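The geodesic-distance step of Isomap described above can be sketched as below. The sketch assumes a symmetric k-nearest-neighbor graph with Euclidean edge lengths, which is one common way to define "neighboring points"; the function names are hypothetical, and the final MDS step is left out.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def neighbor_graph(points, k):
    """Distance matrix with finite entries only between k-nearest neighbors."""
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: euclidean(points[i], points[j]))
        for j in order[:k]:
            d = euclidean(points[i], points[j])
            dist[i][j] = d
            dist[j][i] = d  # keep the graph symmetric
    return dist


def floyd_warshall(dist):
    """All-pairs shortest paths; estimates geodesic distances on the graph."""
    n = len(dist)
    geo = [row[:] for row in dist]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                through = geo[i][m] + geo[m][j]
                if through < geo[i][j]:
                    geo[i][j] = through
    return geo
```

On points laid out along an L-shaped path, the estimated geodesic distance between the two ends follows the path rather than the straight line. Floyd-Warshall costs O(n^3) in the number of points, which is easily affordable for a data set with 62 samples.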


Our dimensionality reduction model consists of data pre-processing,
the kernel algorithm, and the inverse part of dimensionality
reduction.


Figure 7: Model of DR



VI. VISUALIZATION


Our team will develop Windows-based software for visualizing the
microarray experiment data. We have separated this part into two
tasks.


A. Graphical User Interface Design.


For user-friendliness, we will design this software in a friendly and
simple style. We will provide the basic functions of a microarray
data viewer, for example data loading, result exporting, data viewing
and analysis. Furthermore, it will be programmed in a multi-threaded
framework, so users will be able to view previous results while
another calculation is running. It will also allow users to load data
from different sources, even though we are focusing on the Colon
Cancer data set.


We have already designed a prototype user interface for displaying
the data and visualizations. The software will allow users to import
several different data sets into the same workspace. For each
imported data set, it will be possible to apply several different
dimension reduction algorithms. Users will then be able to choose
among different visualization tools to visualize the preprocessed
data.



Figure 8: Prototype software



We propose the following flow chart of our software structure:


Figure 9: Architecture flow chart of the software




We also propose the specific software runtime data flow shown in
Fig. 10.



Figure 10: Runtime software data flow design.


B. Microarray experiment data visualization


In terms of data visualization, we will implement an appropriate
visualization method specifically designed for the Colon Cancer
microarray data set. It will combine scatter plots, heat maps, curve
diagrams or other better solutions, and will be implemented in a 2D
or 3D view.


VII. TASK ASSIGNMENT


In this project, we will follow the MVC model to develop the
application. Yating will manage the data analysis tasks, while Yang
will focus on data reduction and data mining algorithms. Yen-Cheng
will take on the tasks of visualization and user interface.

VIII. SCHEDULE

Here is our schedule:


Table 1: Schedule

Date | Objective
09/16/2010 | Submit proposal
09/20/2010 | Finish the overview and background knowledge. Start the design of the application
09/27/2010 | Finish the design and give the design in detail
10/01/2010 | According to the module in design, start coding.
10/14/2010 | Semi-semester report
10/31/2010 | Make sure major part of application finished
11/10/2010 | Finish the development.
11/11/2010 | Progress report
11/12/2010 | Design some advanced function in application
11/19/2010 | Finish the whole project and start to summary
11/26/2010 | Finish the demo and the report of project
12/02/2010 | Final report and present in class


REFERENCES



[1] R. C. Massanet, P. and Perera, A., "Use of Gene Ontology semantic
information in protein interaction data visualization,"
BioInformatics and BioEngineering, 2008. BIBE 2008. 8th IEEE
International Conference on, pp. 1-5, 2008.

[2] J. New, Kendall, W. Huang, J. and Chesler E., "Dynamic
Visualization of Coexpression in Systems Genetics Data," IEEE
Transactions on Visualization and Computer Graphics, vol. 14, pp.
1081-1093, September/October 2008.

[3] C. Tominski, and Schumann, H., "Visualization of Gene
Combinations," presented at the International Conference Information
Visualisation, London, UK, 2008.

[4] G. R. Weber, O. Huang, M-Y. DePace, A. Fowlkes, C. Keränen, S.
Hendriks, C. Hagen, H. Knowles, D. Malik J. Biggin, M. and Hamann B.,
"Visual Exploration of Three-Dimensional Gene Expression Using
Physical Views and Linked Abstract Views," IEEE/ACM Transactions on
Computational Biology and Bioinformatics, vol. 6, pp. 296-309, 2009.

[5] O. T.-N. Matte-Tailliez, C.; Ferey, N.; Kepes, F.; Ghebi, R.,
"Immersive Visualization for Genome Exploration and Analysis,"
presented at the Information and Communication Technologies,
Damascus, Syria, 2006.

[6] N. I. Hosoyama, H., "3-D Visualization of a Gene Regulatory
Network: Stochastic Search for Layouts," presented at the
Evolutionary Computation 2003, CEC '03. The 2003 Congress on,
Canberra, Australia, 2003.

[7] P. N. Saraiya, C. and Duca, K., "An Insight-Based Methodology for
Evaluating Bioinformatics Visualizations," IEEE Transactions on
Visualization and Computer Graphics, vol. 11, pp. 443-456, 2005.

[8] J. Venna, and Kaski, S, "Comparison of visualization methods for
an atlas of gene expression data sets," Helsinki University of
Technology, Helsinki, paper.

[9] J. L. Shi, Z., "Analysis and Visualization of Gene Expression
Data via a Framework of Geometric Representation," in The 1st
International Conference on Information Science and Engineering
(ICISE2009), Chennai, India, 2009, pp. 3596-3599.

[10] Drăghici. (2003) Data Analysis Tools for DNA Microarrays.
Chapman & Hall/CRC.

[11] S. Merk. (1999). exprSet for Alon et al. (1999) colon cancer
data. Available:
http://www.bioconductor.org/packages/2.6/data/experiment/html/colonCA.html




Appendix


Progress Table

Task | Comments | Status

Yen-Cheng Lu
Create project in NetBeans Java IDE. | | Completed.
Design the prototype bio visualization software. | | Completed.
Design the data flow structure of software in runtime environment. | | Completed.
Design and program the main user interface. | | Completed.
Program the visualization manager in Java. | | Ongoing.
Set up the connection from main frame to visualization manager and algorithm manager. | Set up the program to call functions from algorithm set to process the data reduction and send the result to visualization manager to display. | Ongoing.
Design and implement a novel visualization tool. | Studying bio informatics papers to gain knowledge for improving visualization tools. | Ongoing.
Integrate the program. | | On schedule.
Test and debug for the software. | | On schedule.
Package the whole software. | | On schedule.

Yang Chen
Study PCA algorithm | | Completed.
Install environment for project | NetBeans, Java JDK, R, JRI | Completed.
Design the prototype for dimensionality reduction | | Completed.
Program the algorithm part in Java | | Ongoing
Find the relationship of R and Java | | Ongoing
Improve the PCA algorithm | | On schedule.
Use more algorithm for reduction | LLE and Isomap, and so on | On schedule.
Test and debug for that part | | On schedule.

Yating Wang