Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, and Bart De Moor



Presented by Cindy Burklow

CS 685: Special Topics in Data Mining

Professor Dr. Jinze Liu

University of Kentucky

April 17th, 2008


Outline


Introduction


Motivation


Related Work


Proposed Models


Proposed Algorithms


Results: Hybrid & Dynamic Clustering


Discussion of Pros and Cons


Questions


References

Introduction


Bioinformatics …


Computer Science


Information Technology


Solves problems in Biomedicine



Goal of Paper:
Investigate


Cognitive structure


Dynamics of bioinformatics core


Sub-disciplines


ISI Web of Science & MEDLINE


Retrieval of core literature in bioinformatics

MeSH = Medical Subject Headings

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.

Motivation


Bioinformatics field …


Dynamic


Evolving discipline


Fast growth rate


Monitor current trends


Predict future direction


Decision Making


Grants


Business Ventures


Research Opportunities


Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Related Work


Web mining


Bibliometrics


Text mining & citation analysis


Mapping of knowledge


Charting science & technology fields


Textual & graph-based approaches


Different perceptions of similarity between documents or groups of documents



Related Work

Establishing the Data Set


Patra & Mishra


Bibliometric Study


MeSH term based


Liberal delineation strategy with maximal recall


Broader interpretation of bioinformatics


Less restricted search strategy


Broader coverage of underlying database


14,563 journal papers

Related Work


Hybrid Clustering


He


Unsupervised spectral clustering of web pages


Wang & Kitsuregawa


Contents-link coupled clustering algorithm for web pages


Dynamic hybrid clustering


Mei & Zhai


Temporal Text Mining


Kullback-Leibler divergence for coherent themes & Hidden Markov Models


Griffiths & Steyvers


Latent Dirichlet Allocation with hot topics in PNAS abstracts


Models: Data Set

Bibliometric Retrieval Strategy


Novel subject delineation strategy


Retrieve core literature


Combines textual components & bibliometric, citation-based techniques


Web of Science Edition of Thomson Scientific


7401 bioinformatics-related papers


1981 to 2004


Titles, abstracts, author keywords, and MeSH terms

Models


Text Analysis


All text was indexed with the Jakarta Lucene platform


Encoded in the Vector Space Model using the TF-IDF weighting scheme


Text-based similarities


Cosine of the angle between the vector representations of two papers


Stop words removed during indexing


Porter Stemmer


All remaining terms from titles and abstracts


Bigrams


Candidate list of MeSH descriptors, author keywords, and noun phrases


Latent Semantic Indexing (LSI)


10 terms
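
The text-analysis steps above (TF-IDF weighting, then the cosine of the angle between two paper vectors) can be illustrated with a minimal sketch. It assumes documents are already tokenized and stemmed, and it uses a plain tf * log(N/df) weight, which is not necessarily the variant Lucene applies. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: weight = raw term frequency * log(N / df).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy, already stemmed documents (hypothetical data).
docs = [
    ["protein", "structur", "predict"],
    ["protein", "fold", "predict"],
    ["microarray", "gene", "express"],
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # papers sharing two terms
sim_02 = cosine(vecs[0], vecs[2])  # papers sharing no terms
```

Papers with overlapping vocabulary get a similarity between 0 and 1, while papers with no shared terms get 0.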


Models


Citation Analysis


Citation Graphs


Link-based algorithms


HITS


PageRank

Representative Publications

Text-based

Co-citation

Citation-based

Documents

QUANTIFY SIMILARITIES

Boolean Input Vectors

Cosine

Bibliographic coupling (BC)

Image Reference: Google Logo from http://www.google.com
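
As a rough illustration of quantifying citation-based similarity with Boolean input vectors and the cosine measure, bibliographic coupling can be sketched as below. Each paper is represented by the set of references it cites, which is equivalent to a Boolean vector. The function name and the reference identifiers are hypothetical.

```python
import math

def bc_cosine(refs_a, refs_b):
    """Cosine between two Boolean reference vectors.

    For 0/1 vectors the dot product is the number of shared references and
    each norm is the square root of the reference count.
    """
    if not refs_a or not refs_b:
        return 0.0
    shared = len(refs_a & refs_b)
    return shared / math.sqrt(len(refs_a) * len(refs_b))

# Hypothetical reference lists for three papers.
references = {
    "paper_A": {"r1", "r2", "r3", "r4"},
    "paper_B": {"r3", "r4", "r5", "r6"},
    "paper_C": {"r7"},
}
bc_ab = bc_cosine(references["paper_A"], references["paper_B"])
bc_ac = bc_cosine(references["paper_A"], references["paper_C"])
```

Here `paper_A` and `paper_B` share two of their four references each, so the coupling strength is 2 / sqrt(4 * 4) = 0.5; `paper_A` and `paper_C` share none.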

Models


Clustering


Agglomerative Hierarchical Clustering Algorithm with Ward’s Method


Hard Clustering Algorithm:


Every publication is assigned to exactly 1 cluster.

Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
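
To make the idea of agglomerative clustering with Ward's method concrete, here is a deliberately naive sketch. It assumes the items are points in Euclidean space and repeatedly merges the two clusters whose merger increases the within-cluster sum of squares the least. The name `ward_cluster` and the toy points are hypothetical, and a real study would use an optimized library routine instead.

```python
def ward_cluster(points, k):
    """Greedy Ward clustering; returns k lists of point indices (hard clustering)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    dim = len(points[0])
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Ward criterion: increase in within-cluster sum of squares on merging a and b.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > k:
        # Find the cheapest pair to merge (ties broken by index order).
        _, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated toy groups (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (4.9, 5.2)]
result = ward_cluster(points, 2)
```

Because clusters are only ever merged, every point ends up in exactly one cluster, which is the hard-clustering property stated on the slide.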

Models


Clustering

Optimal number of clusters

Combine Distance-based & Stability-based Methods Strategy


Dendrogram observation

Silhouette Curves: Mean text-based and citation-based

Stability Diagram

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
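
A silhouette curve plots the mean silhouette value against the number of clusters. The sketch below shows how one such value could be computed from a distance matrix and a cluster assignment; it assumes at least two clusters and is only an illustration of the standard silhouette definition. The function name and the toy data are hypothetical.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette value for a partition with at least two clusters.

    dist is a symmetric distance matrix; labels[i] is the cluster of item i.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the item's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Toy 1-D data with two obvious groups (hypothetical).
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(x - y) for y in xs] for x in xs]
good = mean_silhouette(dist, [0, 0, 0, 1, 1, 1])
bad = mean_silhouette(dist, [0, 1, 0, 1, 0, 1])
```

Evaluating this value for each candidate number of clusters, and looking for a peak, is one way to read such a curve alongside the dendrogram and the stability diagram.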

Proposed Algorithm


Hybrid Clustering


Cluster Input: Distances


Combining text mining and bibliometrics


Integrate text & citation info early in the mapping process, before applying the clustering algorithm


Weighted linear combination



Fisher’s inverse chi-square method


Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.
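
The two integration options named above can be sketched in a few lines. The first function is a weighted linear combination of a text-based and a citation-based distance. The second applies Fisher's method to p-values, using the closed form of the chi-square survival function for an even number of degrees of freedom. How the paper converts distances into p-values is not shown on the slides, so the sketch assumes the p-values are already given. The function names, the default weight, and the numbers are hypothetical.

```python
import math

def linear_combination(d_text, d_cite, weight=0.5):
    """Weighted linear combination of a text-based and a citation-based distance."""
    return weight * d_text + (1.0 - weight) * d_cite

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For even degrees of freedom
    its upper tail is exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical inputs for one pair of papers.
combined_distance = linear_combination(0.2, 0.8, weight=0.5)
combined_p = fisher_combined_p([0.05, 0.05])
```

For two p-values the formula reduces to p1 * p2 * (1 - ln(p1 * p2)), so two moderately small p-values combine into a smaller one without either source needing a hand-tuned weight.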

Proposed Algorithm


Dynamic Hybrid Clustering


Goal: Match & track clusters through time


Process:


Separate hybrid clustering for each period


Determine optimal number of clusters


Dendrogram


Silhouette curve


Ben-Hur stability plot


Construct complete graph


All cluster centroids from each period as nodes


Edge weights as mutual cosine similarities in LSS


Form Cluster Chains


Keep edge weights > threshold, T1


Allow qualifying clusters to join > threshold, T2



Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

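
The chain-forming process can be approximated with a small sketch: cluster centroids from all periods are nodes, edges carry cosine similarities, edges at or below a threshold T1 are dropped, and the remaining connected groups are read as chains. Two things here are assumptions rather than the authors' procedure: only clusters from different periods are linked, and the second threshold T2 for letting further clusters join a chain is not modeled. All names, period labels, and vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def cluster_chains(centroids, t1):
    """Group (period, cluster_id) nodes into chains via edges with weight > t1."""
    parent = {node: node for node in centroids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(centroids, 2):
        if a[0] == b[0]:
            continue  # assumption: never link two clusters of the same period
        if cosine(centroids[a], centroids[b]) > t1:
            parent[find(a)] = find(b)

    chains = {}
    for node in centroids:
        chains.setdefault(find(node), set()).add(node)
    # Only groups spanning more than one cluster count as chains.
    return [members for members in chains.values() if len(members) > 1]

# Hypothetical centroids for three periods.
centroids = {
    ("p1", 0): [1.0, 0.0, 0.1],
    ("p1", 1): [0.0, 1.0, 0.0],
    ("p2", 0): [0.9, 0.1, 0.0],
    ("p2", 1): [0.0, 0.2, 1.0],
    ("p3", 0): [1.0, 0.2, 0.0],
}
chains = cluster_chains(centroids, t1=0.8)
```

In this toy input the three clusters with similar centroids form one chain across `p1`, `p2`, and `p3`, while the other two clusters stay unchained.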

Results


Hybrid Clustering

Silhouette Curve

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.

Result


Hybrid Clustering

Silhouette Curve

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.

Result


Hybrid Clustering

Stability

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

Result


Hybrid Clustering

Dendrogram

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

Result


Hybrid Clustering

Cluster Characterization (cluster: number of papers)

RNA structure prediction: 205

Protein structure prediction: 1167

Systems biology & molecular networks: 694

Phylogeny & evolution: 749

Genome sequencing & assembly: 640

Gene / promoter / motif prediction: 995

Molecular DBs & annotation platforms: 1091

Multiple sequence alignment: 713

Microarray analysis: 1147

Result


Dynamic Clustering

Histogram

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.

Result


Dynamic Clustering

Cluster Chains

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.

Yearly Publication Output among Cluster Chains

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.

Dynamic Term Network

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.

Pros & Cons


Pros


Offers fresh perspective on clustering


Integrates various techniques


Provides insight into bioinformatics


Cons


Challenge of selecting the optimal number of clusters still exists


There are many steps required to implement their approach

Questions

References


Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12-15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233


ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch


PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/


The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/


Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method


“Data Mining - Concepts and Techniques” by Han and Kamber, Morgan Kaufmann, 2006. (ISBN: 1-55860-901-6)