LORNET Theme 4

Uploaded by naivenorth on 8 Nov 2013, under Artificial Intelligence and Robotics

Pattern Analysis & Machine Intelligence

Research Group



UNIVERSITY OF WATERLOO




Data Mining and Knowledge Extraction for LO

Theme Leader: Mohamed Kamel

PI’s: O. Basir, F. Karray, H. Tizhoosh

Assoc PI’s: A. Wong, C. DiMarco

PAMI Research Group, University of Waterloo

Theme 4 Team

Leader: M. Kamel

PIs: Dr. Basir, Dr. Karray, Dr. Tizhoosh

Associate PIs: Wong, DiMarco

Researchers: H. Ayad, R. Kashef, A. Ghazel, Dr. Makrehchi, M. Shokri, S. Hassan, A. Farahat, Dr. R. Khoury

Funding: CRC/CFI/OIT, NSERC, PAMI Lab, PDS, Vestech, Desire2Learn

Graduated:
R. Khoury, PhD 07
L. Chen, PhD 07
M. Makrehchi, PhD 07
K. Hammouda, PhD 07
R. Dara, PhD 07
Y. Sun, PhD 07
K. Shaban, PhD 06
Y. Sun, PhD 06
M. Hussin, PhD 05
Jan Bakus, PhD 05
A. Adegorite, MASc 04
A. Khandani, MASc 05
S. Podder, MASc 04

Data and Knowledge Mining

Knowledge extraction and discovery of patterns from data.

Labeling and categorization, summarization, classification, prediction, association rules, clustering

Theme Overview

LO Mining

Knowledge Extraction
  From Text
    Syntactic: keyword-based, keyphrase-based
    Semantic: concept-based
  From Images
    Image features, shape features
  From Text + Images
    Describing images with text
    Enriching text with images

Tagging and Organizing
  Classification (MCS, Data Partitioning, Imbalanced Classes)
  Clustering (Parallel/Distributed Clustering, Cluster Aggregation)

Matching and Ranking
  LO Similarity and Ranking
  Association Rules / Social Networks
  Reinforcement Learning
  Specialized / Personalized Search

Types of Data in LORNET

LCMS (Course > Module > Lesson > LO)
  Subject Matter: Text, Images, Flash, Applets, Metadata, Interaction Logs

Discussion Board (Board > Thread > Post)
  Discussions: Text, Interaction Logs

LOR (Metadata Records)
  LO Descriptors: Metadata

TELOS Semantic Layer (Resources)
  Resources: Metadata, Semantic References


Abstract View of Data for Mining


Text (Plain or Markup)


Any resource that contains text is viewed as an abstract text document (some markup can be
preserved to indicate different weights); e.g. HTML page, Word document, email message, discussion
post, even metadata records.


Suitable for text mining, information/metadata extraction, summarization, natural language
processing, semantic/concept analysis, social network analysis.



Numeric Matrix (Vector Space Model)


Requires text mining algorithms to convert the original text to numeric form through feature extraction
and statistical weighting.


Suitable for machine learning algorithms that expect numeric input, especially classification and
clustering algorithms.



Feature Vectors


Suitable for mining images: description, indexing, and retrieval (CBIR). Requires image processing
algorithms to extract image features.


Also suitable for mining and learning from interaction logs, where each vector describes an event.



Relationship


Provides domain knowledge about data, such as containment (e.g. LO within Course, Post within
Thread) and relatedness (collection of resources, cross-referenced LOs).


The extra knowledge could be exploited to improve accuracy, or to apply the same algorithm to
different parts of the data (e.g. generating one summary for entire course, or one summary per
lesson.)
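
The conversion from text to a numeric matrix described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and the common TF-IDF weighting (term count times log of inverse document frequency); the slides do not say which weighting scheme the framework uses, and the function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF weight of term j in document i."""
    # Assumption: lowercase whitespace tokenization stands in for real text preprocessing.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] * math.log(n_docs / doc_freq[term])
                       for term in vocabulary])
    return vocabulary, matrix


docs = [
    "learning objects store course content",
    "discussion posts mention course content",
    "image features describe shape",
]
vocab, matrix = tfidf_matrix(docs)
```

Each row of `matrix` is one document vector, which is the form that classification and clustering algorithms expect.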


Data Representation


What level of granularity


One representation or multiple


Feature representation


Dimensionality issues


Document Modeling


Document is represented by a set of concepts called “indexing terms”

Document segmentation

sub-word level (decomposition of words and their morphology)

word level (words and lexical information)

multi-word level (phrases and syntactic information)

semantic level (the meaning of the text)

pragmatic level (the meaning of the text with respect to the context and situation; ontology?)

Document Modeling

[Figure: document modeling levels (pragmatic, semantic, multiword, word, sub-word), ranging from content-based to context-based. Axis labels: required domain knowledge, complex algorithms, noise & redundancy, dimensionality]

Document Modeling

[Figure: the same levels (pragmatic, semantic, multiword, word, sub-word) annotated by how well explored they are. Labels: Term-level (most popular), Emerging, Not explored, Not usual]

Document Modeling


Bag-of-words (VSM): most popular document representation model

word sequence

weighting terms by their importance (based on frequency)

terms are assumed to be independent and uncorrelated

Bag-of-words (VSM): Drawbacks

ignoring term dependencies and correlations

ignoring text structure

ignoring ordering of the words in the document

IR research shows that word ordering is not important.

ignoring grammar

language independent

Solutions: generalized VSM, LSI, phrase-based model, concept-based representation

Curse of Dimensionality




The number of training samples required grows exponentially with the number of features

For a fixed sample size, increasing the number of features may degrade performance (the peaking phenomenon)

A limited sample size leads to overfitting, which implies a lack of generalization and low performance

Dimensionality Reduction


Feature extraction

employing all dimensions of the measurement space to obtain a new transformed space (compacting the feature space without removing any features)

identifying important combinations of the features (PCA, manifold learning, SVD and factor analysis)

low dimensional embeddings (random projections)

Pros and Cons

+ promising results
+ solid mathematical background
- high complexity (time and space)
- lack of scalability
- fails in high dimensional problems of data mining
- extracted features usually have no meaning

Dimensionality Reduction


Feature selection

reducing the feature space dimensionality by removing useless, redundant, irrelevant and noisy features

it is a problem of searching for a subset of features among all features, based on one or more performance indices (objective functions)

Makrehchi and Kamel, IEEE SMC 07
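
Feature selection as a search over features scored by an objective function can be illustrated with a simple filter. The sketch below assumes the chi-square statistic between term presence and a target class as the objective; the slide does not name a specific index, so this choice, and the names `chi_square` and `select_features`, are illustrative only.

```python
def chi_square(term, docs, labels, target):
    """Chi-square statistic between presence of `term` and membership in class `target`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if present and label == target:
            a += 1          # term present, document in target class
        elif present:
            b += 1          # term present, document in another class
        elif label == target:
            c += 1          # term absent, document in target class
        else:
            d += 1          # term absent, document in another class
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0          # term occurs in every document or in none
    return n * (a * d - b * c) ** 2 / denominator


def select_features(docs, labels, target, k):
    """Keep the k terms with the highest chi-square score for the target class."""
    vocabulary = sorted({term for tokens in docs for term in tokens})
    ranked = sorted(vocabulary,
                    key=lambda term: chi_square(term, docs, labels, target),
                    reverse=True)
    return ranked[:k]


docs = [{"goal", "match", "the"}, {"match", "team", "the"},
        {"crop", "soil", "the"}, {"soil", "farm", "the"}]
labels = ["sports", "sports", "farming", "farming"]
selected = select_features(docs, labels, "sports", 2)
```

The term "the" occurs in every document, so it scores zero and is removed, which is the kind of irrelevant feature the slide refers to.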

New Representation Models


Phrase Based Representation

Document Index Graph (DIG)

Hammouda and Kamel, KIS 2004, IEEE KDE 2004

Concept Based Representation

Shehata, Karray and Kamel, ICDM 2006, KDD 07, WI07

Concept-based Mining Model

[Figure: Text -> Text Pre-processor -> Sentence Separator -> Natural Language Processing (POS Tagger, Syntax Parser, Semantic Role Labeler; language dependent) -> Concepts -> Concept-based Model (language independent), consisting of the Concept-based Statistical Analyzer (tf: term frequency, ctf: conceptual term frequency) and the Conceptual Ontological Graph (COG) Representation]

Concept-based Statistical Analyzer

Text Docs

Text Preprocessing
- Separate sentences
- Label terms
- Remove stop-words
- Stem words

Concept-based Term Analysis
- Term frequency (tf)
- Conceptual term frequency (ctf)

Concept-based Document Similarity

Clustering Techniques
- Single Pass
- HAC (ward)
- HAC (complete)
- k-NN

Output: Cluster 1, Cluster 2, Cluster 3

Evaluation

F-measure of the HAC (Ward) (higher is better)

Dataset    Single-Term    Concept-based    Improvement
Reuters    0.723          0.925            +27.94%
ACM        0.697          0.918            +31.70%
Brown      0.581          0.906            +55.93%

Entropy of the HAC (Ward) (lower is better)

Dataset    Single-Term    Concept-based    Improvement
Reuters    0.251          0.012            -95.21%
ACM        0.317          0.043            -86.43%
Brown      0.385          0.018            -95.32%

Evaluation (cont.)

F-measure of the k-NN

Dataset    Single-Term    Concept-based    Improvement
Reuters    0.511          0.917            +79.45%
ACM        0.491          0.891            +81.46%
Brown      0.462          0.902            +95.23%

Entropy of the k-NN

Dataset    Single-Term    Concept-based    Improvement
Reuters    0.348          0.015            -95.68%
ACM        0.402          0.111            -29.1%
Brown      0.316          0.023            -23.03%
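
The two quality measures reported in the evaluation tables can be computed as sketched below. The slides do not give the formulas, so this assumes the definitions commonly used for document clustering: entropy is the size-weighted average of each cluster's class entropy, and F-measure is the size-weighted average, over classes, of the best F score any cluster achieves for that class. The function names are hypothetical.

```python
import math
from collections import Counter


def clustering_entropy(clusters, classes):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    n = len(clusters)
    total = 0.0
    for cluster in set(clusters):
        members = [cls for c, cls in zip(clusters, classes) if c == cluster]
        counts = Counter(members)
        h = -sum((m / len(members)) * math.log(m / len(members))
                 for m in counts.values())
        total += len(members) / n * h
    return total


def clustering_f_measure(clusters, classes):
    """Size-weighted average of each class's best F score over all clusters (higher is better)."""
    n = len(clusters)
    overlap = Counter(zip(classes, clusters))
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for cluster, n_cluster in cluster_sizes.items():
            shared = overlap[(cls, cluster)]
            if shared == 0:
                continue
            precision = shared / n_cluster
            recall = shared / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += n_cls / n * best
    return total


perfect = clustering_f_measure([0, 0, 1, 1], ["a", "a", "b", "b"])
merged = clustering_entropy([0, 0, 0, 0], ["a", "a", "b", "b"])
```

A clustering that reproduces the classes exactly scores F = 1 and entropy 0; putting two equal classes into one cluster yields entropy log 2.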

Classification


Function that assigns an object to a class

Infer that “object X is about sports”

Automatically learn the function from a set of examples

[Figure: a set of objects passes through a Classifier and is assigned to Known Classes: sports, farming, finance]

Classifiers


Template Matching: user needs to supply template and metric

NMC: nearest class mean, simple, no training

K-NN: asymptotically optimal, slow in testing

Bayes: yields simple classifier for Gaussian distributions

NN: nonlinear, sensitive to parameters, slow training

DT: binary, transparent, sensitive to overtraining

SVM: nonlinear, insensitive to overtraining, slow, good generalization
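
As an illustration of the simplest entry in the list, a nearest mean classifier (NMC) can be sketched in a few lines. This assumes numeric feature vectors and squared Euclidean distance; the function names are hypothetical.

```python
def class_means(samples, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [value / counts[y] for value in acc] for y, acc in sums.items()}


def predict_nearest_mean(means, x):
    """Assign x to the class whose mean is closest (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(means, key=lambda y: squared_distance(means[y], x))


means = class_means([[0, 0], [0, 2], [10, 10], [12, 10]], ["a", "a", "b", "b"])
prediction = predict_nearest_mean(means, [1, 1])
```

The only "training" is averaging each class, which is why the slide describes NMC as simple with no training.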

Multiple Classifier Systems



Multiple classifier systems consist of a set of classifiers and a combination strategy.

Motivations:

Many alternative classifiers exist, each with its own feature and representation space

Different training sets exist, collected at different times and possibly with different features

Each classifier may perform well in its own region of the feature space

Classifiers may make different patterns of mistakes, even when they are trained on the same data

Multiple Classifier Systems Design



Design of MCS can be accomplished at 4 levels [Kuncheva 04]

Aggregation Level

Classifier Level

Feature level

Data Level

[Figure: MCS architecture with training data D1, D2, ..., Dn, features X1, X2, ..., Xm, Classifier 1, Classifier 2, ..., Classifier n, and an Aggregation Rule]

Combining Schemes


Static vs Adaptive, Fixed vs Trainable

Voting methods: Max, average, majority, Borda

Weighted average, fuzzy integrals, belief theory.

Decision Template, Behavior Knowledge Space

Feature Based Architecture (Adaptive) (Wanas and Kamel 99-02): aggregation is trained and adapts to the data rather than postprocessing.

Data Level combining: partitioning technique for training multiple classifiers that generates nearly optimal training partitions (Dara, .. and Kamel IF04, PR 06)
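
Two of the fixed voting rules listed above, majority vote and Borda count, can be sketched as follows. The input formats (one label per classifier for majority vote, one full ranking per classifier for Borda) and the function names are assumptions made for this illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one class label per classifier; returns the most frequent label."""
    return Counter(predictions).most_common(1)[0][0]


def borda_count(rankings):
    """rankings: one list per classifier, ordered from most to least preferred class."""
    scores = Counter()
    for ranking in rankings:
        for position, label in enumerate(ranking):
            # The top-ranked class earns len(ranking) - 1 points, the last earns 0.
            scores[label] += len(ranking) - 1 - position
    return max(scores, key=scores.get)


vote = majority_vote(["sports", "finance", "sports"])
borda = borda_count([
    ["sports", "finance", "farming"],
    ["finance", "sports", "farming"],
    ["finance", "farming", "sports"],
])
```

Both rules are static: they need no training, in contrast to the trainable and adaptive schemes on the same slide. Ties are resolved here by first occurrence, which is an arbitrary choice.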

Imbalanced Classes

(Sun and Kamel, ICDM 2006, PR 2007)

Data Set: 20-Newsgroup
Class size ratio: 1/15
Performance measure: F-measure
Base classifier: Naïve Bayesian

                        NB      AdaBoost  AdaC1   AdaC2   AdaC3
F (small size class)    58.25   59.26     64.11   69.08   68.91
F (large size class)    97.13   97.98     98.28   98.31   98.42
Acc                     94.63   96.15     96.73   96.80   97.00

Data Set: SchoolNet
Class size ratio: 1/12
Performance measure: F-measure
Base classifier: Decision Trees C4.5

                        C4.5    AdaBoost  AdaC1   AdaC2   AdaC3
F (small size class)    22.78   31.58     35.16   52.73   53.85
F (large size class)    92.50   93.63     92.63   93.35   93.91
Acc                     86.32   88.34     86.77   88.34   89.24

Observations:

1. Performance of the base classifier on the small class is poor
2. AdaBoost can improve classification accuracy
3. AdaBoost does not guarantee improved performance on the small class
4. AdaC2 and AdaC3 are effective in increasing the identification performance of the small class

Dealing with time dependent data

Time series data contains dynamic information and is difficult to model with any individual representation method

Traditional classifiers for time series data, like Dynamic Time Warping (DTW), are not robust

Aggregating the decisions based on different representations could provide better and more reliable performance

(Chen and Lei 2004-2006)
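
For reference, the DTW distance mentioned above as the traditional baseline can be sketched with the standard dynamic programming recurrence. This assumes one-dimensional series and absolute difference as the local cost; it illustrates the baseline only, not the aggregation approach the slide proposes.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

A series and a time-stretched copy of it have distance zero, which shows the warping invariance that makes DTW popular for time series.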

Architecture

Experimental Results


Clustering

Finding groups of objects such that objects in a group are similar to one another and different from (dissimilar to) objects in other groups

Intra-cluster distances are minimized

Inter-cluster distances are maximized

Clustering Approaches

Hierarchical: single link

Partitional: K-means, Fuzzy K-means, Bisecting, VQ

Density based: DBSCAN, Chameleon

Agglomerative: starts from individual clusters, then merges

Divisive: starts from one cluster and divides

Connectionist: SOM, ART

Multi-clustering

Overview of Combining Cluster Ensembles

[Figure: several Partial/Local Clusterings form the Generated Cluster Ensemble; a Combining Method (Clusters Mapping Method, Ensemble Combination Scheme, Ensemble Summarization/Voting Representation) produces the Combined/Global Clustering]

Cluster Ensemble

Developed a prototype for cluster ensemble methods (Ayad and Kamel 2005-2007), including:

- Generation of cluster ensembles based on: (1) multiple feature subsets, (2) statistical sampling techniques, and (3) variable number of clusters (multi-resolution ensembles).

- Combiners of cluster ensembles based on: (1) shared nearest neighbors, (2) different representations and distance measures between clusters, and (3) voting.

Positive experimental results on text data, in addition to a variety of benchmark data for machine learning algorithms
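
To make the idea of combining an ensemble concrete, the sketch below uses a co-association matrix: two objects are linked when they fall in the same cluster in more than a given fraction of the ensemble's partitions, and the consensus clusters are the connected components of those links. This is a generic, well-known combiner swapped in for illustration; it is not one of the combiners developed in the prototype, and the names and the 0.5 threshold are assumptions.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / len(partitions)
    return matrix


def consensus_clusters(partitions, threshold=0.5):
    """Connected components of the graph linking pairs co-clustered above the threshold."""
    matrix = co_association(partitions)
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


ensemble = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
consensus = consensus_clusters(ensemble)
```

No single partition needs to be correct: here the three partitions disagree on objects 2 and 4, yet the consensus recovers two groups because the cluster labels themselves are never compared, only co-membership.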


Categorization using cluster ensemble

Dataset                      # samples  # attributes  # classes  K-means mean error rate (%)  Ensemble mean error rate (%)
Synthetic1                   1000       8             5          17.41                        0
Yahoo! (text)                2340       1458          6          38.23                        16.24
Texture (image)              5500       40            11         37.99                        11.54
Optical Digit Recognition    500        64            10         27.31                        16.40

Projects Overview

Information Extraction (input: text documents)
Analyzing content to extract relevant information
- Keyword Extraction
- Summarization
- Concept Extraction
- Social Network Analysis

Categorization (input: text documents)
Organizing LOs according to their content
- Classification: Traditional, MCS, Imbalanced
- Clustering: Traditional, Ensembles, Distributed

Personalization (input: interaction logs)
Providing user-specific results
- Reinforcement Learning: Traditional, Opposition-based

Image Mining (input: images)
Describing and finding relevant images
- CBIR: Traditional, Fusion-based

Integration and Applications
- In Progress
- Publications
- Theme and Industry Collaboration
- Software Components

Information Extraction: Summarization

LO Content Package Summarization

Learning objects stored in IMS content packages are loaded and parsed. Textual content files are extracted for analysis.

Statistical term weighting and sentence ranking are performed on each document and on the whole collection.

Top relevant sentences are extracted for each document.

Planned functionality: Summarization of whole modules or lessons (as opposed to single documents).

Benefits

Provide summarized overview of learning objects for quick browsing and access to learning material.

Scenarios

Learning Management Systems can call the summarization component to produce summaries for content packages.

Data courtesy of the University of Saskatchewan
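
The term weighting and sentence ranking steps can be sketched as a small extractive summarizer. This assumes raw term frequency as the weight and the average weight of a sentence's terms as its score; the actual component's weighting scheme is not described on the slide, and `summarize` is a hypothetical name.

```python
import re
from collections import Counter


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # Assumption: a term's weight is its frequency in the whole text.
    weights = Counter(term for tokens in tokenized for term in tokens)

    def score(tokens):
        return sum(weights[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


text = ("Clustering groups similar documents. "
        "Clustering documents helps browsing. "
        "The weather was nice.")
summary = summarize(text, 2)
```

Summarizing a whole lesson or module, as planned on the slide, would amount to computing the term weights over the larger collection before ranking.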

Information Extraction: Social Network Analysis

Social Network Builder

Tasks

Finding relationships between people based on their web pages

Progress

Modeling

Actors are represented by their associated documents

Links are modeled by
- pair-wise similarity of the actors’ documents
- merging actors’ documents

Relations are also modeled by documents

Learning

Some links are known: learning the social network is translated into a text classification problem

No link is revealed: a clustering problem with very low performance
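
The pair-wise similarity model of links can be sketched as follows: each actor's documents are merged into one term-count vector, and a link is proposed when the cosine similarity of two actors exceeds a threshold. The threshold value, the whitespace tokenization, and the names `cosine` and `build_links` are assumptions for this illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_links(actor_docs, threshold=0.3):
    """Propose a link between every pair of actors whose merged documents are similar."""
    vectors = {actor: Counter(" ".join(docs).lower().split())
               for actor, docs in actor_docs.items()}
    links = []
    for a, b in combinations(sorted(vectors), 2):
        similarity = cosine(vectors[a], vectors[b])
        if similarity >= threshold:
            links.append((a, b, similarity))
    return links


actor_docs = {
    "alice": ["text mining clustering"],
    "bob": ["clustering text documents"],
    "carol": ["image retrieval shape"],
}
links = build_links(actor_docs)
```

When some links are already known, the same pair-wise features could instead feed a classifier, which corresponds to the supervised case on the slide.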

Information Extraction: Concept Extraction

[Figure: Text -> Text Pre-processor -> Sentence Separator -> Natural Language Processing (POS Tagger, Syntax Parser, Semantic Parser / Semantic Role Labeler; language dependent) -> Concepts -> Concept-based Model (language independent), consisting of the Concept-based Statistical Analyzer (tf: term frequency, ctf: conceptual term frequency) and the Conceptual Ontological Graph (COG) Representation]

Concept-Based Statistical Analyser

F-measure of Hierarchical Clustering

Dataset    Single-Term    Concept-based    Improvement
Reuters    0.723          0.925            +27.94%
ACM        0.697          0.918            +31.70%
Brown      0.581          0.906            +55.93%

Entropy of Hierarchical Clustering

Dataset    Single-Term    Concept-based    Improvement
Reuters    0.251          0.012            -95.21%
ACM        0.317          0.043            -86.43%
Brown      0.385          0.018            -95.32%

Conceptual Ontological Graph (COG) Ranking

Precision of Search

Dataset    Single-Term    Concept-based    Improvement
Cran       0.536          0.901            +68.09%
Reuters    0.591          0.897            +51.77%

Recall of Search Result

Dataset    Single-Term    Concept-based    Improvement
Cran       0.486          0.827            +70.16%
Reuters    0.452          0.841            +86.06%

Information Extraction: Keyword Extraction

Semantic Keyword Extraction



Tasks

Developing tools and techniques to extract semantic keywords toward facilitating metadata generation

Developing algorithms to enrich metadata (tags) which can be applied in index-based multimedia retrieval

Progress

Proposed a new information theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo ontology)

Makrehchi and Kamel, ICDM 07, WI 07

Information Extraction: Keyword Extraction

Rule-based Keyword Extraction

Learn rules to find keywords in English sentences

Rules represent sentence fragments
- Specific enough for reliable keyword extraction
- General enough to be applied to unseen sentences

Rule generalization
- Begin with an exact sentence fragment
- Merge with another by moving different words to the lowest common level in the part-of-speech hierarchy
- Keep merged rule if it does not reduce precision and recall of keyword extraction; keep original rules otherwise

Keyword extraction
- Find sequence of rules that best cover an unseen sentence
- Extract keywords according to rules

Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination
- Learns 20 rules from the first 50 training rules
- Learns 13 additional rules from the next 220 training rules

Both precision and recall values increase during training
- Precision (blue) increases 10%
- Recall (red) shows slight upward trend

Categorization: Ensemble-based Clustering

Consensus Clustering

Categorization of learning objects using proposed consensus clustering algorithms.

The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.

Consensus clustering can offer several advantages over a single data clustering, such as the improvement of clustering accuracy, enhancing the scalability of clustering algorithms to large volumes of data objects, and enhancing the robustness by reducing the sensitivity to outlier data objects or noisy attributes.

Tasks

Development of techniques for producing ensembles of multiple data clusterings where diverse information about the structure of the data is likely to occur.

Development of consensus algorithms to aggregate the individual clusterings.

Develop solutions for the cluster symbolic-label matching problem

Empirical analysis on real-world data and validation of proposed method.
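
The symbolic-label matching problem arises because cluster labels are arbitrary: cluster "2" in one clustering may correspond to cluster "0" in another. The sketch below relabels one clustering to agree as much as possible with a reference by trying every label permutation. This brute-force search is only practical for a handful of clusters and is an illustrative stand-in, not the solution developed in the project; it also assumes the second clustering has no more distinct labels than the reference.

```python
from itertools import permutations


def align_labels(reference, other):
    """Relabel `other` so that it agrees with `reference` on as many objects as possible."""
    ref_labels = sorted(set(reference))
    other_labels = sorted(set(other))
    best_mapping, best_agreement = None, -1
    # Assumption: len(other_labels) <= len(ref_labels), so every label can be mapped.
    for candidate in permutations(ref_labels, len(other_labels)):
        mapping = dict(zip(other_labels, candidate))
        agreement = sum(1 for r, o in zip(reference, other) if mapping[o] == r)
        if agreement > best_agreement:
            best_mapping, best_agreement = mapping, agreement
    return [best_mapping[o] for o in other]


aligned = align_labels([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
```

Once all clusterings in an ensemble share a label vocabulary, a per-object vote across them becomes meaningful.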



Distributed Environments

Distributed Data Mining

Applying Data Mining in an environment where the data, the mining process, or both are distributed.

Motivation

Natural distribution of data on the Web.

Scenarios that require the integration of disparate data and mining results are emerging (e.g. federation of repositories, news feed aggregation, digital libraries, business intelligence gathering, etc.)

Emerging technologies, such as Semantic Web, Web Services, Grid Computing, make it feasible to build distributed mining systems.

Availability of cheap low-end hardware that could be utilized in a distributed environment to achieve high-end goals (e.g. Google, SETI@Home, Folding@Home, etc.)

Categorization: Distributed Clustering

Hierarchical P2P Document Clustering

Peer nodes are arranged into groups called “neighborhoods”.

Multiple neighborhoods are formed at each level of the hierarchy.

The size of each neighborhood is determined through a network partitioning factor.

Each neighborhood has a designated supernode.

Supernodes of level h form the neighborhoods for level h+1.

Clustering is done within neighborhood boundaries, then is merged up the hierarchy through the supernodes.

Benefits

Significant speedup over centralized clustering and flat peer-to-peer clustering.

Multiple levels of clusters.

Distributed summarization of clusters using CorePhrase keyphrase extraction.

Scenarios

Distributed knowledge discovery in hierarchical organizations.

HP2PC Architecture

[Figure: neighborhoods (Q) and supernodes (S) at levels h = 0, h = 1, h = 2, ..., h = H-1, with the root at h = H]

HP2PC Example

3-level network, 16 nodes

h = 0, β = 0.2: $P^{(0)} = \{p_1^{(0)}, \ldots, p_{16}^{(0)}\}$, $Q^{(0)} = \{Q_1^{(0)}, \ldots, Q_4^{(0)}\}$
h = 1, β = 0.33: $P^{(1)} = \{p_1^{(1)}, p_2^{(1)}, p_3^{(1)}, p_4^{(1)}\}$, $Q^{(1)} = \{Q_1^{(1)}, Q_2^{(1)}\}$
h = 2, β = 0: $P^{(2)} = \{p_1^{(2)}, p_2^{(2)}\}$, $Q^{(2)} = \{Q_1^{(2)}\}$
h = 3 (root)

Categorization: Multiple Classifier Systems

Tasks

To investigate various aspects of cooperation in Multiple Classifier Systems (Classifier Ensembles)

To develop evaluation measures in order to estimate various types of cooperation in the system

To gain insight into the impact of changes in the cooperative components with respect to system performance using the proposed evaluation measures

To apply these findings to optimize existing ensemble methods

To apply these findings to develop novel ensemble methods with the goal of improving classification accuracy and reducing computation complexity

Progress

Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.

Proposed an ensemble training algorithm called Clustering, De-clustering, and Selection (CDS).

Proposed and optimized a cooperative training algorithm called Cooperative Clustering, De-clustering, and Selection (CO-CDS).

Investigated the applications of proposed training methods (CDS and CO-CDS) on LO classification.

Categorization: Imbalanced Class Distribution

Objective

Advance classification of multi-class imbalanced data

Tasks

To develop cost-sensitive boosting algorithm AdaC2.M1

To improve the identification performance on the important classes

To balance classification performance among several classes

Categorization: Imbalanced Class Distribution

Class Distribution

Class  Size  Dist.
C1     49    7.84%
C2     288   46.08%
C3     288   46.08%

Performance of Base Classification and AdaBoost

                 C4.5               HPWR (Od=3)
Class  Meas.     Base    AdaBoost   Base    AdaBoost
C1     R         0       5.11       10.70   44.06
C1     P         N/A     6.5        11.82   32.89
C1     F         N/A     5.84       10.83   35.84
C2     R         73.21   92.28      88.31   87.43
C2     P         69.53   88.75      86.79   91.99
C2     F         72.29   90.38      87.43   89.64
C3     R         67.94   91.36      87.63   88.42
C3     P         73.89   87.88      87.07   89.91
C3     F         71.91   89.42      86.99   89.03
G-measure        0       11.46      33.32   68.50

Balanced performance among classes, evaluated by G-mean

                 C4.5                          HPWR (Od=3)
Class  Meas.     Base    AdaBoost  AdaC2.M1    Base    AdaBoost  AdaC2.M1
C1     R         0       5.11      77.58       10.70   44.06     65.72
C1     P         N/A     6.50      14.12       11.82   32.89     30.83
C2     R         73.21   92.28     64.73       88.31   87.43     83.12
C2     P         69.53   88.75     97.24       86.79   91.99     91.38
C3     R         67.94   91.36     65.23       87.63   88.42     83.95
C3     P         73.89   87.88     93.22       87.07   89.91     90.81
G-mean           0       11.46     68.42       33.32   68.50     76.08
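
The G-mean used to judge balance among classes is commonly defined as the geometric mean of the per-class recalls, so a classifier that ignores one class scores zero regardless of how well it does on the others. The sketch below assumes that definition; the slides do not state the formula, and averaging over cross-validation folds (not shown here) would change the exact figures.

```python
import math


def recall_per_class(true_labels, predicted_labels):
    """Recall of each class: correctly predicted members divided by class size."""
    recalls = {}
    for cls in sorted(set(true_labels)):
        size = sum(1 for t in true_labels if t == cls)
        hits = sum(1 for t, p in zip(true_labels, predicted_labels)
                   if t == cls and p == cls)
        recalls[cls] = hits / size
    return recalls


def g_mean(recalls):
    """Geometric mean of the per-class recalls."""
    values = list(recalls.values())
    return math.prod(values) ** (1.0 / len(values))


true = ["C1", "C1", "C2", "C2", "C2", "C2"]
predicted = ["C1", "C2", "C2", "C2", "C2", "C1"]
recalls = recall_per_class(true, predicted)
balance = g_mean(recalls)
```

This explains the zero in the C4.5 base column: a recall of 0 on C1 forces the G-mean to 0 even though C2 and C3 are classified reasonably well.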

Personalization

Opposition-based Reinforcement Learning for Personalizing Image Search

Developing a reliable technique to assist users and to facilitate and enhance the learning process

The personalized ORL tool helps the user find, among the search results, the images she or he wants

The tool gathers the images in the search results and selects a sample of them

By presenting the sample and interacting with the user, it learns the user’s preferences


Personalization

Opposition-based RL algorithms:

OQ(lambda) (International Joint Conference on Neural Networks, 2006)

and NOQ(lambda) (IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007)

Image Mining: CBIR

Content based image retrieval

Build an IR system that can retrieve images based on: Textual Cues, Image content, NL Queries

[Figure: Image Retrieval Tool Set. Inputs: Query Image QI, Query Text QT, Query Document. Outputs: images, Rich Documents. Functions: Documents contain QI, Images match QI, NL Description of Image, Images contain QT, Automated image tagging]

Illustrative Example

[Figure: comparison of IZM, FD, MTAR, and the proposed approach; accuracies shown: 70%, 55%, 60%, 95%]

The Performance of the proposed approach

Experimental Results (Cont’d)


Image Mining: CBIR

Interface Module to TELOS

[Figure: components TELOS, IKB-BLDR, LOR, Image Admission Interface, LO Image Repository, and TELOS IR; data exchanged: Image, Compound Document, Text, Query, Response]

Integration and Applications


Progress



Finished core parts of the common data mining
framework.



Built components and services from theme researchers’
work around the data mining framework.



Provided documentation for the data mining framework
and software components.



Launched web site to host components and
documentation from Theme 4:

http://pami.uwaterloo.ca/projects/lornet/software/


Integration and Applications


Progress



Core parts of the common data mining framework are available,
including:


Vector and matrix manipulation.


Document parsing and tokenization.


Statistical term and sentence analysis.


Similarity calculation using multiple distance functions.


IMS Content Package compliant parser.



Components and tools built around the common data mining
framework:


Metadata extraction from single documents; supports Dublin Core encoding.


Document similarity calculation using cosine similarity.


Single document and content package summarization.


Building of standard text datasets from large document collections.



Integration with TELOS:


Developed C# TELOS connector for integrating Theme 4 components.


Worked on component manifest specification with Theme 6.


Provided metadata extraction as part of a complete scenario for TELOS components integration.


The following components were wrapped for use by TELOS through the C# connector: Automatic
Metadata Extractor, Document Similarity, and Document Summarizer.


Theme and Industry Collaboration

Other LORNET themes

Providing tools for concept-based metadata extraction to SFU and U of Saskatchewan.

Providing tools for semantic-based ontology representation to SFU.

Providing tools for searching course content and discussion data provided by U of Saskatchewan.

Providing tools for comparing course content with discussion board data provided by U of Saskatchewan.

Industry

Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.

Vestech provided opportunities for researchers to work on speech technologies.

Desire2Learn opened job opportunities for LORNET researchers.

Software Components

Scenarios for Use of Software Components

Environment: Learning Object Repository
Data Types: Metadata, Structured Text, Categorical

Environment: e-Learning Environment
Data Types: Structured Text, Images, Object Relationships, Context

Tasks (Learning Object Repository and e-Learning Environment):
- Automatic metadata extraction
- LO automatic classification
- LO organization through clustering
- Multiple organization strategies through cluster ensembles
- Extracting concepts from LO
- Summarizing Documents
- Grouping LOs
- Tagging LOs
- Discovering Similar Topics
- Discovering Similar Peers
- Building Social Networks
- Detecting Plagiarism
- LO recommendation using similarity ranking
- Personalization / Specialization through reinforcement learning

Environment: TELOS
Data Types: Metadata, Ontology
Tasks:
- Ontology construction and unification
- Finding relations between components
- Ranking components
- Grouping components
- Tagging components

Legend: Integrated, Ready, In Progress, Year 5

Overview of Components

General Tools
- C# Connector for TELOS
- Common Data Mining Framework

Standard Text Mining Tools
- Metadata Extractor
- Document Summarizer
- Content Package Summarizer
- Document Similarity
- LO Recommender
- Metadata Harvester
- Keyword Extractor
- Taxonomy Extractor
- Metadata Enrichment Tools

Concept-based and Semantic Text Mining Tools
- Metadata Extractor
- LO Search Engine
- Document Similarity
- Document Classifier
- Document Clusterer
- Semantic-based Ontology Representation
- Semantic Metadata Matching
- POS Rule-Learning System
- Triplet Representation System

Categorization Tools
- LO Classifier
- LO Multiple Classifier
- LO Clusterer
- LO Ensemble Clusterer
- LO Consensus Clusterer
- LO Distributed Clusterer

User-centric Tools
- Personalized Search Engine
- Social Network Learner

Image Mining Tools
- Content-based Image Search
- Personalized Image Search
- Consensus-based Fusion for Image Retrieval

Publications

Project                                          Papers (accepted / published)  Papers (submitted / in prep)  Theses (completed / in progress)
4.1 Information Extraction from Text             11                             7                             3/2
4.2 Semantic Knowledge Synthesis from Text       10                             4                             4/1
4.3 Knowledge Discovery through Categorization   12                             10                            4/1
4.4 Knowledge from Interaction                   8                              3                             1/2
4.5 Knowledge from Image Mining                  10                             3                             2/1
Total                                            51                             27                            14/7 (21 in total)



Pattern Analysis and Machine Intelligence Lab


Electrical and Computer Engineering

University of Waterloo

Canada


www.pami.uwaterloo.ca

www.pami.uwaterloo.ca/projects/lornet/software/

www.pami.uwaterloo.ca/kamel.html (publications)