A short introduction to sequential data mining


Koji IWANUMA, Hidetomo NABESHIMA
University of Yamanashi

The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence

Two Main Frameworks of Sequential Mining

- Sequential pattern mining for multiple data sequences

    Sequence ID   Purchase data record
    1             <bread, cheese>
    2             <(wheat, milk), bread, (berry, sausage)>
    3             <(bread, pumpkin, sausage)>
    4             <bread, cheese, sausage>
    5             <cheese>

- Sequential pattern mining for a single data sequence

    Data sequence: <S1 S2 S3 S4 S5 S6 S7 … … Sn>

What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set of frequent subsequences.

A sequence database:

    SID   sequence
    10    <a(abc)(ac)d(cf)>
    20    <(ad)c(bc)(ae)>
    30    <(ef)(ab)(df)cb>
    40    <eg(af)cbc>

- A sequence: <(ef)(ab)(df)cb>
- An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
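
The subsequence and support definitions on this slide can be checked mechanically. The sketch below is only an illustration: the helper names `parse`, `is_subsequence` and `support` are made up here, and the parser assumes every item is a single letter, as in the example database.

```python
def parse(text):
    """Turn 'a(bc)d' into [{'a'}, {'b', 'c'}, {'d'}] (single-letter items assumed)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(set(text[i + 1:j]))
            i = j + 1
        else:
            elements.append({text[i]})
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """True if each pattern element is a subset of a distinct, later element of sequence."""
    pos = 0
    for element in pattern:
        # Greedy leftmost match: skip elements that do not contain this itemset.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


# The sequence database from the slide.
db = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(is_subsequence(parse("a(bc)dc"), db[10]))  # True
print(support(parse("(ab)c"), db))               # 2, so it meets min_sup = 2
```

Only sequences 10 and 30 have an element containing both a and b followed by an element containing c, which is why the support of <(ab)c> is exactly 2.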

Challenges on Sequential Pattern Mining

- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
    - find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
    - be highly efficient and scalable, involving only a small number of database scans
    - be able to incorporate various kinds of user-specific constraints

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji

Sequential Pattern Mining Algorithms for Multiple Data Sequences

- Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96)
- Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01)
- Vertical format-based mining: SPADE (Zaki @ Machine Learning’00)
- Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03)

J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
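
To give a feel for the pattern-growth idea behind PrefixSpan, here is a minimal sketch. It is not the published algorithm: it assumes every element is a single item (no itemsets such as (ab)), and the function name `prefixspan` and the toy database are hypothetical.

```python
from collections import defaultdict


def prefixspan(sequences, min_sup):
    """Return {pattern tuple: support} for sequences whose elements are single items."""
    results = {}

    def grow(prefix, projected):
        # projected holds (sequence index, start position) pairs: the suffixes
        # that remain after the leftmost match of `prefix` in each sequence.
        first_pos = defaultdict(dict)
        for sid, start in projected:
            seq = sequences[sid]
            for pos in range(start, len(seq)):
                # Remember only the leftmost occurrence of each item per sequence.
                first_pos[seq[pos]].setdefault(sid, pos)
        for item, occurrences in sorted(first_pos.items()):
            if len(occurrences) >= min_sup:
                pattern = prefix + (item,)
                results[pattern] = len(occurrences)
                # Recurse on the database projected by the longer prefix.
                grow(pattern, [(sid, pos + 1) for sid, pos in occurrences.items()])

    grow((), [(sid, 0) for sid in range(len(sequences))])
    return results


# Hypothetical toy database, not taken from the slides.
toy_db = [list("abc"), list("acb"), list("ab"), list("bc")]
patterns = prefixspan(toy_db, min_sup=2)
for pattern, sup in sorted(patterns.items()):
    print("".join(pattern), sup)
```

Each recursive call scans only the projected suffixes, so no candidate patterns are generated and tested against the full database, which is the main contrast with an Apriori-style method such as GSP.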

Mining Sequential Patterns from a Very Long Single Sequence

[Figure: a series of daily newspaper articles forming one long sequence < … >, in which "typhoon" is followed by "flood, landslide" twice; the pattern shown is <typhoon (flood, landslide)>.]
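
One simple way to count such a pattern in a single long sequence is to slide a fixed-width window over it and count the windows that contain the pattern. The sketch below is a simplified illustration, not a faithful reimplementation of any cited method: the daily events, the window widths and the names `contains` and `window_frequency` are all assumptions, and only windows lying entirely inside the sequence are considered.

```python
def contains(pattern, window):
    """True if `pattern` (a list of item sets) occurs in order inside `window`."""
    pos = 0
    for element in pattern:
        while pos < len(window) and not element <= window[pos]:
            pos += 1
        if pos == len(window):
            return False
        pos += 1
    return True


def window_frequency(pattern, sequence, width):
    """Number of length-`width` windows of `sequence` that contain `pattern`."""
    return sum(
        contains(pattern, sequence[start:start + width])
        for start in range(len(sequence) - width + 1)
    )


# Hypothetical daily news events, one set per day (not data from the talk).
days = [
    {"election"},
    {"typhoon"},
    {"flood", "landslide"},
    {"budget"},
    {"typhoon"},
    {"flood", "landslide"},
    {"sports"},
]
pattern = [{"typhoon"}, {"flood", "landslide"}]

print(window_frequency(pattern, days, width=2))  # 2
print(window_frequency(pattern, days, width=3))  # 4
```

The pattern really occurs twice, yet with width 3 each occurrence is covered by two overlapping windows and is therefore counted twice. This is the kind of duplicate counting that a window-based frequency measure has to deal with.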

Sequential Pattern Mining Algorithms for a Single Data Sequence

- Discovery of frequent episodes in event sequences, based on a sliding window system [Mannila 1998]
    - The frequency measure becomes anti-monotonic, but it has a problem: a single occurrence can be counted more than once (duplicate counting).
- Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]
    - No anti-monotonic frequency measures are investigated.
- On-line approximation algorithm for mining frequent items, not for frequent subsequences
    - Lossy counting algorithm [Manku and Motwani, VLDB’02]
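
For reference, the lossy counting idea can be sketched in a few lines. This is a minimal illustration written from the usual textbook description of the algorithm, not code from the cited paper; the function names, the error bound `epsilon = 0.25` and the toy stream are assumptions. As the slide notes, it counts single items, not subsequences.

```python
import math


def lossy_count(stream, epsilon):
    """One pass over `stream`; returns ({item: [count, max_undercount]}, stream length)."""
    width = math.ceil(1 / epsilon)      # bucket width
    counts = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current bucket
        if item in counts:
            counts[item][0] += 1
        else:
            # The item may have been seen and pruned before: at most bucket - 1 times.
            counts[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent so far.
            for key in [k for k, (f, d) in counts.items() if f + d <= bucket]:
                del counts[key]
    return counts, n


def frequent_items(counts, n, support, epsilon):
    """Items whose stored count reaches (support - epsilon) * n."""
    return {k for k, (f, _) in counts.items() if f >= (support - epsilon) * n}


# Hypothetical toy stream, not data from the talk.
stream = ["a", "b", "a", "c", "a", "d", "a", "e"]
counts, n = lossy_count(stream, epsilon=0.25)
print(counts)                                   # {'a': [4, 0]}
print(frequent_items(counts, n, 0.5, 0.25))     # {'a'}
```

Memory stays small because rare items are pruned at every bucket boundary, at the price of undercounting each item by at most `epsilon * n`.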

Research in Our Laboratory

- Sequential data mining from a very large single data sequence
    - Main target: sequential textual data, especially newspaper-article corpora
    - Objective: to generate a robust and useful large-scale event-sequence corpus
        - Application 1: topic tracking/detection in information retrieval
        - Application 2: automated content tracking on the Web
        - Application 3: semi-automatic scenario/story creation
    - Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.

Technical Topics (1/2)

- A new framework for extracting frequent subsequences from a single long data sequence, presented at the IEEE International Conference on Data Mining 2005 (ICDM 2005):
    - A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting
    - A fast on-line algorithm for some limited cases

Technical Topics (2/2)

- Ongoing and future work
    - On-line rational filters, based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from system output
    - Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system
        - A method using compression based on context-free grammar inference/learning
    - A faster extraction algorithm based on a method for simultaneously searching multiple strings over compressed data


References:

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj


Thanks for your attention!!