Data Mining - Helsinki Institute for Information Technology HIIT


Data mining: theory and applications

Heikki Mannila




Data analysis becoming more important in other sciences and in industry

New measurement methods

Ability to store data

High-dimensional large data sets

Non-traditional forms (e.g., strings, trees, graphs)

Data analysis lags behind

Data mining



Has emerged as a major research area at the interface of computer science and statistics

Machine learning, databases, algorithms

Data analysis questions are increasingly visible in database and algorithms research

Theory and practice interact


Goals


Develop novel data analysis techniques for the use of other sciences and industry


How?


Look at data analysis problems arising in practice

Abstract new computational concepts from them

Analyse the concepts and develop new computational methods

Take the results into practice


Theoretical work in algorithms and foundations of data
analysis can have fast impact in the application areas


The applications feed interesting novel questions to
theoretical research

Major themes in methods


Pattern discovery

Methods for sequence decomposition

Interplay of combinatorial and continuous methods in data mining

Techniques for the decomposition of large 0-1 data sets


Application areas


Genome structure

Gene expression data analysis

Palaeontology

Linguistic applications

Ubiquitous computing

The people


Heikki Mannila, Hannu Toivonen, Jaakko Hollmen,
Aristides Gionis, Floris Geerts, Bart Goethals



6 Ph.D. students

A visible position in the international community

Highlights


Finding recurrent sources in sequences


global structure in genomic sequences


recognizing recurrent contexts in mobile device usage


(k,h)-segmentation


Finding orderings of attributes from unordered binary data


Fragments of order


Spectral ordering techniques


Pattern discovery and mixture modelling techniques for
onomastic data sets


Methods for finding topics in 0-1 datasets on the basis of co-occurrence information

Finding recurrent sources in sequences


Sequences


DNA


Telecommunications


Etc.


How to find some global structure from a sequence?


Try to find homogeneous segments from the sequence

Finding homogeneous segments


Sequence $T$, integer $k$

Measure of homogeneity $H$ for segments of $T$

E.g., $H(S) = |S| \, \mathrm{Var}(S)$

Find the division $T = S_1, S_2, \ldots, S_k$ minimizing $\sum_{i=1}^{k} H(S_i)$

Dynamic programming

(k,k)-segmentation: $k$ segments with no relationship to each other; independent sources

[Figure: the sequence $T$ divided into consecutive segments $S_1, S_2, S_3, S_4, S_5, S_6$]
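
The dynamic-programming step can be sketched as follows. This is a minimal Python illustration, assuming a one-dimensional numeric sequence and the example measure H(S) = |S| Var(S); the function name `k_segmentation` and its return format are choices made for this sketch and do not come from the slides.

```python
def k_segmentation(seq, k):
    """Split seq into k contiguous segments minimizing the sum of |S| * Var(S).

    Returns (cost, bounds); segment i is seq[bounds[i]:bounds[i + 1]].
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(seq)")

    # Prefix sums of x and x^2 give H(S) for any segment in constant time.
    sum1, sum2 = [0.0], [0.0]
    for x in seq:
        sum1.append(sum1[-1] + x)
        sum2.append(sum2[-1] + x * x)

    def homogeneity(i, j):
        """H(seq[i:j]) = |S| * Var(S), the sum of squared deviations from the mean."""
        total = sum1[j] - sum1[i]
        return (sum2[j] - sum2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting seq[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    last_cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                cost = best[m - 1][i] + homogeneity(i, j)
                if cost < best[m][j]:
                    best[m][j] = cost
                    last_cut[m][j] = i

    # Walk the stored cut points backwards to recover the segment boundaries.
    bounds, j = [n], n
    for m in range(k, 0, -1):
        j = last_cut[m][j]
        bounds.append(j)
    return best[k][n], bounds[::-1]


cost, bounds = k_segmentation([1, 1, 1, 5, 5, 9, 9, 9], 3)
# bounds == [0, 3, 5, 8]: the three constant runs, with cost 0.0
```

The triple loop takes O(k n^2) time; the prefix sums keep each evaluation of H constant-time.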

(k,h)-segmentation

We want to limit the number of different types of segments

Only $h<k$ different types are allowed

Find the best segmentation of $T$ into $k$ segments by using only $h$ different types of segments

[Figure: a (6,3)-segmentation with segments labelled Source 1, Source 2, and Source 3]

[Figure: panels "Data", "k = 3 and h = 3", and "k = 3 and h = 2"]

(k,h)-segmentation problem

Given sequence $T$

Find $h$ sources $w_1, w_2, \ldots, w_h$

A decomposition of sequence $T$ into $k$ segments

$T = S_1 S_2 \ldots S_k$

Minimizing the sum of distances from each point $t$ to the source $w_{a(t)}$ of the segment to which $t$ belongs

$\sum_{i=1}^{n} (t_i - w_{a(t_i)})^2$
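
To make the objective concrete, the sketch below evaluates it for a given decomposition. It assumes one-dimensional data and squared distance; the function name and the way segments and assignments are encoded (cut positions plus one source index per segment) are illustrative choices, not notation from the slides.

```python
def kh_segmentation_cost(seq, bounds, assignment, sources):
    """Sum of squared distances from each point to the source of its segment.

    bounds     -- k + 1 cut positions; segment j is seq[bounds[j]:bounds[j + 1]]
    assignment -- assignment[j] is the index of the source used by segment j
    sources    -- the h source values (indexed from 0 here)
    """
    total = 0.0
    for j, (start, end) in enumerate(zip(bounds, bounds[1:])):
        w = sources[assignment[j]]
        total += sum((t - w) ** 2 for t in seq[start:end])
    return total


# k = 3 segments sharing h = 2 sources: the first and last segment reuse source 0.
cost = kh_segmentation_cost(
    seq=[0, 0, 5, 5, 1, 1],
    bounds=[0, 2, 4, 6],
    assignment=[0, 1, 0],
    sources=[0.5, 5.0],
)
# cost == 1.0 (four points at distance 0.5 from source 0, two points on source 1)
```
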
Results



(k,h)-segmentation problem is NP-hard for dimension $d>1$, for $L_1$ and $L_2$ metrics

Dimension $d=1$: complexity open

Simple approximation algorithms

$d=1$: 3-approximation for $L_1$

$d=1$: $\sqrt{5}$-approximation for $L_2$

$d>1$: $(3+\varepsilon)$-approximation for $L_1$, for any $\varepsilon>0$

$d>1$: $(A+2)$-approximation for $L_2$, where $A$ is the best approximation factor for k-means clustering

Very good performance in practice

The algorithms work for any generative model (not just reals with $L_p$ metrics)
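
The slides do not spell out the approximation algorithms. The reference to k-means suggests one natural heuristic: take a k-segmentation as given and cluster the segment means into h sources. The sketch below does this for one-dimensional data with a length-weighted Lloyd iteration. It is an assumption made for illustration, not the authors' algorithm, and it carries no approximation guarantee; the function name and the deterministic initialisation are also hypothetical.

```python
def cluster_segment_means(seq, bounds, h, rounds=50):
    """Group the segments seq[bounds[i]:bounds[i + 1]] into h sources.

    Returns (sources, assignment), where assignment[i] is the source index
    chosen for segment i. Uses weighted 1-D k-means on the segment means.
    """
    segments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    means = [sum(s) / len(s) for s in segments]
    sizes = [len(s) for s in segments]

    # Deterministic start: h distinct values spread over the sorted means.
    distinct = sorted(set(means))
    if not 1 <= h <= len(distinct):
        raise ValueError("h must lie between 1 and the number of distinct means")
    sources = [distinct[i * (len(distinct) - 1) // max(h - 1, 1)] for i in range(h)]

    def nearest(value):
        return min(range(h), key=lambda c: (value - sources[c]) ** 2)

    assignment = [nearest(m) for m in means]
    for _ in range(rounds):
        # Update step: each source becomes the length-weighted mean of its segments.
        for c in range(h):
            members = [i for i, a in enumerate(assignment) if a == c]
            weight = sum(sizes[i] for i in members)
            if weight:
                sources[c] = sum(sizes[i] * means[i] for i in members) / weight
        # Assignment step: stop once no segment changes its source.
        updated = [nearest(m) for m in means]
        if updated == assignment:
            break
        assignment = updated
    return sources, assignment


sources, assignment = cluster_segment_means(
    [0, 0, 0, 5, 5, 5, 0, 0, 5, 5], bounds=[0, 3, 6, 8, 10], h=2)
# sources == [0.0, 5.0] and assignment == [0, 1, 0, 1]
```

Weighting by segment length keeps the update consistent with the point-wise squared-error objective, since every point in a segment contributes equally.
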
Example: onomastic data


Names of lakes in Finland


About 150,000 lakes


What are the main trends?


High-dimensional marked point process



Collaboration with Research Center for the Languages
of Finland (Kotus)



Similar data analysis problems arise also in
environmental sciences


Clustering on the basis of the names of lakes

Similarity with the names of lakes in Kangasala

Example: paleontological data


Given a matrix of occurrences of species in fossil sites


Ages of the fossil sites are not available


How to order the sites according to their age?


Background information: species arrive and vanish


Try to find ordering that minimizes Lazarus events






Sites (rows, ordered by time) and species (columns):

A  B  C
0  0  1
1  1  0
1  0  1
0  1  0

[Figure annotation: Lazarus events]
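
The quantity being minimized can be sketched in a few lines. The code assumes that a Lazarus event is a 0 falling between a species' first and last occurrence in the chosen site order, a definition the slide leaves implicit. The function name is illustrative, and the exhaustive search is only there to show the idea on the tiny example matrix.

```python
from itertools import permutations


def count_lazarus_events(occurrences, order):
    """Count 0s lying between the first and last 1 of each species column.

    occurrences -- list of site rows, each a list of 0/1 flags per species
    order       -- permutation of site indices, oldest site first
    """
    events = 0
    for species in range(len(occurrences[0])):
        column = [occurrences[site][species] for site in order]
        present = [pos for pos, flag in enumerate(column) if flag]
        if present:
            span = column[present[0]:present[-1] + 1]
            events += span.count(0)
    return events


# The matrix from the slide: rows are sites, columns are species A, B, C.
sites = [[0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 0]]
as_listed = count_lazarus_events(sites, [0, 1, 2, 3])  # 2 events (one in B, one in C)
best = min(permutations(range(len(sites))),
           key=lambda order: count_lazarus_events(sites, order))
# count_lazarus_events(sites, best) == 0, e.g. for the order (0, 2, 1, 3)
```

Trying every permutation is feasible only for a handful of sites, which is why heuristic ordering methods are needed on real data.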

Methods


Spectral ordering: form a Laplacian of the co-occurrence matrix, look at eigenvectors



Fragments of order: find short segments of orders
which are not violated by observations



Other applications: text analysis, telecommunications

Fortelius, Jernvall, Gionis, Mannila, in preparation
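
A rough sketch of spectral ordering, under several assumptions that are not in the slides: site similarity is taken to be the number of shared species, the Laplacian is the unnormalised L = D - S, and sites are sorted by the eigenvector of the second-smallest eigenvalue (the Fiedler vector). That eigenvector is found here with a shifted power iteration so the example needs no external library. The function name and iteration count are illustrative.

```python
def spectral_order(occurrences, iterations=1000):
    """Order the rows (sites) of a 0-1 matrix by their Fiedler vector entries."""
    n = len(occurrences)
    if n < 2:
        return list(range(n))

    # Similarity of two distinct sites = number of species present in both.
    sim = [[0 if i == j else
            sum(a * b for a, b in zip(occurrences[i], occurrences[j]))
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in sim]
    # Unnormalised graph Laplacian L = D - S.
    lap = [[(degree[i] if i == j else 0) - sim[i][j] for j in range(n)]
           for i in range(n)]

    # Power iteration on (shift * I - L), restricted to vectors with zero mean,
    # converges to the eigenvector of the second-smallest eigenvalue of L.
    shift = 2 * max(degree) + 1  # exceeds the largest eigenvalue of L
    vec = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(vec) / n
        vec = [v - mean for v in vec]  # project out the constant eigenvector
        vec = [shift * vec[i] - sum(lap[i][j] * vec[j] for j in range(n))
               for i in range(n)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return sorted(range(n), key=lambda i: vec[i])


# Five sites whose species overlap in a chain, given in shuffled order.
shuffled = [[0, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
order = spectral_order(shuffled)
# order is the chain 1, 3, 0, 4, 2 or its reverse
```

The sign of an eigenvector is arbitrary, so the method recovers an ordering only up to reversal; deciding which end is older needs outside information.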


Future research directions


Theory and practice



The combination of continuous and combinatorial
methods


Concepts and algorithms for describing structure of
sequences


Methods for pattern discovery in and modelling of
spatiotemporal data


Theoretical models for data mining (such as inductive
databases)


Foundational issues in pattern discovery (e.g., logical
form of patterns and the difficulty in discovering them)



Publications, collaborations, software releases

Applications in the future


Genome structure and its relation to function



Linguistic applications: spatial and temporal variation in
language



Ubiquitous computing and telecommunications
applications



Paleontological and ecological applications

Mobile Computing Research at HIIT

Kimmo Raatikainen

Research Director

Helsinki Institute for Information Technology

kimmo.raatikainen@hiit.fi

Fuego Mission


To address the research
challenges arising in mobile
computing systems and
applications of tomorrow.


Mobile computing will fulfil the vision of ubiquitous (invisible) computing, providing access and services anytime, anywhere, and anyhow.


The key research challenges
are related to


context-awareness,


reconfigurability,


adaptability,


understanding user needs
and experience, and


personalization.


“Any technology distinguishable from magic is insufficiently advanced,” Gregory Benford

Present State


Some 20 researchers organised in two closely co-operating research groups


Mobile Computing Group (Prof. Kimmo Raatikainen)


User Experience Research Group (Prof. Martti Mäntylä)


Other senior researchers and post-docs:


Dr. Ken Rimey (software technologies, distributed computing)


Dr. Pekka Nikander, permanent visitor from Ericsson
Research (security and privacy in Mobile Internet)


Dr. Timo Saari (user experience research, media science)


Dr. Jan Lindström (distributed data management, mobile data)


Other postdocs likely to be hired 2004

Current Research Topics


Middleware for Mobile Wireless Internet


Fuego Core project


Mobile distributed event system


Mobile (XML-based) file system with intelligent synchronization


SOAP messaging over wireless (W3C: XML Binary Infoset)


Mobile Presence


Host Identity Protocol


Personal Distributed Information Storage


PDIS project


Synchronization-based peer-to-peer infrastructure for storage of structured XML data: PIM data, metadata for digital media


Context Recognition by User Situation Data Analysis


CONTEXT project


Bridge between User Experience Group at ARU and Adaptive
Computing Systems Group at BRU


Software Architectures for Configurable Ubiquitous Systems


Sarcous project by SoberIT at HUT


Managing the large variety of software products

Targets to 2005-2010 (1/3)


to enlarge and strengthen international co-operation


current: WWRF, UCB, Fraunhofer FOKUS


new: Japan, KCL/Mobile VCE, a European NoE, CMU, …


but not forgetting co-operation in Finland:


HUT, UHE, Tampere Univ Tech, Univ Oulu,
UIAH, …


to contribute to software architecture for Wireless World


to address challenges due to personal networking


minimal differences between solution stacks for ad-hoc communities and networked infrastructure


peer-to-peer, device-to-device solutions

Not in primary focus

Not in primary focus but perhaps later

(and other smart places)

Targets to 2005-2010 (2/3)


to put more focus on infrastructure for context-awareness and dynamic (end-user) systems


context modelling: presentation, maintenance,
sharing, protection, reasoning, and queries


decision rules for reconfiguration


reflective (self-aware) middleware for personal networking


Fault tolerance in Wireless World


traditional exception will be the usual case


compensations, delayed/delegated actions, …


Trust and privacy in Wireless World

Targets to 2005-2010 (3/3)


user needs and novel application concepts


human factors of the Wireless World


basic psychosocial mechanisms


what makes a service use experience engaging
and sustaining?


user-centric concept design (UCPCD)


process, methods, tools


novel application concepts based on context-awareness, other novel technologies


experience prototypes