Ubiquitous Data Mining




Dr. Susanna Pirttikangas


Intelligent Systems Group (ISG)

Dept. Electrical and Information Engineering

University of Oulu

Finland

Outline


Data mining, Ubiquitous computing


Ubiquitous Data Mining


Test Planning in UDM


Online Data Streams


Pattern Recognition


Visualization


Tools


Conclusions and Future directions



Data Mining

Scianta Intelligence: “Data Mining, also called Knowledge Discovery, is a
general term for a variety of interlocking technologies that, used together,
find, isolate, and quantify patterns hidden in large and often disparate
collections of data. As a general knowledge extraction process, its primary
goal is the discovery of nontrivial and potentially valuable [information] hidden in local
files, databases, and in repositories scattered across distributed networks.”


Ubiquitous Computing




People

Places

Networks

Services

Other machines

etc.





Improving human-machine interaction,

providing the right information in the right situation

From Henry Tirri’s presentation at PerComm 2007


What sort of raw (context) data management problem are we facing at Nokia?

A multidimensional (2-30) vector of real values

Frequency: 0.5 s - 1 day

Typically ”always-on”

A 1-4 M pixel image

Frequency: 10 min - week(s)

Very irregular, high-intensity bursts (many images within minutes)

A 100 K - 1 M sound file

Frequency: 1 min - days

Irregular; streaming

Naturally many application domains require a mixture of these

10 K phones, vector every 2 min results in 2.7 billion vectors/year

200 M phones, vector every 60 min results in about 10^12 vectors/year
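
The two volume estimates above can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and phones that report continuously; the function name is made up for illustration. Under those assumptions the results come out at roughly 2.6 billion and 1.75 x 10^12 vectors per year, which is in line with the rounded figures on the slide.

```python
# Back-of-the-envelope check of the yearly vector counts.
# Assumptions: 365-day year, every phone reports at a fixed interval all year.
MINUTES_PER_YEAR = 60 * 24 * 365


def vectors_per_year(phones: int, interval_minutes: int) -> int:
    """Total context vectors produced per year by a fleet of phones."""
    return phones * MINUTES_PER_YEAR // interval_minutes


small_fleet = vectors_per_year(10_000, 2)        # 10 K phones, one vector every 2 min
large_fleet = vectors_per_year(200_000_000, 60)  # 200 M phones, one vector every 60 min

print(f"{small_fleet:,}")
print(f"{large_fleet:,}")
```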

Association Rule Algorithm: Apriori

H. Mannila et al.

      Cigarettes  Diapers  Beer  Noodles  Juice
T1    1           0        1     1        0
T2    0           1        1     0        1
T3    1           1        1     0        1
T4    0           0        1     1        0
T5    1           0        0     0        0
T6    1           0        0     0        1

”A customer who buys beer and sausages will also buy diapers with a probability of 0.85.”

Whenever a transaction T contains X, then T probably also contains Y.
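
The rule form "T contains X, so T probably contains Y" can be made concrete on the six transactions of the table. The following is a minimal sketch of the level-wise Apriori idea (frequent itemsets of size k+1 are built only from frequent itemsets of size k), not the authors' implementation. The function names and the minimum count are illustrative assumptions.

```python
from itertools import combinations

# Transaction table from the slide (T1..T6), one row per transaction.
ITEMS = ["Cigarettes", "Diapers", "Beer", "Noodles", "Juice"]
ROWS = [
    [1, 0, 1, 1, 0],  # T1
    [0, 1, 1, 0, 1],  # T2
    [1, 1, 1, 0, 1],  # T3
    [0, 0, 1, 1, 0],  # T4
    [1, 0, 0, 0, 0],  # T5
    [1, 0, 0, 0, 1],  # T6
]
TRANSACTIONS = [frozenset(i for i, bit in zip(ITEMS, row) if bit) for row in ROWS]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(x, y, transactions=TRANSACTIONS):
    """Estimate of P(Y in T | X in T) for the rule X -> Y (X must occur at least once)."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


def apriori(min_count, transactions=TRANSACTIONS):
    """Return {itemset: count} for all itemsets occurring at least min_count times."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if count([i], transactions) >= min_count]
    frequent = {}
    while level:
        frequent.update({s: count(s, transactions) for s in level})
        size = len(level[0]) + 1
        # Join step: merge pairs of frequent itemsets that differ in one item.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: every (size-1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
                 and count(c, transactions) >= min_count]
    return frequent


frequent = apriori(min_count=2)
print(confidence({"Diapers"}, {"Beer"}))
```

In this small table every transaction with diapers also contains beer, so the rule Diapers -> Beer gets confidence 1.0, while the reverse rule Beer -> Diapers only reaches 0.5.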


Time   Location1 LivingRoom   Location1 Office
1      1                      0
2      1                      0
3      1                      0
4      1                      0

From transactions to continuous flow of data

Locationing system?

[Table residue: further columns Walking, TV on, Remote Used; sources: Activity Rec, Artefact Usage]

Nakamura et al.: Mana, 2007 (SWDMNSS)

[Figure: Mana architecture. Components: Real World Oriented Application, Query Processing, Database, Recognition, Synchronous Control, Sensor Cycle Set, Sensors]

Ubiquitous Data Mining (1/2)


Performing analysis of data in mobile, embedded and ubiquitous
devices.


Communication; network characteristics


Computation; intensive


Changes over time


Archiving


Energy consumption of mobile devices or sensors


Memory requirements


Result accuracy, data loss


Transferring and presenting results for the user


Security; sharing, privacy


Test Planning in UDM


User scenarios


What do we do with all the devices?


What devices do we utilize?


Sampling frequency


The equipment sets restrictions


What to collect?


How much to collect?


Pattern recognition

Online Data Streams: Segmentation Problem

Clear starting and ending point for an event

Thresholding + SSMM


OFFLINE: First a piecewise linear approximation of an example footstep pattern is constructed

ONLINE: When a sudden increase in the energy of the EMFi-signal is detected the pattern matching begins

A Viterbi-like algorithm is used to detect the occurrences of patterns similar (or similar enough) to the created footstep model
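
The online trigger described above (start matching when the signal energy rises suddenly) can be sketched as follows. This covers only the thresholding step; the SSMM and the Viterbi-like matching are not reproduced here. The window length, the energy ratio and the baseline update rule are assumptions chosen for illustration.

```python
from collections import deque


def detect_onsets(samples, window=4, ratio=3.0, floor=1e-9):
    """Return sample indices where short-term energy jumps above ratio x baseline.

    Energy is the mean of squared samples over the last `window` samples.
    The baseline is adapted only during quiet periods (illustrative choice).
    """
    recent = deque(maxlen=window)
    baseline = None
    active = False
    onsets = []
    for i, x in enumerate(samples):
        recent.append(x * x)
        if len(recent) < window:
            continue  # not enough samples for an energy estimate yet
        energy = sum(recent) / window
        if baseline is None:
            baseline = energy  # first full window defines the initial baseline
            continue
        if energy > ratio * max(baseline, floor):
            if not active:
                onsets.append(i)  # rising edge: pattern matching would start here
                active = True
        else:
            active = False
            baseline = 0.9 * baseline + 0.1 * energy
    return onsets


# Two synthetic bursts in a quiet signal.
signal = [0.1] * 20 + [1.0] * 5 + [0.1] * 20 + [1.0] * 5 + [0.1] * 10
print(detect_onsets(signal))
```

Each returned index marks a candidate starting point; a real system would hand the following samples to the pattern matcher to find the end of the event.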


(Body) Sensor Network,

Activity Recognition and Artefact User Identification

Pattern recognition

Theodoridis and Koutroumbas (1999):

”Pattern recognition is the scientific discipline whose goal is the classification of
objects into a number of categories or classes. Depending on the application,
these objects can be images or signal waveforms or any type of
measurements that need to be classified.”


input -> sensing -> segmentation -> feature extraction -> classification -> post-processing -> decision
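
As a rough illustration of how the stages chain together, the sketch below runs a one-dimensional signal through segmentation, feature extraction, classification and post-processing. The fixed window width, the variance threshold and the two class labels are hypothetical choices, not taken from the presentation.

```python
from statistics import mean, pvariance


def segment(samples, width):
    """Segmentation: split the stream into consecutive, non-overlapping windows."""
    return [samples[i:i + width] for i in range(0, len(samples) - width + 1, width)]


def extract_features(window):
    """Feature extraction: mean and variance of one window."""
    return (mean(window), pvariance(window))


def classify(features, var_threshold=0.5):
    """Classification: hypothetical rule, high variance means movement."""
    return "moving" if features[1] > var_threshold else "still"


def smooth(labels):
    """Post-processing: overwrite an isolated label that disagrees with both neighbours."""
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            out[i] = labels[i - 1]
    return out


def recognize(samples, width=4):
    """Full chain from sensed samples to a smoothed label sequence."""
    return smooth([classify(extract_features(w)) for w in segment(samples, width)])


# Movement with one short pause that post-processing removes.
stream = [2, -2, 2, -2] * 2 + [0, 0, 0, 0] + [2, -2, 2, -2] * 2
print(recognize(stream))
```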

Data collection


Collect in a natural environment?


Requires the direct observation by the researchers,


Is expensive and impossible for larger populations.


The diaries will include errors


The testees need to report their activities


The testees will forget to write activities down


MIT experience sampling method: requires interruptions


Some activities do not occur on a daily basis.


Ask the testees to do the activities


Semi-naturalistic data collection

Intille et al., MIT (2004)

The activities are disguised as goals in an obstacle course to minimize the
testees' awareness of data collection.

Data Collection Tools


The testee can determine


when to collect and


where to collect


The testee can detect if something went wrong (connection lost)

No need to carry a mobile device in the hand


Sound alerts for failure


Activity recognition


clean whiteboard


read a newspaper


stand still


sit and relax


sit and watch TV


drink


brush teeth


lie down


vacuum clean


type


walk


climb stairs


descend stairs


elevator up


elevator down


run


cycle


Feature Extraction and Selection


Know what you are dealing with


Between classes


What are the discriminative attributes for different classes


What are the common attributes for the same class


With many features: ”curse of dimensionality”

If too few features -> not enough information to describe the phenomena



If a very complex situation, calculate many features


Feature selection


Subset selection



branch-and-bound


forward search


backward search


Feasible to utilize a simple and light algorithm (kNN)
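
Forward search combined with a light classifier can be sketched in plain Python. The example below is one possible reading of the list above, not the method used in the presented studies: it scores each candidate feature subset by leave-one-out accuracy of a kNN classifier and greedily adds the feature that helps most. The data set and all function names are invented for illustration.

```python
def knn_predict(train, query, features, k=3):
    """Majority vote among the k nearest training vectors (squared Euclidean distance)."""
    def dist(vec):
        return sum((vec[f] - query[f]) ** 2 for f in features)

    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def loo_accuracy(data, features, k=3):
    """Leave-one-out accuracy of kNN restricted to the given feature indices."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, vec, features, k) == label
    return hits / len(data)


def forward_search(data, n_features, k=3):
    """Greedy forward selection: add features while the accuracy improves."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(loo_accuracy(data, selected + [f], k), f) for f in remaining]
        acc, f = max(scored, key=lambda s: (s[0], -s[1]))
        if acc <= best:
            break  # no remaining feature improves the score
        selected.append(f)
        remaining.remove(f)
        best = acc
    return selected, best


# Toy data: feature 0 separates the classes, feature 1 is noise.
data = [
    ([1.0, 9.0], "walk"), ([1.2, 1.0], "walk"), ([0.8, 5.0], "walk"), ([1.1, 3.0], "walk"),
    ([5.0, 8.5], "run"), ([5.2, 1.5], "run"), ([4.8, 5.5], "run"), ([5.1, 2.5], "run"),
]
print(forward_search(data, n_features=2))
```

On this toy data the search keeps only the discriminative feature and stops, because adding the noisy one cannot improve on a perfect score.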



Location Data, Visualization

<logentry>
  <header>
    <date>30-09-2003T14:29:44</date>
    <module>
      <name></name> <version></version>
    </module>
    <session>
      <id>216</id>
      <username>seppo</username>
    </session>
  </header>
  <body>
    <userAttributeChangeEvent>
      <location>
        <longitude>25.468917078116988</longitude>
        <latitude>65.0110523987453</latitude>
        <altitude>0.0</altitude>
        <floor>0</floor>
      </location>
    </userAttributeChangeEvent>
  </body>
</logentry>

Rotuaari: Location data


The following data was collected from the 1st field test

28.8-30.9.2003, ~200 users, log file size 14.7 MB (763367 lines)


18 shops created mobile ads



<logentry>
  <header>
    <date>30-09-2003T14:22:32</date>
    <module>
      <name></name>
      <version></version>
    </module>
    <session>
      <id>216</id>
      <username>seppo</username>
    </session>
  </header>
  <body>
    <userAttributeChangeEvent>
      <flyer_received>1061904953746</flyer_received>
    </userAttributeChangeEvent>
  </body>
</logentry>
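
Log entries of this shape can be read with Python's standard XML parser. The sketch below assumes the element names seen in the location entry and a day-month-year date format; the function name and the returned dictionary layout are illustrative, not part of the original logging system.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# One location entry, copied from the field-test log format shown in the slides.
SAMPLE = """
<logentry>
  <header>
    <date>30-09-2003T14:29:44</date>
    <module><name></name><version></version></module>
    <session><id>216</id><username>seppo</username></session>
  </header>
  <body>
    <userAttributeChangeEvent>
      <location>
        <longitude>25.468917078116988</longitude>
        <latitude>65.0110523987453</latitude>
        <altitude>0.0</altitude>
        <floor>0</floor>
      </location>
    </userAttributeChangeEvent>
  </body>
</logentry>
"""


def parse_logentry(xml_text):
    """Extract time, user, session id and (if present) location from one entry."""
    root = ET.fromstring(xml_text.strip())
    entry = {
        # Assumed date layout: day-month-year, then 'T', then the time of day.
        "time": datetime.strptime(root.findtext("header/date").strip(),
                                  "%d-%m-%YT%H:%M:%S"),
        "user": root.findtext("header/session/username").strip(),
        "session": int(root.findtext("header/session/id")),
    }
    location = root.find("body/userAttributeChangeEvent/location")
    if location is not None:  # other event types carry no location element
        entry["location"] = {child.tag: float(child.text) for child in location}
    return entry


entry = parse_logentry(SAMPLE)
print(entry["user"], entry["time"], entry["location"]["latitude"])
```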



Phases of Data Visualization

[Figure: data flow with the stages Raw Data, Load Subset, Loaded Data, Active Operation, Active Data, Execute, Bound, Show in UI]

The number of location measurements inside a cell is represented by a color

3077 measurements made inside the most crowded cell

User studies the range [1, 100]: 100 measurements give the maximum color (red)
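
The cell-count coloring can be sketched as two small steps: bin the measurements into a grid, then map each count onto a color scale that saturates at the upper end of the user-chosen range. The grid origin, the cell size and the linear scale are assumptions made for this example.

```python
from collections import Counter


def cell_counts(points, origin, cell_size):
    """Count (lon, lat) measurements per grid cell, indexed by (column, row)."""
    lon0, lat0 = origin
    return Counter(
        (int((lon - lon0) // cell_size), int((lat - lat0) // cell_size))
        for lon, lat in points
    )


def color_level(count, lo=1, hi=100):
    """Map a count to [0.0, 1.0]; None means the cell is left uncolored.

    Counts at or above `hi` saturate at 1.0, the maximum color.
    """
    if count < lo:
        return None
    return (min(count, hi) - lo) / (hi - lo)


# Hypothetical measurements near the coordinates in the sample log entry.
points = [(25.4689, 65.0110), (25.4690, 65.0111), (25.4789, 65.0110)]
counts = cell_counts(points, origin=(25.46, 65.01), cell_size=0.005)

print(counts)
print(color_level(3077))  # the most crowded cell saturates the scale
```

Clamping to the studied range keeps a single very crowded cell from washing out the differences between all the other cells.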

Examples for processing 3D-acceleration

Distinguishing a Robot from a Human, User Identification (1/4)

Distinguishing a Robot from a Human, User Identification (2/4)

Construct templates for different actors in the environment

[Figure: templates for Human and Robot, segments s1 to s5]

Pattern matching (segmentation) using piecewise linear model and SSMM method

Distinguishing a Robot from a Human, User Identification (3/4)

Decide which actor is moving in the environment

[Figure: trained classifier deciding between Robot and Human]


If human, perform user identification


User Identification (4/4)


Calculate the distinguishing features



Identify


After Finding the Interesting Information


Choose the best model


evaluation, train and test


Representation of Information ?


Personalize


The user sets all the preferences, the user is shown the updated context and is
allowed to choose the actions, or the application actively changes its
functionality based on context


Predict


Implement


Issues


Confidence of the recognition


Visualization of the situation


Let the user teach the device or the environment



Data Refinement for Data Reserves


Novel methodology to solve signal synchronization, fusion and feature
selection/dimensionality reduction and preprocessing online data streams
(available data defined in the introduction).


Common denominators for different situations in the data preprocessing
pipes, to enable the reuse of software and algorithms.

Error models for sensory equipment to enable quick feedback for/from the
data producers or device manufacturers.


Refined data for the data reserves.


Future Directions


Data streams


Smart archiving, compressive sensing


Online segmentation


Online algorithms


Adaptive models



Reliability



Plan carefully (placement of sensors, sampling frequency and resolution,
calibration, method selection)


Introduce the error


At system level


Fast prototyping (Davies, Pervasive Computing)


Develop for critical situations (war zones, refugee camps), utilize expert
knowledge


Share the code


Interdisciplinary research


linguistics, sociology, arts, etc.


Tools


Statistical Data Mining Tutorials


Andrew Moore, Carnegie Mellon,
http://www.cs.cmu.edu/~awm/


Matlab


Filtering, data preprocessing


Neural Network Toolbox


Bayes Net Toolbox


Hidden Markov Model Toolbox


WEKA


MIT’s LNKnet


neural network, statistical, and machine learning classification, clustering, and feature
selection algorithms


The Hidden Markov Model Toolkit (HTK), Cambridge University

B-Course, HIIT, Helsinki, http://b-course.cs.helsinki.fi/


SPSS, SAS


statistical analysis


classification trees


Clementine


CommonGIS


Thank You!

msp@ee.oulu.fi

http://www.ee.oulu.fi/isg