

Nov 20, 2013 (3 years and 11 months ago)


CS490/584 Data Mining

Homework 1


NAME: Jacob Adams

SID: 800165301


Briefly answer each of the following questions. (10 points each)


1. What is a "decision list"?

A: A set of rules learned from data that are intended to be interpreted in sequence.
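
To illustrate what "interpreted in sequence" means, here is a minimal sketch of a decision list. The rules themselves are hypothetical (written by hand for illustration, not learned from data), and the names `decision_list` and `classify` are assumptions. The point is that the first matching rule decides the class, so rule order matters.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, str]
Rule = Tuple[Callable[[Instance], bool], str]  # (condition, class label)

# Hypothetical rules, tried from top to bottom.
decision_list: List[Rule] = [
    (lambda x: x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "overcast", "yes"),
    (lambda x: True, "yes"),  # default rule: matches anything left over
]


def classify(instance: Instance, rules: List[Rule]) -> str:
    """Return the label of the first rule whose condition matches."""
    for condition, label in rules:
        if condition(instance):
            return label
    raise ValueError("no rule matched this instance")


# Both of the first two rules match this instance; the earlier one wins.
print(classify({"outlook": "overcast", "humidity": "high"}, decision_list))
```

If the first two rules were swapped, the same instance would be classified as "yes", which is why a decision list cannot be read as an unordered rule set.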

2. What does it mean to say that a set of rules is "complete"?

A: It means that at least one rule applies to every possible combination of attribute values.
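
Completeness can be checked by brute force when the attributes are few and discrete. The sketch below enumerates every combination and reports those that no rule covers. The attribute domains and rules are made up for illustration, and `uncovered` is a hypothetical helper name.

```python
from itertools import product

# Assumed attribute domains, for illustration only.
attribute_values = {
    "outlook": ["sunny", "overcast"],
    "humidity": ["high", "normal", "low"],
}

# Hypothetical rule conditions (class labels do not matter for completeness).
partial_rules = [
    lambda x: x["humidity"] == "high",
    lambda x: x["outlook"] == "overcast",
]


def uncovered(rules, domains):
    """Return every attribute combination that no rule condition matches."""
    names = list(domains)
    missing = []
    for combo in product(*(domains[name] for name in names)):
        instance = dict(zip(names, combo))
        if not any(condition(instance) for condition in rules):
            missing.append(instance)
    return missing


print(uncovered(partial_rules, attribute_values))
# Adding a default rule that matches everything makes the set complete.
print(uncovered(partial_rules + [lambda x: True], attribute_values))
```

The rule set is complete exactly when the returned list is empty.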


3. Briefly describe the differences between the following approaches for the integration of a data mining system with a database or data warehouse system: no coupling, loose coupling, semi-tight coupling, and tight coupling. State and explain which approach is most popular based on the information you can find on the web. Be sure to include the link.

A: No coupling means that the data mining system will not make use of a database or data warehouse at all. The data mining system must itself perform the cleaning, organizing, collecting, and transforming that the database or data warehouse would otherwise provide.


Loose coupling means that the data mining system will make use of some, but not all, of the operations provided by the database or data warehouse. These systems often use querying and indexing functionality, but not query optimization.


Semi-tight coupling means that the database or data warehouse is not only linked to the data mining system, but can also perform basic data mining tasks and store intermediate mining results in order to improve performance.


Tight coupling means that a data mining system is completely and smoothly integrated into the database/data warehouse system, such that the entire system is considered one functional component.


Tight coupling is preferred to loose coupling since it provides higher performance. One study even showed that tight coupling had almost a twofold performance advantage over loose coupling.

http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/kdd96_udf.pdf
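
As a rough illustration of the difference in where the work happens, the sketch below counts (humidity, play) pairs in two ways using Python's built-in `sqlite3` module. In the first, the database is only used to fetch rows and the counting runs in the application, which is closer to loose coupling. In the second, the counting is pushed into the database with `GROUP BY`, which only stands in for the idea of mining primitives running inside the database. The table, its contents, and the function names are assumptions, and this is not the method used in the linked study.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (humidity TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("high", "no"), ("low", "yes"), ("high", "no"), ("normal", "yes")],
)


def counts_in_application(connection):
    """Fetch every row, then count (humidity, play) pairs outside the database."""
    rows = connection.execute("SELECT humidity, play FROM weather").fetchall()
    return Counter(rows)


def counts_in_database(connection):
    """Let the database do the counting and return only the summary."""
    cursor = connection.execute(
        "SELECT humidity, play, COUNT(*) FROM weather GROUP BY humidity, play"
    )
    return Counter({(humidity, play): n for humidity, play, n in cursor})


print(counts_in_application(conn))
print(counts_in_database(conn))
```

Both functions return the same counts; the difference is how much data leaves the database, which is one reason tighter integration can perform better on large tables.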


4. Write a rule based on the following weather data. Note that your rule should (a) correctly classify one or more of the instances and (b) not misclassify any instance.


outlook    temperature  humidity  windy  play
sunny      hot          high      FALSE  no
sunny      hot          low       TRUE   yes
sunny      normal       high      TRUE   no
overcast   normal       normal    FALSE  yes


A: If humidity = high then play = no
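
The rule can be checked against both requirements with a short script. This is a sketch that encodes the four instances from the table by hand, with values lowercased for consistency; the names `data`, `rule`, `covered`, and `correct` are assumptions.

```python
# The four instances from the question, with values lowercased.
data = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "low", "windy": True, "play": "yes"},
    {"outlook": "sunny", "temperature": "normal", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "normal", "humidity": "normal", "windy": False, "play": "yes"},
]


def rule(instance):
    """If humidity = high then play = no; otherwise the rule does not fire."""
    return "no" if instance["humidity"] == "high" else None


covered = [row for row in data if rule(row) is not None]
correct = [row for row in covered if rule(row) == row["play"]]

print(len(covered), "instances covered,", len(correct), "classified correctly")
```

The rule fires on the two high-humidity instances and predicts "no" for both, matching their labels, so it satisfies (a) and (b).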


5. Discuss the differences and similarities between a data warehouse and a database.


A: Both are databases and both store data. Regular databases are intended to store the current state of the data, so both reads and writes are allowed. Data warehouses are designed for querying and analysis. They can contain data from several databases and can hold the state of the data over a period of time. After data is entered into a data warehouse, it is typically non-volatile.


6. Recent applications pay special attention to spatiotemporal data streams. A spatiotemporal data stream contains spatial information that changes over time, and is in the form of stream data, i.e., the data flowing in-and-out like possibly infinite streams.



(a) Present an application example of spatiotemporal data streams.


A: Highway traffic


(b) Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.


A: Outlier detection, anomaly detection, rare event detection, surprising patterns, concept drift, emerging events


(c) Identify and discuss the major challenges in spatiotemporal data mining.


A: A large amount of data is constantly being created, so the processing must either be limited or very efficient. The data also comes from an array of different places, which may themselves be changing. This means that some data sources may be slow to report or may not report at all. The data mining system needs to be flexible enough to handle this.


(d) Sketch a method to mine one kind of knowledge from such stream data efficiently.


A: If there were several speed sensors at various points along a highway, you could monitor the average speed of the traffic in each section. All that would have to be stored is the current average and the number of instances for each location. When a new instance at a location is recorded, a new average can easily be calculated and stored again. Short-term averages, such as over the last minute or hour, could be kept in a similar fashion. Unusually fast or slow times could be stored in a separate table for analysis later on. The average speed in each section could also be used to dynamically change the speed limit in adjacent sections of highway.
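
The per-section average described above can be maintained in constant memory with the incremental update new_mean = old_mean + (speed - old_mean) / n. The following is a minimal sketch under that assumption; the class name `SectionSpeedMonitor`, the section labels, and the `tolerance` threshold for flagging unusual readings are all hypothetical.

```python
from collections import defaultdict


class SectionSpeedMonitor:
    """Keeps only a count and a running mean speed per highway section."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(float)

    def is_unusual(self, section, speed, tolerance=20.0):
        """Flag a reading far from the section's current mean (assumed threshold)."""
        if self.count[section] == 0:
            return False  # nothing to compare against yet
        return abs(speed - self.mean[section]) > tolerance

    def update(self, section, speed):
        """Fold one new reading into the stored mean and return the new mean."""
        self.count[section] += 1
        n = self.count[section]
        self.mean[section] += (speed - self.mean[section]) / n
        return self.mean[section]


monitor = SectionSpeedMonitor()
for reading in (60.0, 70.0, 80.0):
    monitor.update("section-A", reading)
print(monitor.mean["section-A"])
print(monitor.is_unusual("section-A", 30.0))
```

Because each update touches only two numbers per section, the memory cost does not grow with the length of the stream. A short-term average over the last minute or hour would need a different structure, such as a bounded queue of recent readings.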


http://www.academypublisher.com/jcp/vol01/no03/jcp01034350.pdf

http://www.springerlink.com/content/k3hq90812024777m/

http://www.cs.purdue.edu/research/technical_reports/2006/TR%2006-020.pdf

http://www-users.cs.umn.edu/~mokbel/demos/PlaceDemo5.pdf


You are encouraged to use info from the web for this question. Be sure to include the link.


Due Friday, January 23rd at 10 AM. Submit a softcopy on Moodle @ classes.cs.siue.edu