Theoretical questions

siberiaskein, Data Management

20 Nov 2013


Master (Science) 2005:


(1a) Explain the meaning of data mining.

Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.

(1b) Describe briefly the KDD (Knowledge Discovery in Databases) process.

- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation
  - Find useful features, dimensionality/variable reduction, invariant representation
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge



(1c) List 6 sample methods commonly used in data mining.

- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion

3b. Briefly outline the major steps of decision tree classification.

Basic algorithm (a greedy algorithm)

- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning

- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
- There are no samples left
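The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming categorical attributes stored in dictionaries and information gain as the selection measure; the function names and the tiny weather-style data set are hypothetical and not part of the original notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the categorical attribute `attr`."""
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs, default=None):
    # Stop: no samples left, so fall back to the caller's default class.
    if not rows:
        return default
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes, so majority voting labels the leaf.
    if not attrs:
        return majority
    # Greedy step: choose the attribute with the highest information gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {"attr": best, "children": {}, "default": majority}
    remaining = [a for a in attrs if a != best]
    # Divide and conquer: one recursive call per observed attribute value.
    for value in set(row[best] for row in rows):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        tree["children"][value] = build_tree(sub_rows, sub_labels, remaining, majority)
    return tree

def classify(tree, row):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Hypothetical toy data: the class depends only on "windy".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "stay", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

On this toy data the root test is "windy", because splitting on it removes all class uncertainty while splitting on "outlook" removes none.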

4a. The k-Means algorithm relies on iterating between two steps. List these two steps succinctly.

- Assign each data point to the closest cluster centre (centroid). That data point is now a member of that cluster.
- Calculate the new cluster centre (the geometric average of all the members of a certain cluster).
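The two steps can be written down directly. The following is a minimal sketch, assuming Euclidean distance, points given as tuples, and initial centroids chosen by hand rather than at random; the function names and sample points are illustrative assumptions.

```python
import math

def assign(points, centroids):
    """Step 1: for every point, the index of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, assignment, centroids):
    """Step 2: move each centroid to the mean of its current members."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == j]
        if not members:
            # Empty cluster: keep the old centre (one possible convention).
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Alternate the two steps until no point changes cluster."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = assign(points, centroids)
        if new_assignment == assignment:
            break
        assignment = new_assignment
        centroids = update(points, assignment, centroids)
    return centroids, assignment

# Hypothetical sample data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, assignment = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

With these points the loop stops after the second assignment step, since no point changes cluster once the centroids have moved to the group means.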


4b. Name two computational limitations of the k-Means clustering algorithm.

- Applicable only when the mean is defined; then what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable to discover clusters with non-convex shapes

5b. The association rule partition (not apriori) method divides the data set to mine into p sets (D1 ... DP), applying a modified apriori algorithm to each. What frequent itemset property does this algorithm exploit?

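Any subset of a frequent itemset must be frequent (the Apriori property, which is still used inside each partition). The partitioning itself relies on a related property: an itemset that is frequent in the whole data set must be frequent in at least one of the partitions.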
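A small experiment can make these properties concrete. The sketch below uses a brute-force itemset counter in place of a real Apriori implementation, and the transactions, the 50% support threshold and the two-way split are all made-up assumptions for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Brute force: every itemset whose relative support reaches min_support."""
    items = sorted(set().union(*transactions))
    return {
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if support(frozenset(candidate), transactions) >= min_support
    }

# Hypothetical transactions, split into p = 2 partitions.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}, {"c"}]
min_support = 0.5
partitions = [transactions[:3], transactions[3:]]

# Phase 1: itemsets that are frequent inside at least one partition.
local_candidates = set().union(*(frequent_itemsets(p, min_support) for p in partitions))

# Phase 2: one pass over the full data keeps only the truly frequent candidates.
confirmed = {c for c in local_candidates if support(c, transactions) >= min_support}

# Reference result computed directly on the whole data set.
global_frequent = frequent_itemsets(transactions, min_support)
```

Here `local_candidates` contains one itemset, {a, c}, that is frequent only locally; the second pass over the full data removes it, and `confirmed` matches `global_frequent`.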



Not answered:

3a. What is the difference between discrimination and classification? Between characterization and clustering? Between classification and prediction? For each of these pairs of tasks, how are they similar?

4c. Consider applying k-Means to a dataset that consists only of binary variables. How could you calculate distances between a centroid and an instance?

4d. What is the objective function of the k-Means algorithm?

4e. Name one way that the clusters found by the agglomerative clustering algorithm differ from those found by the k-means clustering algorithm.

5a. A fundamental assumption of basic classification algorithms is that the training and test set data distributions are stationary. What does this mean?

5c. For a given dataset with a specified minimum support and minimum confidence, an association rule algorithm finds the association rules A → B and B → C. Write these association rules as bounded conditional and joint probabilities.





Master (Science) 2006:


Question 2:

1. What is meant by an ‘outlier’? Of the following set of values, which is the outlier?

{0, 0.2, 0.5, 0.6, −0.1, 42, 0.67}.

Outlier: a data object that does not comply with the general behavior of the data.

The outlier value is 42.
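One way to check this answer mechanically is a robust deviation score. The sketch below is an illustration only: it assumes the modified z-score based on the median absolute deviation (MAD), with the commonly quoted cut-off of 3.5; neither the method nor the threshold comes from the exam question.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # all values (nearly) identical: nothing can be flagged this way
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

values = [0, 0.2, 0.5, 0.6, -0.1, 42, 0.67]
outliers = find_outliers(values)
```

The median is used instead of the mean because a single extreme value such as 42 pulls the mean and standard deviation towards itself, which can hide it in a set this small.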

2. What is the purpose of a ‘test set’? Give one advantage and one disadvantage of using a large test set.

The test set is used to estimate the accuracy of the classification rules.

The accuracy rate is the percentage of test set samples that are correctly classified by the model.
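The accuracy rate defined above is simple to compute. A minimal sketch follows, assuming the true labels of the test set and the model's predictions are available as two equally long lists; the spam/ham labels are made-up example data.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of test samples whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Hypothetical test set labels and model predictions.
test_labels = ["spam", "ham", "ham", "spam", "ham"]
predictions = ["spam", "ham", "spam", "spam", "ham"]
rate = accuracy(test_labels, predictions)
```

Four of the five predictions match, so `rate` is 80.0.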

3. What is the difference between ‘supervised’ and ‘unsupervised’ learning? Name an unsupervised learning algorithm and give an example of how it can be used in practical applications.

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


4. Explain the difference between ‘nominal’ and ‘continuous’ attributes. Give TWO examples of each type of
attribute.

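Nominal variables: a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.

Ordinal variables: the values have a meaningful order, e.g., gold, silver, bronze. Continuous attributes, by contrast, take real-valued measurements, e.g., height or temperature.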