Master (Science) 2005:
(1a) Explain the meaning of data mining.
Data mining (knowledge discovery from data):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
(1b) Describe briefly the KDD (Knowledge Discovery in Databases) process.
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
(1c) List 6 sample methods commonly used in data mining.
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
3b. Briefly outline the major steps of decision tree classification.
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
There are no samples left
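The steps above can be sketched in a few lines of Python. This is only an illustrative, ID3-style sketch under stated assumptions: attributes are already categorical, information gain is the selection measure, and the function names (entropy, information_gain, build_tree, classify) and the nested-dict tree layout are hypothetical choices made for this example, not part of the original answer.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the categorical attribute attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def majority(labels):
    """Most common class label (majority voting)."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, attrs):
    """Greedy, top-down, recursive construction of the tree."""
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes remain, so the leaf is labelled by majority vote.
    if not attrs:
        return majority(labels)
    # Select the test attribute with the highest information gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    node = {"attr": best, "default": majority(labels), "children": {}}
    remaining = [a for a in attrs if a != best]
    # Partition the examples on the selected attribute and recurse.
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node["children"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return node


def classify(tree, row):
    """Walk the tree; an unseen attribute value (no samples) falls back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


# Toy data, invented for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(classify(tree, {"outlook": "rain", "windy": "no"}))
```

In this toy data the class depends only on "outlook", so that attribute has the highest gain and becomes the root test.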
4a. The k-Means algorithm relies on iterating between two steps. List these two steps succinctly.
Assign each data point to the closest cluster centre (centroid). That data point is now a member of that cluster.
Calculate the new cluster centre (the arithmetic mean of all the members of that cluster).
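A minimal sketch of the two alternating steps, assuming Euclidean distance, points given as tuples of numbers, and initial centroids supplied by the caller. The function names (assign, update, k_means) and the choice to leave an empty cluster's centre unchanged are assumptions made for this illustration.

```python
import math


def assign(points, centroids):
    """Step 1: for every point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, assignment, centroids):
    """Step 2: each centroid becomes the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == j]
        if not members:
            # Assumption: a cluster with no members keeps its previous centre.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate the two steps until the assignment stops changing."""
    centroids = [tuple(c) for c in centroids]
    assignment = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, assignment, centroids)
        new_assignment = assign(points, centroids)
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centroids, assignment


# Toy data, invented for illustration only.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignment = k_means(points, [(0, 0), (10, 10)])
print(centroids, assignment)
```

Note that k (here the length of the initial centroid list) has to be fixed before the loop starts, which is one of the limitations listed under 4b.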
4b. Name two computational limitations of the k-Means clustering algorithm.
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
5b. The association rule partition (not apriori) method divides the data set to mine into p sets (D1 … Dp), applying a modified apriori algorithm to each. What frequent itemset property does this algorithm exploit?
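Any subset of a frequent itemset must be frequent (the Apriori property, used within each partition).
For the partitioning itself: any itemset that is frequent in the whole data set must be frequent in at least one of the partitions.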
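Both properties can be checked by brute force on a toy data set. The sketch below is not the partition algorithm itself; it simply enumerates every candidate itemset, which is only feasible for a handful of items. The function names and the transactions are invented for illustration, and the same relative minimum support is assumed for the whole data set and for each partition.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute force: all itemsets whose relative support is at least min_support."""
    items = sorted(set().union(*transactions))
    result = set()
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / len(transactions) >= min_support:
                result.add(frozenset(candidate))
    return result


def is_downward_closed(itemsets):
    """True if every non-empty proper subset of every itemset is also present."""
    return all(
        frozenset(sub) in itemsets
        for s in itemsets
        for k in range(1, len(s))
        for sub in combinations(s, k)
    )


# Toy transactions, invented for illustration only.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
min_support = 0.5

global_frequent = frequent_itemsets(transactions, min_support)
print(is_downward_closed(global_frequent))

# Split into two partitions and mine each one with the same relative support.
partitions = [transactions[:2], transactions[2:]]
local_frequent = set()
for part in partitions:
    local_frequent |= frequent_itemsets(part, min_support)

# Every globally frequent itemset shows up as locally frequent somewhere.
print(global_frequent <= local_frequent)
```

The union of the locally frequent itemsets is therefore a safe candidate set: a second pass over the full data set only has to count those candidates.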
Not answered:
3a. What is the difference between discrimination and classification? Between characterization and clustering? Between classification and prediction? For each of these pairs of tasks, how are they similar?
4c. Consider applying k-Means to a dataset that consists only of binary variables. How could you calculate distances between a centroid and an instance?
4d. What is the objective function of the k-Means algorithm?
4e. Name one way that the clusters found by the agglomerative clustering algorithm differ from those found by the k-means clustering algorithm.
5a. A fundamental assumption of basic classification algorithms is that the training and test set data distributions are stationary. What does this mean?
5c. For a given dataset where minimum support is
and minimum confidence is
an association rule algorithm finds the association rules A → B and B → C. Write these association rules as bounded conditional and joint probabilities.
Master (Science) 2006:
Question 2:
1. What is meant by an ‘outlier’? Of the following set of values, which is the outlier? {0, 0.2, 0.5, 0.6, −0.1, 42, 0.67}.
Outlier: a data object that does not comply with the general behavior of the data.
The outlier value is 42.
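One way to make "does not comply with the general behavior" concrete is a robust score based on the median. The sketch below uses the modified z-score built on the median absolute deviation (MAD); the constant 0.6745 and the cut-off of 3.5 are a commonly quoted rule of thumb and are assumptions here, not part of the exam answer. The function name mad_outliers is hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Values whose modified z-score (based on the MAD) exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption: with zero spread around the median, flag nothing.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


values = [0, 0.2, 0.5, 0.6, -0.1, 42, 0.67]
print(mad_outliers(values))
```

A median-based score is used rather than the mean and standard deviation because a single extreme value such as 42 inflates both of those and can mask itself in a small sample.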
2. What is the purpose of a ‘test set’? Give one advantage and one disadvantage of using a large test set.
Test set is used to estimate the accuracy of the classification rules.
Accuracy rate is the percentage of test set samples that are correctly classified by the model.
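The accuracy rate defined above amounts to a one-line calculation. A minimal sketch, assuming the true and predicted labels of the test set are given as two equal-length lists; the function name accuracy and the sample labels are invented for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of test samples whose predicted label matches the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Three of the four toy predictions are correct.
print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```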
3. What is the difference between ‘supervised’ and ‘unsupervised’ learning? Name an unsupervised learning algorithm and give an example of how it can be used in practical applications.
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
4. Explain the difference between ‘nominal’ and ‘continuous’ attributes. Give TWO examples of each type of attribute.
Nominal Variables:
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Ordinal Variables:
Values have a meaningful order, e.g., gold, silver, bronze
Continuous Variables:
Take real-valued measurements, e.g., height, temperature