DATA MINING REPORT PHASE (1) Lamiya El_Saedi 220093158


20 Nov 2013



1.1: Introduction

1.2: Descriptions

1.2.1: White wine description

1.2.2: Breast tissue description

1.3: Conclusion





In this phase we discuss the first step in data mining, PREPROCESSING, on two datasets.

The first is a CSV file describing White Wine, and the other is an XLS file describing Breast Tissue.


We work in the RapidMiner program.


In data cleaning, we plot the data to understand it and to find the outliers.

In data transformation, we remove attributes (columns) that are correlated with each other and set roles to convert the target class from a regular attribute to a label.

In data reduction, we draw a sample from the large dataset.


Methods:



1- Discretize process:


In this method we choose quality as the target class. It takes values
from 0 to 10, and we discretize it into a new classification that
represents the quality of white wine from bad to excellent.


We added four classes:

Bad from -infinity to 3

Good from 4 to 5

Very good from 6 to 7

Excellent from 8 to 10
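The class boundaries above can be sketched outside RapidMiner as follows. This is only an illustration in plain Python, not the RapidMiner operator itself; the names `UPPER_LIMITS`, `LABELS` and `discretize_quality` are hypothetical, and the sketch assumes each class includes its upper limit (so a score of 3 is still "bad").

```python
from bisect import bisect_left

# Upper limits of the first three classes, taken from the class list above.
# Everything above the last limit (up to 10) falls into "excellent".
UPPER_LIMITS = [3, 5, 7]
LABELS = ["bad", "good", "very good", "excellent"]


def discretize_quality(score):
    """Return the nominal class for a numeric quality score."""
    if score > 10:
        raise ValueError("quality is defined on the 0 to 10 scale")
    # bisect_left counts the limits strictly below the score, so a score
    # equal to a limit stays in the lower class (3 -> bad, 5 -> good).
    return LABELS[bisect_left(UPPER_LIMITS, score)]


# Hypothetical scores, not rows from the white wine file.
for score in [3, 4, 6, 8]:
    print(score, discretize_quality(score))
```

Keeping the limits in one list means the classes can be changed without touching the lookup logic.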


Figure 1.2.1.1: The model of the discretize process

Figure 1.2.1.2: The output of the discretize method



Figure 1.2.1.3: Sample process and remove correlated attributes on the white wine dataset

Figure 1.2.1.5: Result of the sample process and remove correlated attributes on the white wine dataset
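The idea behind removing correlated attributes can be sketched in plain Python. This is a generic illustration, not RapidMiner's implementation: an attribute is kept only if its Pearson correlation with every attribute kept so far stays at or below a threshold. The threshold of 0.95, the function names and the sample values are all assumptions; the report does not state the settings that were used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def remove_correlated(columns, threshold=0.95):
    """Keep an attribute only if |r| with every kept attribute is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical attribute values, not taken from the white wine file.
data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],  # almost a multiple of "a"
    "c": [4.0, 1.0, 3.0, 2.0],
}
print(list(remove_correlated(data)))
```

Because attributes are visited in order, the first attribute of a correlated pair is the one that survives; a different column order can keep a different attribute.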



Figure 1.2.1.6: Filter example process on the white wine dataset

Figure 1.2.1.8: Sweet white wine based on Syria measurements

Figure 1.2.1.7: Non-sweet white wine based on Syria measurements



Figure 1.2.2.1: Outlier process on the breast tissue dataset

Figure 1.2.2.2: Plot of the outlier method on the breast tissue dataset

Figure 1.2.2.3: The row of outlier data
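The report does not say which outlier operator or parameters were used, so the following is only a sketch of one common distance-based approach: each row is scored by the distance to its k-th nearest neighbour, and the rows with the largest scores are flagged. The function names, the choice of k = 2 and the sample rows are assumptions made for illustration.

```python
import math


def knn_outlier_scores(rows, k=2):
    """Score each row by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, row in enumerate(rows):
        distances = sorted(
            math.dist(row, other) for j, other in enumerate(rows) if j != i
        )
        scores.append(distances[k - 1])
    return scores


def top_outliers(rows, n=1, k=2):
    """Return the indices of the n rows with the largest scores."""
    scores = knn_outlier_scores(rows, k)
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical two-attribute rows; the last one lies far from the rest.
rows = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 9.0)]
print(top_outliers(rows))
```

With real data the attributes would normally be scaled first, since an attribute with a large numeric range otherwise dominates the distance.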


Figure 1.2.2.4: Remove correlated attributes from the breast tissue dataset

Figure 1.2.2.5: The remaining attributes after executing the remove correlation process on the breast tissue dataset


1. The preprocessing phase is very important: it prepares your data for the next phases and gives you confidence that your data are correct.

2. You must import your dataset according to its file extension type.

3. When importing the attributes, you must choose the correct data type for each one so that you can work with it more flexibly.

4. A method may not suit another dataset, because each dataset has specific characteristics.

5. If a model contains a sample process, you can get different results every time you run it.
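Point 5 follows from the randomness of sampling, and a fixed random seed is the usual way to make a sample repeatable. The sketch below shows this in plain Python; `sample_rows` and the seed value are hypothetical, and the list of integers only stands in for dataset rows.

```python
import random


def sample_rows(rows, size, seed=None):
    """Draw `size` rows without replacement; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, size)


rows = list(range(100))  # stand-in for dataset row indices

# Same seed: the two samples are identical.
first = sample_rows(rows, 5, seed=7)
second = sample_rows(rows, 5, seed=7)
print(first == second)

# No seed: the sample usually changes from run to run.
print(sample_rows(rows, 5))
```

Using a private `random.Random` instance keeps the seed local to the sampling step, so other random operations in the same program are not affected.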