Analysis of Models

lovethreewayAI and Robotics

Oct 20, 2013 (3 years and 9 months ago)



Data Set Description

Within the data set, there are thirteen variables. The following is the description of each variable,
with the variable’s name as it appears in the data set:

Variable Abbreviation: Variable Name

Raw Identification: Identification number
Ad Relevance: Number of tokens (note 1) overlapping between query and ad, divided by number of tokens in query
Advertiser Ad Count: Experience level of advertisers, measured by the number of ads created in the [...]
Advertiser Descriptions Per Ad: Number of descriptions for the [...]
Advertiser Average Hit Rate: Average ratio between the clicks and number of [...]
Depth: Number of ads displayed to user in a [...]
Position: The order of the ad in the impression list
User Clickiness: Average click through rate of the user
Ad User Reach: Number of users that the ad is displayed to following a query
Ad Attraction: Average click through rate for the ad
Click Through Rate: Number of times that the ad is [...]
User Description Impressions: Number of times a particular description is displayed for the [...]
User Ad Impressions: Number of times an ad is displayed to the user

1 Tokens are the words in the query or advertisement
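
The ad relevance definition in the table can be illustrated with a minimal sketch. It assumes whitespace tokenization and case-insensitive matching, neither of which is stated in the data set description, and the function name `ad_relevance` is hypothetical.

```python
def ad_relevance(query: str, ad_text: str) -> float:
    """Share of query tokens that also appear in the ad text.

    Assumption: tokens are lower-cased, whitespace-separated words.
    """
    query_tokens = query.lower().split()
    if not query_tokens:
        return 0.0  # an empty query has no tokens to overlap
    ad_tokens = set(ad_text.lower().split())
    overlapping = sum(1 for token in query_tokens if token in ad_tokens)
    return overlapping / len(query_tokens)


# Two of the four query tokens ("cheap", "flights") appear in the ad.
print(ad_relevance("cheap flights to paris", "Book cheap flights today"))  # 0.5
```

A value of 1 would mean every word of the query appears in the ad, and 0 would mean none do.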


Data Stream:

Principal Components Analysis

Prior to the principal components analysis (PCA), the outliers and extremes in the data set were removed. The click through rate was used as the target, and the raw identification was set as an identification number variable.

Additionally, the data set was partitioned, with 60 percent of the data in the training partition and 40 percent in the testing partition. The following components were found:

Component Name



Variables in Component


User Description Impressions

User Ad Impressions

Average Click Through Rate

Advertiser Average Hit Rate

Ad Attraction





Ad Relevance

Advertiser Ad Count

Advertiser Descriptions Per Ad


Ad User Reach


User Clickiness
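
The preparation steps described before the PCA (dropping extreme values, then a 60/40 partition) could look roughly like the sketch below. The cutoff of three standard deviations, the fixed random seed, and the function and column names are all assumptions for illustration; the report does not state how outliers were defined or how the partition was drawn.

```python
import random
import statistics


def remove_outliers(rows, column, max_z=3.0):
    """Drop rows whose value lies more than max_z standard deviations from the mean.

    Assumption: max_z=3.0 is an illustrative cutoff, not taken from the report.
    """
    values = [row[column] for row in rows]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(rows)  # every value is identical, so nothing is extreme
    return [row for row in rows if abs(row[column] - mean) / stdev <= max_z]


def partition(rows, train_share=0.6, seed=42):
    """Shuffle the rows and split them into training and testing partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 99 ordinary records plus one extreme value.
rows = [{"raw_id": i, "user_ad_impressions": float(i % 10)} for i in range(99)]
rows.append({"raw_id": 99, "user_ad_impressions": 500.0})

cleaned = remove_outliers(rows, "user_ad_impressions")
train, test = partition(cleaned)
print(len(cleaned), len(train), len(test))  # 99 59 40
```

The raw identification is carried along in each row but never used as a predictor, which mirrors its role as an identification number variable.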

Neural Network Analysis

Utilizing the components found in the PCA, the following neural network model was created:

From this neural network, the most important predictor is the reach of the ad. The model also indicates that average click through rate, clicks, relevance, and impressions are important elements in predicting click through rates. Additionally, placement of the ad is slightly relevant. Judging by the lift charts, this model is better than randomly predicting click through rates, meaning that the model is useful in prediction.
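
To make the lift comparison concrete, the sketch below computes lift for the top-scored share of records. It assumes a binary clicked flag and a model score per record; the data are hypothetical and the function name `lift_at` is not from the report.

```python
def lift_at(scores, clicked, top_share=0.1):
    """Click rate among the top-scored records divided by the overall click rate."""
    base_rate = sum(clicked) / len(clicked)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no clicks")
    # Rank records from highest to lowest model score.
    ranked = sorted(zip(scores, clicked), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * top_share))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    return top_rate / base_rate


# Hypothetical scores and outcomes for ten impressions.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
clicked = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(lift_at(scores, clicked, top_share=0.2))
```

Here the top 20 percent of records are all clicks while the overall click rate is 30 percent, so the lift is about 3.33. A lift of 1 would mean the model does no better than selecting records at random.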


Utilizing the principal components analysis, the following CHAID tree was created:

The first variable used in the CHAID tree to break down the predictions is impressions. From the impressions, there are four segments that it breaks down into: less than -0.231 (the z-score value of impressions), -0.231 to 0.081, 0.081 to 0.687, and greater than 0.687. Out of these groups, 60 percent of the observations within the data set fell into the less than -0.231 impressions group. Using this 60 percent of the data, the model was further broken down using relevance, then clicks, and then relevance again. Other nodes in the impressions part of the tree were broken down by variables such as relevance, reach, clicks, and average click through rates. Out of the predictors used in the model, the most important variable was relevance. Relevance is even more important in this model than reach was in the neural network. However, the model does indicate that reach is the next most important variable to use as a predictor.

Considering the lift chart for the model, this model is better than predicting click through rates randomly.
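
The first split of the tree assigns each record to one of four segments by its impressions z-score. A minimal sketch of that assignment follows. The cut points come from the text, but reading the first one as negative is an assumption (the segments are only ordered if it is), and the sample values and names are hypothetical.

```python
import bisect
import statistics

# Cut points from the text. ASSUMPTION: the first is negative (-0.231).
CUT_POINTS = [-0.231, 0.081, 0.687]
LABELS = ["below -0.231", "-0.231 to 0.081", "0.081 to 0.687", "above 0.687"]


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(value - mean) / stdev for value in values]


def segment(z):
    """Return the label of the segment that a z-score falls into."""
    return LABELS[bisect.bisect_right(CUT_POINTS, z)]


# Hypothetical impression counts with a long right tail.
impressions = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40]
for raw, z in zip(impressions, z_scores(impressions)):
    print(raw, round(z, 3), segment(z))
```

With a right-skewed variable like this one, most records sit below the mean, which is consistent with the lowest segment holding the majority of the observations.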

Feature Selection

Separate from the PCA, following the removal of outliers and extremes and after implementing a partition, a feature selection node was used. From this, the only predictors excluded were depth, user clickiness, and advertiser ad count. Depth and user clickiness were close to a value of one, and in order to test their significance, each of the following models was also run with them included, but the results did not deviate substantially from the models without these variables. The following models exclude depth, user clickiness, and advertiser ad count.

Two Step

The first model used following the feature selection was the two step clustering. The following clusters were identified:

From this model, three clusters were formed. Cluster three is the largest; cluster one has 29.9 percent of the data, and cluster two has 9.3 percent. Cluster three relies primarily on ad user reach and ad relevance. Cluster one strongly relies on ad relevance and also relies on ad user reach. Cluster two relies strongly on position as well as on ad relevance.

To consider the significance of the cluster findings, the predictor importance was evaluated. From the evaluation, user ad impressions, user description impressions, ad attraction, ad user reach, position, advertiser average hit rate, advertiser descriptions per ad, and ad relevance were all strong predictors, with equal importance. Click through rate was also important, but not as important as these other components. To further evaluate this model, a statistical correlation was used, which shows that user description impressions, user ad impressions, and ad relevance have the strongest correlations in the model, all of which are statistically significant. However, the actual strength of these correlations is minimal.
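
A correlation can be statistically significant and still weak when the sample is large. The sketch below illustrates this with the standard t statistic for a Pearson correlation. It assumes a Pearson correlation was the measure used, which the report does not state, and the values of r and n are hypothetical rather than taken from the study.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r from n records is zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical values: a weak correlation measured on a large sample.
r, n = 0.05, 100_000
t = correlation_t_statistic(r, n)
print(round(t, 2), round(r * r, 4))
```

With these inputs the t statistic is roughly 15.8, far beyond any conventional significance threshold, yet r squared is only 0.0025, so the variable accounts for about a quarter of one percent of the variance.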

Gen Lin

The second model used following the feature selection was the Gen Lin. The following shows the results from the model:

In this model, the beta coefficients are all small, indicating that these variables have a minimal effect on predicting click through rates. Testing the significance of the coefficients, all are significant at the 0.001 level, with the exception of ad user reach, which is only significant at the 0.05 level. The most important predictors in the model were user description impressions and user ad impressions. Ad relevance was also important.


The third model used following the feature selection is the regression model. The following
indicates the impact of the variables within the model:

Judging by the beta coefficients in the model, it appears that each of the variables has very little impact on the likelihood of the ad being clicked. However, for the relationships suggested by the model, all of the variables except for ad user reach are statistically significant at the 0.001 level. In this model, the most important predictor for click through rates is user description impressions. User ad impressions and ad relevance are also important factors for the model.


After evaluation of the principal components analysis models, the most important predictors to use in predicting click through rates are impressions and reach. The impressions component includes user description impressions and user ad impressions, whereas reach includes ad user reach. The impressions component was only present in the CHAID tree, but the importance of this component was far greater than that of reach in either the CHAID tree or the neural network. Reach was an important predictor in both the CHAID tree and neural network models. Both the CHAID tree and neural network models had significant lift on the lift charts, meaning that the models are better than predicting at random.

The feature selection models included a two step, Gen Lin, and regression. The two step clustering created three clusters; each of these clusters was created primarily using relevance, and two of the clusters included ad user reach. Within the two step, the most important predictors were user ad impressions, user description impressions, ad attraction, ad user reach, position, advertiser average hit rate, advertiser descriptions per ad, and ad relevance. For the Gen Lin, the most important predictors were user description impressions and user ad impressions. In the regression model, the most important predictors were user description impressions and user ad impressions. Out of these three models, the common important predictors were user description impressions and user ad impressions.

Comparing the evaluation of the principal components analysis models with the feature selection models, the similarity was that impressions, in terms of user description impressions and user ad impressions, were the most significant variables. This is supported by all of the models used. Reach was another important variable when considering the PCA models, but the feature selection models mainly indicated user description impressions and user ad impressions as the most important predictors.
For SOSO to incorporate these findings in its strategic marketing plans, it needs to display the same advertisements, and the same descriptions of advertisements, to users repeatedly in order to increase ad click through rates. Frequency of impressions is most important, judging by the models used in this study. Considering the lift charts previously evaluated, these models are better than predicting click through rates randomly.