Project Paper

Desktop Weather Usage Mining

By Mark Gunnels
Desktop Weather Usage Mining

1. Introduction
   1.1. The Environment
   1.2. Goal of the Project
        1.2.1. First Client Meeting
        1.2.2. Second Client Meeting
        1.2.3. Third Client Meeting
        1.2.4. Fourth Meeting
        1.2.5. Final Goal Statement
2. Data
   2.1. Log Files
   2.2. Data Warehouse
3. Web Usage Mining
   3.1. Pre-Processing
        3.1.1. Usage Preprocessing
        3.1.2. Content Preprocessing
        3.1.3. Structure Preprocessing
   3.2. ETL Scripts
   3.3. Pattern Discovery
        3.3.1. Association Rules
        3.3.2. Clustering
        3.3.3. Classification
        3.3.4. Sequential Patterns
4. Appendix A
5. Appendix B
6. References


1. Introduction

1.1. The Environment

To expand its marketing reach and compete with other weather services such as Weather Bug, The Weather Channel created a desktop application called Desktop Weather that presents information about the weather and its effect on weather-dependent activities such as golf, baseball, and traffic. To distribute Desktop Weather, The Weather Channel has lined up several distribution partners such as NASCAR, Major League Baseball, Google, and Yahoo. Each distribution partner is paid a fee for every customer who downloads Desktop Weather.

The Weather Channel makes money from Desktop Weather through advertisements. Desktop Weather continuously runs on the user's personal computer in one of two modes -- maximized, for viewing the weather information, and minimized, where it sits in the system tray displaying the current temperature. When maximized, Desktop Weather displays advertisements delivered by OASIS, an open source advertisement delivery server. Essentially two types of advertisements are delivered to Desktop Weather users -- generic advertisements, which are selected at random, and targeted advertisements, which are selected based on the content being viewed. Each advertisement displayed on Desktop Weather earns The Weather Channel a fixed amount of money, with targeted advertisements being more lucrative.

To maximize the potential of Desktop Weather, The Weather Channel has recently added mechanisms to Desktop Weather to track its use. Every action a user takes results in Desktop Weather sending a URL request to a server, which logs the action, user id, and other parameters to plain text log files. To make this data actionable, The Weather Channel set up a database to house a subset of the logged data and developed a rough reporting tool called Analyticus. Unfortunately, Analyticus has proven to be too rudimentary, housing information that is not granular enough to answer important usage questions and lacking visualization functionality for the information that is present. In some cases, Analyticus has provided faulty or misleading information.


1.2. Goal of the Project

The project's goal evolved over four client meetings.

1.2.1. First Client Meeting

Representatives from The Weather Channel's Software Development Department, Product Management Department, and Upper Management attended. The meeting centered on introducing Desktop Weather, OASIS, Analyticus, and each department's contributions to Desktop Weather. During the discussions, it became apparent that a concrete goal for the project had not been decided upon. Essentially, the department representatives expressed that information that would assist in making Desktop Weather a success was captured in Analyticus, but they didn't know how to get it out and didn't know exactly what they wanted. Additionally, the resource who best knew Analyticus wasn't invited to the meeting, so his input was lacking during this crucial discussion.

After much deliberation, Upper Management expressed a desire to understand why their customer churn was so high. The Weather Channel's definition of churn for Desktop Weather is a user who has been absent from the logs for 30 days. Downloads of Desktop Weather were at an all-time high, but it appeared that customers were churning off at the same rate that they were downloading, so the customer base seemed to be in a rough equilibrium. After more discussion, it was learned that the 30 day churn definition was an arbitrary one and that Upper Management wasn't sure that it was valid.

The first action item decided upon was for Mark Gunnels to get familiar with Analyticus and to validate the 30 day churn definition. The second action item would be to provide some form of clustering or classification around churned versus non-churned users.

Lessons learned and observations made from this meeting include:

- Ensure that the proper resources are available during the first meeting. In the case of data mining and business intelligence discussions, it is critical that someone who understands the data and how it is currently housed and accessed is available.

- The first client meeting outline provided in Dr. Nargundkar's slides makes a very good agenda.

- Bring your own agenda. The clients might not have one.

- Business Intelligence and Data Mining projects are very different from other IT projects. Other IT projects often have their goals predefined. Business Intelligence and Data Mining projects may not.

1.2.2. Second Client Meeting

Touring Analyticus filled most of the second meeting, including a review of its schema and its existing reports. The Analyticus tables and the log files for August and September were provided to Mark Gunnels.

1.2.3. Third Client Meeting

Mark Gunnels presented his investigations into the current customer churn definition. His findings were as follows:

- Analyticus data was not correct. Its counts didn't necessarily correspond to the data in the log files. Even if the churn definition was valid, the numbers informing decisions were not. This, of course, was indicative of a larger problem.

- The value of a customer did not lie within the length of his use of Desktop Weather but in the number of ads he viewed. Analyticus failed to harvest ad events from the log files. It only captured installation and automatic events. These events did not truly represent an active customer. For instance, one user who was deemed by Analyticus as “active” for all two months investigated triggered no advertisement events and therefore was less lucrative than a customer who was active for only two days but had seventeen advertisement events.

A more accurate categorization of customers was not decided upon, but two promising possibilities are:

- A categorization based on their lucrativeness.

- A categorization based upon their activity levels.

Mark Gunnels then presented his idea for a data warehouse that would house all Desktop Weather events generated and allow for greater accuracy and granularity than Analyticus. Additionally, he reviewed other types of analysis that would be made possible with this data warehouse. During this meeting, The Weather Channel acknowledged Analyticus as a dead end and scheduled another meeting with Mark Gunnels and their System Administration staff to put the Data Warehouse in place.

1.2.4. Fourth Meeting

The Weather Channel arranged for a powerful HP-UX box to be built out to house the data warehouse, which will be enabled by MySQL, an open source database. During this meeting, The Weather Channel also committed to other goals beyond user classification. Those goals are discussed in the final goal statement section.

1.2.5. Final Goal Statement

It became apparent over the course of the four meetings that, broadly stated, The Weather Channel was interested in:

- Who is using Desktop Weather,

- How are they using Desktop Weather,

- And are we making money?

Starting with that broad understanding, Mark Gunnels performed research into Data Mining project models that best addressed the stated needs. After much consideration, Web Usage Mining appeared to best fit the desires expressed by The Weather Channel. Though Desktop Weather is not a true browser-based application, its structure and logging mechanism directly mirror a web site, so it was agreed that Web Usage Mining is appropriate. The final project goal statement, taken directly from a paper on Web Usage Mining, is:

    Analyzing and exploring regularities in the behavior of users accessing a web site can improve system performance, enhance the quality and delivery of Internet information services to the end user, and identify population of potential customers for electronic commerce. [8]

The development steps described in [5], [6], [7], and [8] have come to serve as the project plan.



2. Data

2.1. Log Files

The log files that will inform this analysis constitute about 84 gigabytes of text. Each line of a log file captures one Desktop Weather generated action. Each log file contains about 76,000 lines. For each day, 96 log files are generated.

Each line takes the following format:

    ip address of user's pc^unix system date^/product id/action id?id=user id&various other name value pairs

For example, the following is a line from the first file processed:

    217.219.5.3^1129694400^/0/29?id=1012824020&cobrand=freeze&instby=freeze3&reg=1&loc=USAK0173&loctype=1&ver=4.2506&rnd=19555

As can be seen, the data captures four important elements in Usage Mining -- the content being viewed, the structure of the application's content (which can be inferred by a series of actions), the usage by a particular user, and information about the user. [5]
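
As a minimal illustration of how such a line decomposes (a sketch only, not one of the production ETL scripts in Appendix B; the variable names are descriptive, not official):

    # Sketch: decompose one Desktop Weather log line into its parts.
    # The '^' character separates the IP, the Unix timestamp, and the request.
    line = "217.219.5.3^1129694400^/0/29?id=1012824020&cobrand=freeze&loc=USAK0173"

    ip, timestamp, request = line.split('^')
    path, query = request.split('?')
    product_id, action_id = path.split('/').reject { |s| s.empty? }

    # Build a hash from the name=value pairs after the '?'.
    params = {}
    query.split('&').each do |pair|
      name, value = pair.split('=')
      params[name] = value
    end

    puts "user #{params['id']} performed action #{action_id} at #{Time.at(timestamp.to_i)}"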


2.2. Data Warehouse

The Data Warehouse that will serve as the repository of the information contained in the log files, plus other information culled from outside sources, will have the following structure:



[Star schema diagram. The tables and columns are:]

Event_Fact: product_key (PK, FK2), customer_key (PK, FK1), action_key (PK, FK4), date_key (PK, FK3), time_key (PK, FK5)

Customer_Dimension: customer_key (PK), age, year_of_birth, city, state, county, zip, paid_subscriber

Product: product_key (PK), product_type, version, cobrand, instby

Action_Dimension: action_key (PK), description, monetization, vertical_view

date_dimension: date_key (PK), mysql_date, day_of_week, day_number_in_month, day_number_overall, week_number_in_year, month, quarter, holiday_flag, weekday_flag, season, year

time_dimension: time_key (PK), mysql_time, hour, minute, second, time_of_day

The SQL DDL can be found in Appendix A. This star schema originated from careful studies of [1], [2], [3], and [4].
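
As a sketch of the kind of reporting the star schema enables, the following hypothetical query totals monetization by cobrand and month. It reuses the connection settings from the Appendix B scripts and the table names from the Appendix A DDL; it is illustrative, not one of the delivered scripts:

    require 'rubygems'
    require 'active_record'

    # Connection settings mirror the ETL scripts in Appendix B.
    ActiveRecord::Base.establish_connection(
      :adapter  => 'mysql',
      :host     => '127.0.0.1',
      :database => 'desktopweatherwarehouse',
      :username => 'root',
      :password => '')

    # Join the events fact table to its dimensions and aggregate.
    rows = ActiveRecord::Base.connection.select_all(<<SQL)
    SELECT b.cobrand, d.year, d.month, SUM(e.monetization) AS revenue
    FROM events e
    JOIN brands b      ON b.id = e.cobrand_id
    JOIN event_dates d ON d.id = e.event_date_id
    GROUP BY b.cobrand, d.year, d.month
    ORDER BY revenue DESC
SQL

    rows.each { |r| puts "#{r['cobrand']} #{r['year']}-#{r['month']}: #{r['revenue']}" }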




3. Web Usage Mining

Web Usage Mining has three phases:

- Pre-processing,

- Pattern discovery,

- And pattern analysis. [5]

This section will describe the pre-processing steps already undertaken and the planned pattern discovery.

3.1. Pre-Processing

To perform Web Usage Mining, three types of pre-processing must be performed:

- Usage Preprocessing,

- Content Preprocessing,

- And Structure Preprocessing. [5]

3.1.1. Usage Preprocessing

Usage Preprocessing involves identifying which particular actions belong to which user. Normally this is the most difficult step, because HTTP is a stateless environment with only cookies serving to identify a customer session, but it proved to be easy in Desktop Weather's case because the customer identifier is built into the logging format.
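
Even with user ids in hand, individual actions still need to be grouped into sessions. A minimal sketch of that grouping, assuming events have already been parsed into [user id, Unix timestamp] pairs and using a hypothetical 30-minute inactivity timeout (a common heuristic, not a Weather Channel-mandated value):

    SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that closes a session

    # events: array of [user_id, unix_timestamp] pairs parsed from the logs
    def sessionize(events)
      sessions = Hash.new { |h, k| h[k] = [] }  # user_id => list of sessions
      events.sort_by { |user, t| [user, t] }.each do |user, t|
        current = sessions[user].last
        if current and t - current.last <= SESSION_TIMEOUT
          current << t              # continue the open session
        else
          sessions[user] << [t]     # start a new session
        end
      end
      sessions
    end

    sample = [[1012824020, 1129694400], [1012824020, 1129694500], [1012824020, 1129700000]]
    p sessionize(sample)  # => {1012824020=>[[1129694400, 1129694500], [1129700000]]}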

3.1.2. Content Preprocessing

Content preprocessing involves classifying content based on “its topics or intended use.” [5] To enable content preprocessing, Mark Gunnels went through and tagged each Desktop Weather Action with both a content tag and an intended use tag, which will be stored in the vertical field of the Action table.

3.1.3. Structure Preprocessing

Different methods are being investigated to enable structure preprocessing. Some form of directed graph data structure will most likely be used.

3.2. ETL Scripts

To perform the pre-processing, several Extract, Transform, and Load (ETL) scripts were developed. The ETL scripts are written in Ruby, an open-source programming language, and make use of ActiveRecord from the Rails framework to ease committing information into the data warehouse. The ETL scripts are listed in Appendix B.

In particular, the ETL scripts perform the following functions:

- Cycle through a set of directories, evaluating each log file found there.

- Parse the individual Action Events and store them in their respective tables.

- Log invalid Action Events for later review. (Note: one bug in Desktop Weather's Action Event mechanism has been located via this logging.)

- Classify each Action Event into the taxonomy discussed in the Data section.

- On Action Events that contain the location information of the user, perform the necessary translation of the location codes using the Location table. Additionally, the script screen scrapes information from Yahoo that provides interesting information on that location, such as:

  - Population
  - Median Age
  - Cost of Living Index
  - Average Winter Temperature
  - Air Quality Index


One improvement currently being engineered into the ETL scripts is to take advantage of some idle desktops that will be loaned to the initial loading effort. The mechanism is Ruby Queue, which “provides an extremely simple, economic, and easy-to-understand tool that harnesses the power of many CPUs while simultaneously allowing researchers to shift their focus away from the mundane details of complicated distributed computing systems and back to the task of actually doing science.” [9] Essentially, Ruby Queue implements the Tuple Master-Worker design pattern. One desktop, the master, will read the data files and place individual Action Events into a shared memory for the other desktops, the workers, to retrieve an individual action, process the action, and commit the action back to shared memory.
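
The pattern itself can be sketched on a single machine with Ruby's built-in thread-safe Queue standing in for Ruby Queue's shared tuple space (a simplified stand-in, not Ruby Queue's actual API; the file name is hypothetical):

    require 'thread'

    work_queue = Queue.new
    NUM_WORKERS = 4

    # Master: read action events and place them on the shared queue.
    master = Thread.new do
      File.open("sample.log") { |f| f.each_line { |line| work_queue << line } }
      NUM_WORKERS.times { work_queue << :done }   # poison pills to stop workers
    end

    # Workers: pull individual actions off the queue and process them.
    workers = (1..NUM_WORKERS).map do
      Thread.new do
        while (line = work_queue.pop) != :done
          # ... parse the line and commit the event, as in the ETL scripts ...
        end
      end
    end

    master.join
    workers.each { |w| w.join }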


3.3. Pattern Discovery

After the data pre-processing is complete, work on the pattern discovery phase may begin. The literature suggests several data mining methods and algorithms as appropriate for processing usage information. They include:

3.3.1. Association Rules

Often referred to as Market Basket Analysis, Association Rules analysis examines a list of “transactions” -- in the case of Desktop Weather, the content accessed by a user in a particular session -- and determines which items most often occur together with “a support value exceeding some specified threshold.”


Performing Association Rules analyses will allow The Weather Channel to understand what content and services of Desktop Weather are most often used jointly and “may serve as a heuristic for prefetching documents in order to reduce perceived latency.” [5] [10] provides a very capable Association Rules code set that produces results such as:

    87 rules with support higher than or equal to 0.400 found.

    supp  conf  rule
    0.888 0.984 engine-location=front -> fuel-type=gas
    0.888 0.901 fuel-type=gas -> engine-location=front
    0.805 0.982 engine-location=front -> aspiration=std
    0.805 0.817 aspiration=std -> engine-location=front
    0.785 0.958 fuel-type=gas -> aspiration=std
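
To show the underlying computation, here is a naive Ruby sketch over hypothetical per-session content sets (an illustration of support and confidence only, not the Orange implementation):

    # Sketch: pairwise association rules over per-session content sets.
    # sessions is an array of item arrays, e.g. content tags viewed per session.
    def association_rules(sessions, min_support)
      n = sessions.length.to_f
      rules = []
      items = sessions.flatten.uniq
      items.each do |a|
        items.each do |b|
          next if a == b
          supp_a  = sessions.count { |s| s.include?(a) } / n
          supp_ab = sessions.count { |s| s.include?(a) and s.include?(b) } / n
          next if supp_ab < min_support
          rules << [supp_ab, supp_ab / supp_a, "#{a} -> #{b}"]
        end
      end
      rules.sort.reverse
    end

    sessions = [%w(golf radar), %w(golf radar traffic), %w(radar traffic), %w(golf)]
    association_rules(sessions, 0.4).each { |s, c, r| printf("%.3f %.3f %s\n", s, c, r) }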


3.3.2. Clustering

Clustering is “the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait -- often proximity according to some defined distance measure.” [11] Two types of clusters that will serve the project goals are:

- Usage Clusters, to “establish groups of users exhibiting similar browsing patterns”. [5]

- Page Clusters, to establish “[g]roups of pages having related content”. [5]

Though this might seem duplicative of Association Rules analysis and Classification analysis, Cluster analysis often uncovers unexpected results because it is often entered into with no preconceived notions.
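
As a toy illustration of the mechanics, a one-dimensional k-means sketch in Ruby; the feature (advertisement events per week) and the values are hypothetical, and real usage clusters would use richer features:

    # Minimal 1-D k-means: cluster users by advertisement events per week.
    def kmeans(values, k, iterations = 20)
      sorted = values.sort
      # Seed centroids at evenly spaced points in the sorted data (assumes k >= 2).
      centroids = (0...k).map { |i| sorted[i * (sorted.size - 1) / (k - 1)] }
      clusters = nil
      iterations.times do
        # Assignment step: each value joins its nearest centroid's cluster.
        clusters = Array.new(k) { [] }
        values.each do |v|
          nearest = (0...k).min_by { |i| (v - centroids[i]).abs }
          clusters[nearest] << v
        end
        # Update step: move each centroid to the mean of its cluster.
        centroids = clusters.each_with_index.map do |c, i|
          c.empty? ? centroids[i] : c.inject(:+) / c.length.to_f
        end
      end
      clusters
    end

    ad_events_per_week = [0, 1, 1, 2, 15, 17, 18, 40, 45]
    p kmeans(ad_events_per_week, 3)  # => [[0, 1, 1, 2], [15, 17, 18], [40, 45]]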

3.3.3. Classification

Once the dependent variable by which to classify users is arrived at, classification analysis will be performed to generate a set of rules that will allow The Weather Channel to map their customer base to “one of several predefined classes.” [12] provides a set of classification algorithms that generate such output as:

    J48 pruned tree
    ------------------

    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)
    outlook = overcast: yes (4.0)
    outlook = rainy
    |   windy = TRUE: no (2.0)
    |   windy = FALSE: yes (3.0)


3.3.4. Sequential Patterns

Sequential Pattern analysis “attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes.” [5] This analysis predicts future visit patterns, which will be helpful in placing advertisements aimed at certain user groups and in preloading content to enhance application performance.
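
A naive sketch of one such pattern count -- how often one action is later followed by another within time-ordered sessions -- using hypothetical action names (an illustration only, not a full sequential pattern algorithm):

    # Count ordered action pairs ("A is later followed by B") across sessions.
    def frequent_followers(sessions, min_count)
      counts = Hash.new(0)
      sessions.each do |session|
        pairs = []
        session.each_with_index do |action, i|
          session[(i + 1)..-1].each { |later| pairs << [action, later] if later != action }
        end
        pairs.uniq.each { |pair| counts[pair] += 1 }  # count each pair once per session
      end
      counts.select { |pair, c| c >= min_count }
    end

    sessions = [%w(radar golf ad), %w(radar ad), %w(golf ad)]
    frequent_followers(sessions, 2).each { |(a, b), c| puts "#{a} -> #{b}: #{c}" }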


4. Appendix A

DROP TABLE IF EXISTS `desktopweatherwarehouse`.`brands`;
CREATE TABLE `desktopweatherwarehouse`.`brands` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `cobrand` varchar(10) NOT NULL default '',
  `instby` varchar(10) NOT NULL default '',
  PRIMARY KEY (`id`)
) TYPE=InnoDB;

DROP TABLE IF EXISTS `desktopweatherwarehouse`.`customers`;
CREATE TABLE `desktopweatherwarehouse`.`customers` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `age` int(10) unsigned NOT NULL default '0',
  `city` varchar(45) NOT NULL default '',
  `state` varchar(45) NOT NULL default '',
  `zip` varchar(45) NOT NULL default '',
  `paid_subscriber` tinyint(1) NOT NULL default '0',
  `gender` char(1) NOT NULL default '',
  `install_id` int(10) unsigned NOT NULL default '0',
  `location` varchar(45) NOT NULL default '',
  PRIMARY KEY (`id`)
) TYPE=InnoDB;

DROP TABLE IF EXISTS `desktopweatherwarehouse`.`desktop_actions`;
CREATE TABLE `desktopweatherwarehouse`.`desktop_actions` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `description` text,
  `monetization` double default '0',
  `true_action_id` int(10) unsigned default '0',
  PRIMARY KEY (`id`)
) TYPE=InnoDB;

DROP TABLE IF EXISTS `desktopweatherwarehouse`.`event_dates`;
CREATE TABLE `desktopweatherwarehouse`.`event_dates` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `mysql_date_time` datetime NOT NULL default '0000-00-00 00:00:00',
  `day_of_week` int(10) unsigned NOT NULL default '0',
  `day_number_in_month` int(10) unsigned NOT NULL default '0',
  `day_number_in_year` int(10) unsigned NOT NULL default '0',
  `month` int(10) unsigned NOT NULL default '0',
  `quarter` int(10) unsigned NOT NULL default '0',
  `year` int(10) unsigned NOT NULL default '0',
  `hour` int(10) unsigned NOT NULL default '0',
  `minutes` int(10) unsigned NOT NULL default '0',
  `seconds` int(10) unsigned NOT NULL default '0',
  `unix_date_time` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`id`)
) TYPE=InnoDB;

DROP TABLE IF EXISTS `desktopweatherwarehouse`.`events`;
CREATE TABLE `desktopweatherwarehouse`.`events` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `customer_id` int(10) unsigned NOT NULL default '0',
  `action_id` int(10) unsigned NOT NULL default '0',
  `event_date_id` int(10) unsigned NOT NULL default '0',
  `cobrand_id` int(10) unsigned NOT NULL default '0',
  `monetization` double unsigned NOT NULL default '0',
  PRIMARY KEY (`id`)
) TYPE=InnoDB;

DROP TABLE IF EXISTS `desktopweatherwarehouse`.`location_master`;
CREATE TABLE `desktopweatherwarehouse`.`location_master` (
  `twc_loc_id` varchar(10) NOT NULL default '',
  `loc_type` decimal(6,0) NOT NULL default '0',
  `country_cd` char(2) NOT NULL default '',
  `city_nm` varchar(45) NOT NULL default '',
  `time_zone_cd` varchar(6) default NULL,
  `locale_site_cd` varchar(5) default NULL,
  `st_cd` varchar(4) NOT NULL default '',
  `present_nm` varchar(45) default NULL,
  `coop_id` varchar(10) NOT NULL default '',
  `latitude` decimal(10,6) NOT NULL default '0.000000',
  `longitude` decimal(10,6) NOT NULL default '0.000000',
  `obs_stn` varchar(6) NOT NULL default '',
  `prim_tecci` varchar(10) default NULL,
  `gmt_diff` decimal(6,2) NOT NULL default '0.00',
  `loc_rad` varchar(6) default NULL,
  `reg_rad` varchar(6) default NULL,
  `reg_sat` varchar(6) default NULL,
  `cnty_id` varchar(7) default NULL,
  `zone_id` varchar(8) default NULL,
  `cnty_fips` varchar(5) default NULL,
  `active` decimal(6,0) NOT NULL default '0',
  `dst_ind` char(1) default NULL,
  `dst_active` char(1) default NULL,
  `dma_cd` char(3) default NULL,
  `climo_stn` varchar(10) default NULL,
  `dst_offset` decimal(6,2) default NULL,
  `time_zone_nm` varchar(35) default NULL,
  `time_zone_abbrv` varchar(6) default NULL,
  `sec_obs_stn` varchar(6) default NULL,
  `tert_obs_stn` varchar(6) default NULL,
  `zip_2_loc_id` varchar(10) default NULL,
  `elevation` decimal(6,0) default NULL,
  `closeup_rad` varchar(20) default NULL,
  `metro_rad` varchar(20) default NULL,
  `ultra_rad` varchar(20) default NULL,
  `ss_rad` varchar(20) default NULL,
  `ls_rad` varchar(20) default NULL,
  `garden_id` varchar(10) default NULL,
  `index_id` varchar(4) default NULL,
  `upd_type` varchar(10) default NULL,
  `sec_tecci` varchar(10) default NULL,
  `tert_tecci` varchar(10) default NULL,
  `tecci_enabled` decimal(1,0) default NULL,
  PRIMARY KEY (`twc_loc_id`,`loc_type`,`country_cd`),
  KEY `location_master_zip_idx` (`zip_2_loc_id`)
) TYPE=MyISAM;


5. Appendix B

load_actions.rb

load 'Event.rb'
load 'DesktopAction.rb'
load 'Brand.rb'
load 'EventDate.rb'
load 'Customer.rb'
load 'Datawarehouse.rb'
require 'rubygems'
require 'active_record'
require 'fileutils'

#Example line:
#80.191.152.5^1129694400^/0/38?id=1010425242&cobrand=weather&instby=weather1&rnd=19568

# Connect to MySQL database
database_spec = {
  :adapter  => 'mysql',
  :host     => '127.0.0.1',
  :database => 'desktopweatherwarehouse',
  :username => 'root',
  :password => ''
}

# connect to the database
ActiveRecord::Base.establish_connection database_spec

# Find the event_dates row for this timestamp, creating it if necessary.
def retrieve_date_time_id(date_time_structure)
  existing_date = EventDate.find_by_unix_date_time(date_time_structure.unix_date_time)

  if not existing_date
    existing_date = EventDate.new
    #Time.gm(year [, month, day, hour, min, sec, usec])
    existing_date.mysql_date_time = Time.gm(date_time_structure.year,
                                            date_time_structure.month,
                                            date_time_structure.day_number_in_month,
                                            date_time_structure.hour,
                                            date_time_structure.minutes,
                                            date_time_structure.seconds)
    existing_date.day_of_week = date_time_structure.day_of_week
    existing_date.day_number_in_month = date_time_structure.day_number_in_month
    existing_date.day_number_in_year = date_time_structure.day_number_in_year
    existing_date.month = date_time_structure.month
    existing_date.quarter = date_time_structure.quarter
    existing_date.year = date_time_structure.year
    existing_date.hour = date_time_structure.hour
    existing_date.minutes = date_time_structure.minutes
    existing_date.seconds = date_time_structure.seconds
    existing_date.unix_date_time = date_time_structure.unix_date_time
    existing_date.save
  end

  return existing_date.id
end

# Find the customer by install id, creating the record if necessary.
def retrieve_customer_id(install_id, true_action_id, query_params_hash)
  current_customer = Customer.find_by_install_id(install_id)

  if not current_customer
    current_customer = Customer.new
    current_customer.install_id = query_params_hash['id']
    if true_action_id.to_i == 27
      current_customer.age = query_params_hash['dob']
      current_customer.gender = query_params_hash['gender']
      #current_customer.location = query_params_hash['']
    end
    current_customer.save
  end

  return current_customer.id
end

# Find the brand by cobrand and instby, creating the record if necessary.
def retrieve_brand_id(cobrand, instby)
  brand = Brand.find_by_cobrand_instby(cobrand, instby)

  if not brand
    brand = Brand.new
    brand.cobrand = cobrand
    brand.instby = instby
    brand.save
  end

  return brand.id
end

#Preload translation table for action to true_action_id

#Record counter
counter = 1

#Cycle the directory
directory_name = "C:/Junk/WeatherChannel/archive_files"
archive_directory_name = "C:/Junk/WeatherChannel/processed_files/"

Dir.foreach(directory_name) do |current_file|
  if current_file != "." and current_file != ".."
    file_name = directory_name + "/" + current_file
  else
    next
  end

  #Open file
  puts("Processing " + file_name)
  File.open(file_name, "r") do |file|
    #Cycle rows: for each row
    file.each_line() do |line|
      puts "Record Number " + counter.to_s
      counter = counter + 1

      begin
        #Create the event record.
        event = Event.new

        #Split on ^
        splits_on_carrat = line.split('^')

        #Evaluate the date and time.
        unix_date_time_info_part = splits_on_carrat[1]
        date_time_structure = Datawarehouse.create_via_unix_timestamp(unix_date_time_info_part)
        event.event_date_id = retrieve_date_time_id(date_time_structure)

        other_event_info_part = splits_on_carrat[2]

        #Split on /
        splits_on_forward_slash = other_event_info_part.split('/')
        action_id_info = splits_on_forward_slash[2]

        #Split on the ?.
        puts(action_id_info)
        splits_on_question_mark = action_id_info.split('?')

        #Evaluate the action id
        true_action_id = splits_on_question_mark[0]
        action = DesktopAction.find_by_true_action_id(true_action_id)

        if not action
          next
        end

        event.action_id = action.id
        event.monetization = action.monetization

        #Split on & and build a hash of the results.
        event_parameters = splits_on_question_mark[1].split('&')
        action_parameters = Hash.new

        event_parameters.each do |parameter|
          key_value = parameter.split('=')
          action_parameters[key_value[0]] = key_value[1]
        end

        #Evaluate action parameters

        #Evaluate customer id
        install_id = action_parameters['id']
        event.customer_id = retrieve_customer_id(install_id.to_i, true_action_id, action_parameters)

        #Evaluate cobrand and instby
        cobrand = action_parameters['cobrand']
        if not cobrand
          cobrand = "none"
        end

        instby = action_parameters['instby']
        if not instby
          instby = cobrand
        end

        event.cobrand_id = retrieve_brand_id(cobrand, instby)

        #Monetization for partner installs. true_action_id is parsed as a
        #string, so compare its integer value.
        if true_action_id.to_i == 0
          case cobrand
          when 'freeze'
            event.monetization = -0.15
          when 'real'
            event.monetization = -0.6
          when 'netscape'
            event.monetization = -0.33
          end
        end

        event.save
      rescue
        print "An error occurred: ", $!, "\n"
        next
      end
    end
  end

  FileUtils.mv file_name, archive_directory_name + current_file
end



Brand.rb

require 'rubygems'
require 'active_record'

class Brand < ActiveRecord::Base
  def self.find_by_cobrand_instby(cobrand, instby)
    find( :first, :conditions => ["cobrand = ? and instby = ?", cobrand, instby])
  end
end


Customer.rb

require 'rubygems'
require 'active_record'

class Customer < ActiveRecord::Base
  def self.find_by_install_id(install_id)
    find( :first, :conditions => ["install_id = ?", install_id])
  end
end


Datawarehouse.rb

# require AR
require 'rubygems'

module Datawarehouse

  def Datawarehouse.translate_month_to_quarter(month)
    if (1..3).include?(month)
      return 1
    elsif (4..6).include?(month)
      return 2
    elsif (7..9).include?(month)
      return 3
    elsif (10..12).include?(month)
      return 4
    end
  end

  # Break a Unix timestamp into the fields the event_dates dimension needs.
  def Datawarehouse.create_via_unix_timestamp(unix_date_time_stamp)
    date_time = Time.at(unix_date_time_stamp.to_i)

    structure = DateTimeStructure.new
    structure.unix_date_time = unix_date_time_stamp
    structure.day_of_week = date_time.wday
    structure.day_number_in_month = date_time.day
    structure.day_number_in_year = date_time.yday
    structure.month = date_time.month
    structure.quarter = translate_month_to_quarter(date_time.month)
    structure.year = date_time.year
    structure.seconds = date_time.sec
    structure.minutes = date_time.min
    structure.hour = date_time.hour

    return structure
  end
end

class DateTimeStructure
  attr_accessor(:day_of_week,
                :day_number_in_month,
                :day_number_in_year,
                :month,
                :quarter,
                :year,
                :seconds,
                :minutes,
                :hour,
                :unix_date_time)
end


DesktopAction.rb

require 'rubygems'
require 'active_record'

class DesktopAction < ActiveRecord::Base
  def self.find_by_true_action_id(id)
    find( :first, :conditions => ["true_action_id = ?", id])
  end
end


Event.rb

# require AR
require 'rubygems'
require 'active_record'

class Event < ActiveRecord::Base
end


load_actions.rb

require 'DesktopAction.rb'
require 'rubygems'
require 'active_record'

# Connect to MySQL database
database_spec = {
  :adapter  => 'mysql',
  :host     => '127.0.0.1',
  :database => 'desktopweatherwarehouse',
  :username => 'root',
  :password => ''
}

# connect to the database
ActiveRecord::Base.establish_connection database_spec

# Preload the translation table mapping each action to its true_action_id.
File.open("C:/Junk/WeatherChannel/rawfiles/tagconfig.csv", "r") do |file|
  file.each_line() do |line|
    tokens = line.split(',')
    true_action_id = tokens[1].to_i
    existing_action = DesktopAction.find_by_true_action_id(true_action_id)
    if(!existing_action)
      new_action = DesktopAction.new
      new_action.description = tokens[3]
      new_action.monetization = tokens[5].to_f
      new_action.true_action_id = true_action_id
      new_action.save
    end
  end
end




6. References

[1] Kimball, R. and Ross, M. 2002. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 2nd ed. John Wiley & Sons, Inc.

[2] Imhoff, C., Geiger, J. G., and Galemmo, N. 2003. Relational Modeling and Data Warehouse Design. John Wiley & Sons, Inc.

[3] Joshi, K. P., Joshi, A., Yesha, Y., and Krishnapuram, R. 1999. Warehousing and mining Web logs. In Proceedings of the 2nd International Workshop on Web Information and Data Management (Kansas City, Missouri, United States, November 02-06, 1999). C. Shahabi, Ed. WIDM '99. ACM Press, New York, NY, 63-68. DOI= http://doi.acm.org/10.1145/319759.319792

[4] Sammon, D. and Finnegan, P. 2000. The ten commandments of data warehousing. SIGMIS Database 31, 4 (Sep. 2000), 82-91. DOI= http://doi.acm.org/10.1145/506760.506767

[5] Srivastava, J., Cooley, R., Deshpande, M., and Tan, P. 2000. Web usage mining: discovery and applications of usage patterns from Web data. SIGKDD Explor. Newsl. 1, 2 (Jan. 2000), 12-23. DOI= http://doi.acm.org/10.1145/846183.846188

[6] Wang, Q., Makaroff, D. J., and Edwards, H. K. 2004. Characterizing customer groups for an e-commerce website. In Proceedings of the 5th ACM Conference on Electronic Commerce (New York, NY, USA, May 17-20, 2004). EC '04. ACM Press, New York, NY, 218-227. DOI= http://doi.acm.org/10.1145/988772.988805

[7] Spiliopoulou, M. 2000. Web usage mining for Web site evaluation. Commun. ACM 43, 8 (Aug. 2000), 127-134. DOI= http://doi.acm.org/10.1145/345124.345167

[8] Joshi, K. P., Joshi, A., Yesha, Y., and Krishnapuram, R. 1999. Warehousing and mining Web logs. In Proceedings of the 2nd International Workshop on Web Information and Data Management (Kansas City, Missouri, United States, November 02-06, 1999). C. Shahabi, Ed. WIDM '99. ACM Press, New York, NY, 63-68. DOI= http://doi.acm.org/10.1145/319759.319792

[9] http://www.linuxjournal.com/article/7922

[10] http://www.ailab.si/orange

[11] http://en.wikipedia.org/wiki/Cluster_Analysis

[12] http://www.cs.waikato.ac.nz/~ml/weka/index.html