Demographics and Weblog

Uploaded by desertcockatoo in Data Management

20 Nov 2013

Demographics and Weblog Hackathon



Case Study

5.3% of Motley Fool visitors are subscribers.
Design a classification model for insight into which variables are important for strategies to increase the subscription rate.
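
A small illustration of why the 5.3% base rate shapes the modelling problem: a model that labels every visitor "not a subscriber" is already right about 94.7% of the time, so plain accuracy says little here. This is a sketch in Python, not the hackathon's R code; the population size is a hypothetical round number, and only the 5.3% rate comes from the slide.

```python
# Hypothetical population size; only the 5.3% subscriber rate is from the slide.
n_visitors = 100_000
subscriber_rate = 0.053

n_subscribers = round(n_visitors * subscriber_rate)
n_non_subscribers = n_visitors - n_subscribers

# A trivial model that predicts "not a subscriber" for everyone is correct
# on every non-subscriber and wrong on every subscriber.
baseline_accuracy = n_non_subscribers / n_visitors
subscribers_found = 0  # the trivial model never flags anyone

print(f"subscribers: {n_subscribers}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"subscribers found by baseline: {subscribers_found}")
```

High accuracy with zero subscribers found is the reason a ranking measure such as the ROC score is a better yardstick for this case study.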

Learn by Doing

http://www.meetup.com/HandsOnProgrammingEvents/


Data Mining Hackathon

Funded by Rapleaf

With Motley Fool’s data

App note for Rapleaf/Motley Fool

Template for other hackathons

Did not use AWS. R on individual PCs

Logistics: Rapleaf funded prizes and food for 2 weekends for ~20-50 people. Venue was free

Getting more subscribers

Headline Data, Weblog

Demographics

Cleaning Data


training.csv (201,000), headlines.tsv (811MB), entry.tsv (100k), demographics.tsv
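
A minimal sketch of the kind of cleaning step this slide implies: read a comma-separated training file and a tab-separated demographics file, make missing fields explicit, and left-join them on a user key. It is written in Python rather than the R used at the hackathon. The column names (`user_id`, `subscribe`, `age`, `income`) and all values are hypothetical stand-ins, since the slides do not show the real schemas.

```python
import csv
import io

# Tiny in-memory stand-ins for training.csv and demographics.tsv.
# All column names and values here are hypothetical.
training_csv = io.StringIO(
    "user_id,subscribe\n"
    "u1,0\n"
    "u2,1\n"
    "u3,0\n"
)
demographics_tsv = io.StringIO(
    "user_id\tage\tincome\n"
    "u1\t34\t\n"
    "u3\t\t75000\n"
)


def clean(value):
    """Turn empty strings into None so missing fields are explicit."""
    value = value.strip()
    return value if value else None


# Index the demographics rows by user id.
demographics = {}
for row in csv.DictReader(demographics_tsv, delimiter="\t"):
    demographics[row["user_id"]] = {
        key: clean(value) for key, value in row.items() if key != "user_id"
    }

# Left join: keep every training row, even without a demographics match.
no_match = {"age": None, "income": None}
joined = []
for row in csv.DictReader(training_csv):
    merged = {"user_id": row["user_id"], "subscribe": int(row["subscribe"])}
    merged.update(demographics.get(row["user_id"], no_match))
    joined.append(merged)

for record in joined:
    print(record)
```

Keeping unmatched users with explicit `None` values, instead of dropping them, preserves the training rows when demographic coverage is patchy.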


Feature Engineering


Github:

Ensemble Methods


Bagging, Boosting, randomForests


Overfitting


Stability (small changes in the data make large changes in predictions)


Previously, none of these worked at scale


Small-scale results using R; large-scale versions exist in proprietary implementations (Google, Amazon, etc.)
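
To make the bagging and stability points concrete, here is a minimal sketch in Python (not the R code used at the hackathon). It fits one-split trees ("stumps") on bootstrap resamples and averages them, which is the core of bagging. The synthetic data, the 0.8/0.2 response rates, and the function names `fit_stump` and `fit_bagged` are all assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a "stump") on a single feature."""
    overall_mean = sum(ys) / len(ys)
    best = None  # (squared error, threshold, left mean, right mean)
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the overall mean
        return lambda x: overall_mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_models, rng):
    """Average stumps fitted on bootstrap resamples of the data."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(model(x) for model in models) / len(models)


# Synthetic data: the response rate jumps from 0.2 to 0.8 at x = 0.5.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [1 if rng.random() < (0.8 if x > 0.5 else 0.2) else 0 for x in xs]

single = fit_stump(xs, ys)
bagged = fit_bagged(xs, ys, n_models=25, rng=rng)
for x in (0.1, 0.45, 0.55, 0.9):
    print(x, round(single(x), 3), round(bagged(x), 3))
```

A single stump jumps abruptly at whatever threshold its sample happens to favour. The bagged average moves more gradually near the boundary, because each resample places the split slightly differently.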


ROC Curves

Binary Classifier Only!

Paid Subscriber ROC curve, ~61%
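
As a reminder of what an ROC score measures for a binary classifier, the sketch below computes the area under the ROC curve as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half. The labels and scores are made-up toy values, and `roc_auc` is an illustrative helper, not a function from any library used at the hackathon.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = paid subscriber, 0 = non-subscriber (values are made up).
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

print(round(roc_auc(labels, scores), 4))
print(roc_auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))  # constant scores behave like random
```

The pairwise form is quadratic in the number of rows, so it suits small checks only; at the scale of the training file a rank-based computation would be the practical choice.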

Boosted Regression Trees Performance


training data ROC score = 0.745


cv ROC score = 0.737 ; se = 0.002


5.5% below the winning score, without doing any data processing


Random is 50%, or 0.50. We are at 0.737, so 0.737 - 0.50 = 0.237: better than random by 23.7 percentage points
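
The arithmetic behind these figures, as a short Python sketch. The three scores and the standard error are taken from the slide; the two-standard-error band and the reading of the train/CV gap as a rough overfitting check are added interpretation, not claims from the deck.

```python
train_auc = 0.745   # training data ROC score (from the slide)
cv_auc = 0.737      # cross-validated ROC score (from the slide)
cv_se = 0.002       # standard error of the CV score (from the slide)
random_auc = 0.50   # a random ranking

# Improvement over random, in ROC-score points.
lift_over_random = cv_auc - random_auc

# Rough band of two standard errors around the CV score.
cv_band = (cv_auc - 2 * cv_se, cv_auc + 2 * cv_se)

# Gap between training and CV scores; a small gap suggests limited overfitting.
train_cv_gap = train_auc - cv_auc

print(f"lift over random: {lift_over_random:.3f}")
print(f"CV band: {cv_band[0]:.3f} to {cv_band[1]:.3f}")
print(f"train minus CV: {train_cv_gap:.3f}")
```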


Contribution of predictor variables

Predictive Importance


Friedman: the number of times a variable is selected for splitting, weighted by the squared-error improvement each split brings to the model. A measure of sparsity in the data


Fit plots remove averages of model variables


Rank  Variable  Importance
1     pageV     74.0567852
2     loc       11.0801383
3     income     4.1565597
4     age        3.1426519
5     residlen   3.0813927
6     home       2.3308287
7     marital    0.6560258
8     sex        0.6476549
9     prop       0.3817017
10    child      0.2632598
11    own        0.2030012
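
A quick check on the importance figures, sketched in Python: the eleven values appear to be scaled so that they sum to 100, which means each one can be read as a percentage share. The numbers are copied from the slide; the cumulative-share printout is only an illustration.

```python
# Importance values copied from the slide, in rank order.
importance = [
    ("pageV", 74.0567852),
    ("loc", 11.0801383),
    ("income", 4.1565597),
    ("age", 3.1426519),
    ("residlen", 3.0813927),
    ("home", 2.3308287),
    ("marital", 0.6560258),
    ("sex", 0.6476549),
    ("prop", 0.3817017),
    ("child", 0.2632598),
    ("own", 0.2030012),
]

total = sum(value for _, value in importance)
print(f"total: {total:.4f}")

# Cumulative share shows how quickly the top variables dominate.
running = 0.0
cumulative = []
for name, value in importance:
    running += value
    cumulative.append((name, running))
    print(f"{name:<9}{value:>10.4f}{running:>10.2f}")
```

The first two variables together account for a little over 85 of the 100 points, with pageV alone at about 74.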


Behavioral vs. Demographics




Demographics are sparse



Behavioral weblogs are the best source. Most sites aren’t using this information correctly. There is no single correct answer: trial and error on features. The features are more important than the algorithm.


Linear vs. Nonlinear

Fitted Values (Crappy)

Fitted Values Better

Predictor Variable Interaction


Adjusting variable interactions

Variable Interactions

Plot Interactions: age, loc

Trees vs. other methods


Multiple levels are visible, which is good for trees. Do other variables match this? Simplify the model or add more features. Iterate to a better model.


No Math. Analyst

Number of Trees

Data Set Number of Trees

Hackathon Results

Weblogs only: 68.15%, about 18 percentage points better than random

Demographics add 1%

AWS Advantages


Running multiple instances with different
algorithms and parameters using R


Add tutorial, install Screen, R GUI bugs


http://amazonlabs.pbworks.com/w/page/28036646/FrontPage
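
One way to picture "multiple instances with different algorithms and parameters" is a parameter grid split across machines. The Python sketch below builds a grid over the three tuning parameters the conclusion names (tc, lr, number of trees) and deals the combinations out round-robin. The candidate values and the instance count are hypothetical, and nothing here launches real AWS instances.

```python
from itertools import product

# Hypothetical candidate values for the three tuning parameters.
tree_complexities = [1, 3, 5]          # tc
learning_rates = [0.1, 0.01, 0.001]    # lr
tree_counts = [500, 1000, 2000]        # number of trees

# Every combination of the three parameters.
grid = [
    {"tc": tc, "lr": lr, "n_trees": n_trees}
    for tc, lr, n_trees in product(tree_complexities, learning_rates, tree_counts)
]

# Deal the combinations out round-robin to a hypothetical number of instances.
n_instances = 4
assignments = {i: grid[i::n_instances] for i in range(n_instances)}

print(f"{len(grid)} parameter combinations")
for instance, jobs in assignments.items():
    print(f"instance {instance}: {len(jobs)} runs, first = {jobs[0]}")
```

Each instance would then fit its share of models independently and report a cross-validated score, so the runs need no coordination beyond collecting results at the end.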



Conclusion




Data mining at scale requires more development in visualization, MapReduce (MR) algorithms, and MR data preprocessing.


Tuning using visualization. Tune 3 parameters: tc, lr, #trees. Didn’t cover 2/3.


This isn’t reproducible in Hadoop/Mahout or any open source code I know of.


Other use cases, e.g. predicting which item will sell (eBay), search engine ranking.


Careful with MR paradigms: Hadoop MR != Couchbase MR