1
Feature Discovery in the Context of
Educational Data Mining:
An Inductive Approach
Andrew Arnold, Joseph E. Beck and Richard Scheines
Machine Learning Department
Carnegie Mellon University
July 6, 2006
2
Contributions
• Formulation of feature discovery problem in educational data mining domain
• Introduction and evaluation of algorithm that:
  – Discovers useful, complex features
  – Incorporates prior knowledge
  – Promotes the scientific process
  – Balances between predictiveness and interpretability
3
Outline
• The Big Problem
  – Examples
  – Why is it hard?
• An Investigation
  – Details of Our Experimental Environment
  – Lessons Learned from Investigational Experiments
• Our Solution
  – Algorithm
  – Results
• Conclusions and Next Steps
4
Problem: Features (Conceptual)
• Many spaces are too complex to deal with
• Features are ways of simplifying these spaces by adding useful structure
  – Domain: raw data → features
  – Vision: pixels → 3-d objects
  – Speech: frequencies + waves → phonemes
  – Chess: board layout → king is protected / exposed
5
Problem: Features (Example)
• A poker hand consists of 5 cards drawn from a deck of 52 unique cards. This is the raw data.
  – This yields (52 choose 5) = 2,598,960 unique hands
• onePair is a possible feature of this space.
  – There are 1,098,240 different ways to have exactly one pair
  – Thus a single feature collapses over 40% of the space (about 42% of all hands) into one class
• But onePair is only one of many, many possible features:
  – twoPair (123,552), fullHouse (3,744), fourOfaKind (624)
• Most features are not useful:
  – 3spades, oneAce_and_OneNine, 3primeCards
• Need an efficient way to find useful features
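The hand counts quoted on this slide can be checked with a few lines of combinatorics. This is a minimal sketch; the variable names are illustrative and do not come from the talk.

```python
from math import comb

# Number of distinct 5-card hands from a 52-card deck.
total_hands = comb(52, 5)

# Exactly one pair: choose the paired rank and 2 of its 4 suits, then
# 3 other distinct ranks with one suit each.
one_pair = 13 * comb(4, 2) * comb(12, 3) * 4**3

# Exactly two pairs: two paired ranks, 2 suits for each, then one card
# from the remaining 11 ranks in any of 4 suits.
two_pair = comb(13, 2) * comb(4, 2) ** 2 * 11 * 4

# Full house: rank and suits of the triple, then rank and suits of the pair.
full_house = 13 * comb(4, 3) * 12 * comb(4, 2)

# Four of a kind: the quad rank plus any of the other 48 cards.
four_of_a_kind = 13 * 48

# Fraction of all hands that the onePair feature groups together.
one_pair_share = one_pair / total_hands

print(total_hands, one_pair, two_pair, full_house, four_of_a_kind)
print(f"onePair share of all hands: {one_pair_share:.1%}")
```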
6
Problem: Models (Example)
• Given features, still need a model
  – For poker, the model is simple because it is explicit:
    • Features are ranked. Better features → better chance of winning
• Educational example:
  – Given features SATmathScore, preTestScore, and curiosity
  – Want to predict finalExamScore
[Diagram: SATmathScore, preTestScore, Curiosity → Model (linear regression, neural net, etc.) → finalExamScore]
7
Problem: Operationalizing Features
But how to operationalize the feature curiosity?
• Each possible mapping of raw data into curiosity (e.g., curiosity_1, curiosity_2, curiosity_3) increases the space of models to search.
[Diagram, repeated for Curiosity_1, Curiosity_2, Curiosity_3, etc.: SATmathScore, preTestScore, Curiosity_i → Model (linear regression, neural net, etc.) → finalExamScore]
8
Our Research Problem
[Diagram: the three candidate models (SATmathScore, preTestScore, and Curiosity_1, Curiosity_2, or Curiosity_3 → Model (linear regression, neural net, etc.) → finalExamScore), shown together with Raw data]
9
Details of Our Environment I
Data & Course
• On-line course teaching causal reasoning skills
  – Consists of sixteen modules, about an hour per module
• The course was tooled to record certain events:
  – Logins, page requests, self assessments, quiz attempts, logouts
• Each event was associated with certain attributes:
  – time
  – student-id
  – session-id
10
What We’d Like to Be Able to Do
• Raw data
  – At 17:51:23 student jd555 requested page causality_17
  – At 17:51:41 student rp22 began simulation sim_3
  – At 17:51:47 student ap29 finished quiz_17 with score 82%
• Feature
  – Student jd555 is five times as curious as ap29
• Model
  – For every 5% increase in curiosity, student quiz performance increases by 3%.
11
Details of Our Environment II
Models & Experiments
• Wanted to find features associated with engagement and learning.
• For engagement, used the amount of time students spent looking at pages in a module.
• For learning, looked at quiz scores.
• For all experiments, only looked at linear regression models.
12
Lesson 1: Obvious Ideas Don’t Always Work
• To measure engagement, examined the amount of time a user spent on a page
• To predict this time, used three features:
  – student: mean time this student spent per page
  – session: mean time spent on pages during this session
  – page: mean time spent by all students on this page
• Which features would you guess would be most significant for predicting time spent on a page?
  – Our belief was: page > student >> session
14
Turns Out Session Trumps User
• In fact, given session, student had no effect.
• R-squared of a linear model using:
  – student = 4.8%
  – page = 16.6%
  – session = 19.9%
  – session + student = 19.9%
  – session + page = 31.4%
  – session + page + student = 31.5%
15
Lesson 2: Small Differences in Features Can Have Big Impact
• self_assessments measures the number of optional self-assessment questions a student attempted.
• How well would this feature predict learning?
• To measure this, we needed an outcome feature that measured performance
• Our idea was to look at quiz scores.
• But what, exactly, is a quiz score?
  – Students can take a quiz up to three times in a module.
  – Should we look at the maximum of these scores? The mean?
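The two readings of "quiz score" are easy to state side by side. This is a minimal sketch with made-up attempt scores; the function names are illustrative, not from the study.

```python
from statistics import mean

# Hypothetical quiz attempts (percent correct) for one student in one
# module; the course allowed up to three attempts per module.
attempts = [55.0, 70.0, 90.0]


def quiz_score_max(scores):
    """Operationalize 'quiz score' as the best attempt."""
    return max(scores)


def quiz_score_mean(scores):
    """Operationalize 'quiz score' as the average attempt."""
    return mean(scores)


print(quiz_score_max(attempts), quiz_score_mean(attempts))
```

The same raw attempts yield two different outcome variables, which is why the choice has to be made explicitly.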
16
Only Max Score Mattered
• Max is significant, but mean is not.
• Yet max and mean are both encompassed by the term “quiz score”
  – Researchers should not be expected to make such fine distinctions
[Figure: two panels plotting Score vs. Self_assessments (normed); p-value: .504 (mean score) and p-value: .036 (max score)]
17
Automation
• Given these lessons, how can we automate the process?
  – Enumeration
    • Costly, curse of dimensionality
  – Principal component analysis, kernels
    • Hard to interpret
18
Challenges
• Defining and searching feature space
  – Expressive enough to discover new features
• Constraining and biasing
  – Avoid nonsensical or hard-to-interpret features
19
Algorithm
• Start with small set of core features
• Iteratively grow and prune this set
  – Increase predictiveness
  – Preserve scientific and semantic interpretability
20
Architecture
21
Experiment
• Can we predict a student’s quiz score using features that are:
  – Automatically discovered
  – Complex
  – Predictive
  – Interpretable
22
Raw Data
NAME           DESCRIPTION
User_id        (Nominal) Unique user identifier
Module_id      (Nominal) Unique module identifier
Assess_quiz    (Ordinal) Number of self-assessment quizzes taken by this user in this module
Assess_quest   (Ordinal) Number of self-assessment questions taken by this user in this module. Each self-assessment quiz contains multiple self-assessment questions.
Quiz_score     (Ordinal) (Dependent variable) % of quiz questions answered correctly by this student in this module. In each module, students were given the chance to take the quiz up to three times. The max of these trials was taken to be quiz_score.
23
Sample Data
User_id   Module_id   Assess_quiz   Assess_quest   Quiz_score
Alice     module_1    12            27             86
Bob       module_1    14            31             74
Alice     module_2    18            35             92
Bob       module_2    13            25             87
24
Predicates
• A logical statement applied to each row of data
  – Selects subset of data which satisfies it
• Examples:
  – User_id = Alice
  – Module_id = module_1
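One way to picture a predicate is as a row-level test plus a filter. The sketch below uses the rows from the Sample Data slide; `make_predicate` and `select` are hypothetical helper names, not the authors' code.

```python
# Rows copied from the Sample Data slide.
ROWS = [
    {"user_id": "Alice", "module_id": "module_1",
     "assess_quiz": 12, "assess_quest": 27, "quiz_score": 86},
    {"user_id": "Bob", "module_id": "module_1",
     "assess_quiz": 14, "assess_quest": 31, "quiz_score": 74},
    {"user_id": "Alice", "module_id": "module_2",
     "assess_quiz": 18, "assess_quest": 35, "quiz_score": 92},
    {"user_id": "Bob", "module_id": "module_2",
     "assess_quiz": 13, "assess_quest": 25, "quiz_score": 87},
]


def make_predicate(field, value):
    """Return a row-level test: does row[field] equal value?"""
    def predicate(row):
        return row[field] == value
    return predicate


def select(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


is_alice = make_predicate("user_id", "Alice")
alice_rows = select(ROWS, is_alice)
print([row["module_id"] for row in alice_rows])
```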
25
Calculators
• A function applied to a subset of data
  – Calculated over a certain field in the data
• Incorporates bias and prior knowledge:
  – E.g. timing effects are on log scale
• Examples:
  – Mean(Assess_quiz)
  – Log(Quiz_score)
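A calculator can be sketched as a function that takes a subset of rows and computes over one field. The helper names `mean_of` and `log_of` are assumptions for illustration, and the two rows are Alice's from the Sample Data slide.

```python
import math
from statistics import mean


def mean_of(field):
    """Calculator: mean of one field over a subset of rows."""
    def calc(rows):
        return mean(row[field] for row in rows)
    return calc


def log_of(field):
    """Calculator: natural log of one field, row by row.

    Encodes the prior belief that an effect lives on a log scale.
    """
    def calc(rows):
        return [math.log(row[field]) for row in rows]
    return calc


# Alice's two rows from the Sample Data slide.
subset = [
    {"assess_quiz": 12, "quiz_score": 86},
    {"assess_quiz": 18, "quiz_score": 92},
]

print(mean_of("assess_quiz")(subset))
print(log_of("quiz_score")(subset))
```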
26
Candidate Features
• Predicate + Calculator = New Feature
  – Predicate: User_id = Alice, User_id = Bob
  – Calculator: Mean(Assess_quiz)
  – Feature: Mean assess quizzes for each user
X1: User_id   X2: Module_id   X3: Assess_quiz   F: Mean Assess_quiz   Y: Quiz_score
Alice         module_1        12                15                    86
Bob           module_1        14                13.5                  74
Alice         module_2        18                15                    92
Bob           module_2        13                13.5                  87
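Combining a family of predicates (one per user) with a calculator (mean) amounts to a group-wise aggregate written back onto every row. This is a minimal sketch on the sample rows; `build_feature` is a hypothetical name, not part of the authors' system.

```python
from statistics import mean

# Sample rows: user, module, self-assessment quizzes taken, quiz score.
ROWS = [
    {"user_id": "Alice", "module_id": "module_1", "assess_quiz": 12, "quiz_score": 86},
    {"user_id": "Bob", "module_id": "module_1", "assess_quiz": 14, "quiz_score": 74},
    {"user_id": "Alice", "module_id": "module_2", "assess_quiz": 18, "quiz_score": 92},
    {"user_id": "Bob", "module_id": "module_2", "assess_quiz": 13, "quiz_score": 87},
]


def build_feature(rows, group_field, value_field):
    """For each row, the mean of value_field over rows sharing its group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_field], []).append(row[value_field])
    group_means = {key: mean(values) for key, values in groups.items()}
    return [group_means[row[group_field]] for row in rows]


# Predicate family "User_id = <each user>" + calculator Mean(Assess_quiz).
mean_assess_quiz_per_user = build_feature(ROWS, "user_id", "assess_quiz")
print(mean_assess_quiz_per_user)
```

Passing `"module_id"` as `group_field` instead gives the per-module mean, a different candidate feature built from the same calculator.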
27
Models
• Complexity is in feature space
  – Put complicated features in simple model
• Linear and logistic regression
28
Scoring & Pruning I
• Exhaustive search impractical
• Partition predicates and calculators semantically
  – Allows independent, greedy search
• Fast Correlation-Based Filtering [Yu 2003]
  – Prevents unlikely features: mean_social_security_number
• Select b best from each category, and pool
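The "select b best from each category, and pool" step might look roughly like the sketch below. It ranks features by plain absolute Pearson correlation with the target, which is a simple stand-in and not the Fast Correlation-Based Filter itself; the category names and the `constant` and `row_index` features are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_and_pool(categories, target, b):
    """Keep the b highest-|r| features in each category, then pool them."""
    pooled = []
    for features in categories.values():
        ranked = sorted(
            features,
            key=lambda name: abs(pearson(features[name], target)),
            reverse=True,
        )
        pooled.extend(ranked[:b])
    return pooled


# Quiz scores from the Sample Data slide; candidate features are illustrative.
TARGET = [86, 74, 92, 87]
CATEGORIES = {
    "per_user": {
        "mean_assess_quiz_per_user": [15, 13.5, 15, 13.5],
        "constant": [1, 1, 1, 1],
    },
    "per_module": {
        "mean_assess_quiz_per_module": [13, 13, 15.5, 15.5],
        "row_index": [0, 1, 2, 3],
    },
}

print(select_and_pool(CATEGORIES, TARGET, b=1))
```

Because each category is ranked on its own, the search within one category never needs to look at the others.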
29
Scoring & Pruning II
• Features graded on:
  – Predictiveness: R^2
  – Interpretability: Heuristics based on experts and literature
    • Depth of nesting
    • Assigned “interpretability score” of predicates and calculators
      – E.g. Sum more interpretable than SquareRoot
• Select k best features to continue
  – k regularizes run-time, memory and depth of search
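One possible way to combine the two grades into a single ranking is a weighted sum. The slides do not give the actual scoring formula, so everything in this sketch is an assumption: the interpretability values, the product rule for nesting, the `weight` parameter, and the candidate R^2 numbers are all hypothetical.

```python
# Hypothetical interpretability scores in [0, 1]. The slides only state
# that Sum is more interpretable than SquareRoot.
CALCULATOR_INTERPRETABILITY = {
    "Sum": 0.9,
    "Mean": 0.8,
    "Log": 0.6,
    "SquareRoot": 0.4,
}


def interpretability(calculators):
    """Product of per-calculator scores, so deeper nesting scores lower."""
    score = 1.0
    for name in calculators:
        score *= CALCULATOR_INTERPRETABILITY[name]
    return score


def combined_score(r_squared, calculators, weight=0.5):
    """Weighted mix of predictiveness (R^2) and interpretability."""
    return weight * r_squared + (1 - weight) * interpretability(calculators)


def select_k_best(candidates, k, weight=0.5):
    """candidates: (name, r_squared, calculators) tuples; keep the top k."""
    ranked = sorted(
        candidates,
        key=lambda c: combined_score(c[1], c[2], weight),
        reverse=True,
    )
    return [name for name, _, _ in ranked[:k]]


# Made-up candidates: the nested feature predicts best but is hardest to read.
CANDIDATES = [
    ("sum_assess_quest", 0.20, ["Sum"]),
    ("sqrt_log_assess_quest", 0.30, ["SquareRoot", "Log"]),
    ("mean_assess_quiz", 0.25, ["Mean"]),
]

print(select_k_best(CANDIDATES, k=2))
```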
30
Iteration & Stopping Criteria
• Full model is evaluated after each iteration
• Stopping conditions:
  – Cross-validation performance
  – Hard cap on processor time or iterations
• If met:
  – Stop and return discovered features and model
• If not:
  – Iterate again, using current features as seeds for next step
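The control flow on this slide can be sketched as a loop over caller-supplied `grow`, `prune`, and `evaluate` steps. The signature, the `min_gain` threshold, and the toy stand-in functions are assumptions for illustration; in particular `toy_evaluate` only imitates a cross-validated score.

```python
def discover_features(seeds, grow, prune, evaluate,
                      max_iterations=10, min_gain=1e-3):
    """Grow-and-prune loop.

    Stops when the evaluation score stops improving by at least min_gain
    or when the iteration cap is reached.
    """
    features = list(seeds)
    best_score = evaluate(features)
    for _ in range(max_iterations):
        candidates = prune(grow(features))
        score = evaluate(candidates)
        if score - best_score < min_gain:
            break  # no meaningful improvement: keep the previous features
        features, best_score = candidates, score
    return features, best_score


def toy_grow(features):
    """Wrap every current feature in one more calculator."""
    return features + [f"mean({f})" for f in features]


def toy_prune(candidates, k=3):
    """Keep the k shortest (least nested) distinct candidates."""
    return sorted(set(candidates), key=len)[:k]


def toy_evaluate(features):
    """Stand-in for a cross-validated score that saturates at 3 features."""
    return min(len(features), 3) / 3


found, score = discover_features(["assess_quiz"], toy_grow, toy_prune, toy_evaluate)
print(found, score)
```

The features accepted in one pass become the seeds for the next, matching the "If not" branch above.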
31
Results
• Two main goals:
  – Machine Learning:
    • Discover features predictive of student performance
  – Scientific Discovery:
    • Discover interpretable features incorporating prior scientific knowledge
32
Machine Learning
• 24 students, 15 modules, 203 quiz scores:
  – Predict: quiz_score
  – Given initial features:
    • user_id, assess_quiz, assess_quest
• Learn features and regression coefficients on training data, test on held out data
• 38% improvement in R^2 of discovered features over baseline regression on initial features
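For reference, held-out R^2 and a relative improvement between two models can be computed as in the sketch below. Whether the reported 38% is a relative or an absolute gain is not stated on the slide; the relative reading is an assumption here, and all numbers in the example are made up.

```python
def r_squared(actual, predicted):
    """Coefficient of determination on held-out data: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def relative_improvement(baseline, improved):
    """Fractional gain of one score over another."""
    return (improved - baseline) / baseline


# Illustrative held-out targets and predictions (not from the study).
held_out_actual = [2, 4, 6, 8]
held_out_predicted = [3, 3, 7, 7]

print(r_squared(held_out_actual, held_out_predicted))
print(relative_improvement(0.20, 0.25))
```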
33
Summary of Features
34
Scientific Discovery
• Interpretation of features and model
  – mean_assess_quizzes_per_user
    • Introspectiveness of student
    • Intuitively negatively correlated with quiz score
      – Less mastery → insecurity → self assessment → poor quiz
• mean_assess_quest_per_user should be similarly correlated
  – In fact, regression coefficients have opposite signs
• Discovered features reaffirm certain intuitions and contradict others
35
Generality
• Applied same framework to entirely different data and domain:
  – Effect of tutor interventions on reading comprehension
• Achieved similarly significant results with no substantial changes to algorithm
36
Limitations
• Looked at small number of initial features
• To increase feature capacity:
  – Better partition of features, predicates, calculators
  – Less greedy search
  – More expressive, biased interpretability scores
    • E.g. time of day and day of week: doing homework on Sunday night vs. Friday night
37
Better & Faster Search
• Want to discover more complicated features
  – Search more broadly:
    • Prune fewer features
  – Search more deeply:
    • Run more iterations
• Decomposable feature scores:
  – Reuse computation
• Smoother feature space parameterization:
  – Efficient, gradient-like search
38
Conclusions
• Algorithm discovers useful, complex features
  – Elucidates underlying structure
  – Hides complexity
• Promotes scientific process
  – Tests hypotheses and suggests new experiments
  – Incorporates prior scientific knowledge [Pazzani 2001]
  – Results are interpretable and explainable, and still predictive
• Balances between predictiveness and interpretability
  – Careful definition and partitioning of feature space
  – Search balances biased, score-based pruning with exploration
39
References
Arnold, A., Beck, J. E., Scheines, R. (2006). Feature Discovery in the Context of Educational Data Mining: An Inductive Approach. In Proceedings of the AAAI2006 Workshop on Educational Data Mining, Boston, MA.
Pazzani, M. J., Mani, S., Shankle, W. R. (2001). Acceptance of Rules Generated by Machine Learning among Medical Experts. Methods of Information in Medicine, 40:380-385.
Yu, L. and Liu, H. (2003). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proceedings of The Twentieth International Conference on Machine Learning, 856-863.
Thank You
Questions?