1

Feature Discovery in the Context of Educational Data Mining:

An Inductive Approach

Andrew Arnold, Joseph E. Beck and Richard Scheines

Machine Learning Department

Carnegie Mellon University

July 6, 2006


2

Contributions


Formulation of feature discovery problem in
educational data mining domain



Introduction and evaluation of algorithm that:


Discovers useful, complex features


Incorporates prior knowledge


Promotes the scientific process


Balances between predictiveness and interpretability


3

Outline


The Big Problem


Examples


Why is it hard?


An Investigation


Details of Our Experimental Environment


Lessons Learned from Investigational Experiments


Our Solution


Algorithm


Results


Conclusions and Next Steps


4

Problem: Features (Conceptual)


Many spaces are too complex to deal with


Features are ways of simplifying these spaces by adding useful structure


Domain: raw data → features


Vision: pixels → 3-d objects


Speech: frequencies + waves → phonemes


Chess: board layout → king is protected / exposed


5

Problem: Features (Example)


A poker hand consists of 5 cards drawn from a deck of
52 unique cards. This is the raw data.


This yields (52 choose 5) = 2,598,960 unique hands


onePair is a possible feature of this space.


There are 1,098,240 different ways to have exactly one pair


Thus, with a single feature, we have reduced the size of our
space by over 40%


But onePair is only one of many, many possible features:


twoPair (123,552), fullHouse (3,744), fourOfaKind (624)


Not all features are useful; most are not:


3spades, oneAce_and_OneNine, 3primeCards


Need an efficient way to find useful features
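
A quick combinatorial check of these counts (a minimal Python sketch; the formulas are standard poker counting and are not part of the original slide):

    from math import comb

    total = comb(52, 5)                                               # 2,598,960 unique hands
    one_pair = comb(13, 1) * comb(4, 2) * comb(12, 3) * 4**3          # 1,098,240
    two_pair = comb(13, 2) * comb(4, 2)**2 * comb(11, 1) * 4          # 123,552
    full_house = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)  # 3,744
    four_of_a_kind = comb(13, 1) * comb(48, 1)                        # 624

    print(f"onePair selects {one_pair / total:.1%} of the hand space")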

6

Problem: Models (Example)


Given features, still need a model


For poker, the model is simple because it is explicit:


Features are ranked. Better features → better chance of winning


Educational example:


Given features SATmathScore, preTestScore, and curiosity


Want to predict finalExamScore




[Diagram: SATmathScore, preTestScore, Curiosity → Model (linear regression, neural net, etc.) → finalExamScore]

7

Problem: Operationalizing Features

But how to operationalize the feature curiosity?


Each possible mapping of raw data into curiosity (e.g., curiosity_1, curiosity_2, curiosity_3) increases the space of models to search.




[Diagram: three copies of the model above, one per operationalization of curiosity (Curiosity_1, Curiosity_2, Curiosity_3, etc.), each feeding a model (linear regression, neural net, etc.) that predicts finalExamScore]

8

Our Research Problem


[Diagram: raw data → candidate operationalizations Curiosity_1, Curiosity_2, Curiosity_3, ..., each combined with SATmathScore and preTestScore in a model (linear regression, neural net, etc.) to predict finalExamScore]

9

Details of Our Environment I

Data & Course


On-line course teaching causal reasoning skills


consists of sixteen modules, about an hour per module


The course was instrumented to record certain events:


Logins, page requests, self assessments, quiz attempts, logouts


Each event was associated with certain attributes:


time


student-id


session-id

10

What We’d Like to Be Able to Do


Raw data


At 17:51:23 student jd555 requested page causality_17


At 17:51:41 student rp22 began simulation sim_3


At 17:51:47 student ap29 finished quiz_17 with score 82%


Feature


Student jd555 is five times as curious as ap29



Model


For every 5% increase in curiosity, student quiz performance increases by 3%.


11

Details of Our Environment II

Models & Experiments


Wanted to find features associated with
engagement and learning.


For engagement, used the amount of time students
spent looking at pages in a module.


For learning, looked at quiz scores.


For all experiments, only looked at linear
regression models.

12

Lesson 1: Obvious Ideas Don’t
Always Work


To measure engagement, examined the amount of
time a user spent on a page


To predict this time, used three features:


student: mean time this student spent per page


session: mean time spent on pages during this session


page: mean time spent by all students on this page


Which features would you guess would be most
significant for predicting time spent on a page?

13

Lesson 1: Obvious Ideas Don’t
Always Work


To measure engagement, examined the amount of
time a user spent on a page


To predict this time, used three features:


student: mean time this student spent per page


session: mean time spent on pages during this session


page: mean time spent by all students on this page


Which features would you guess would be most
significant for predicting time spent on a page?


Our belief was: page > student >> session
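
As an aside, the three candidate features are just per-group means of page-view time. A minimal pandas sketch of how such features can be computed (the log schema and numbers below are invented for illustration; the slide does not give the course's actual format):

    import pandas as pd

    # Hypothetical page-view log: one row per (student, session, page) visit.
    views = pd.DataFrame({
        "student_id": ["jd555", "jd555", "rp22", "ap29"],
        "session_id": ["s1", "s1", "s2", "s3"],
        "page_id":    ["causality_17", "causality_18", "causality_17", "causality_17"],
        "seconds":    [42.0, 95.0, 61.0, 30.0],
    })

    # Per-group mean time, attached back to every row as a candidate feature.
    views["student_mean"] = views.groupby("student_id")["seconds"].transform("mean")
    views["session_mean"] = views.groupby("session_id")["seconds"].transform("mean")
    views["page_mean"]    = views.groupby("page_id")["seconds"].transform("mean")
    print(views)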

14

Turns Out Session Trumps User


In fact, given session, student had no effect.


R-squared of a linear model using:


student                    =  4.8%


page                       = 16.6%


session                    = 19.9%


session + student          = 19.9%


session + page             = 31.4%


session + page + student   = 31.5%
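
The comparison above is ordinary least-squares regression fit on each feature subset, with R-squared read from the fitted model. A minimal sketch of that evaluation pattern (the data below is synthetic, so its R-squared values will not match the percentages on this slide):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 500
    # Synthetic stand-ins for the three aggregate features and the observed time.
    student = rng.normal(size=n)
    session = rng.normal(size=n)
    page = rng.normal(size=n)
    time_on_page = 0.2 * student + 0.6 * session + 0.5 * page + rng.normal(size=n)

    subsets = {
        "student": [student],
        "session": [session],
        "session + student": [session, student],
        "session + page + student": [session, page, student],
    }
    for name, cols in subsets.items():
        X = np.column_stack(cols)
        r2 = LinearRegression().fit(X, time_on_page).score(X, time_on_page)
        print(f"{name:<26s} R^2 = {r2:.3f}")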

15

Lesson 2: Small Differences in
Features Can Have Big Impact


self_assessments measures the number of optional self-assessment questions a student attempted.


How well would this feature predict learning?


To measure this, we needed an outcome feature that
measured performance


Our idea was to look at quiz scores.


But what, exactly, is a quiz score?


Students can take a quiz up to three times in a module.


Should we look at the maximum of these scores? The mean?

16

Only Max Score Mattered


Max is significant, but mean is not.


Yet max and mean are both encompassed by the term “quiz score”


Researchers should not be expected to make such fine distinctions

[Plots: quiz score vs. self_assessments (normed); mean score p-value = .504, max score p-value = .036]

17

Automation


Given these lessons, how can we automate
the process?


Enumeration


Costly, curse of dimensionality


Principal component analysis, kernels


Hard to interpret

18

Challenges


Defining and searching feature space


Expressive enough to discover new features



Constraining and biasing


Avoid nonsensical or hard to interpret features

19

Algorithm


Start with small set of core features



Iteratively grow and prune this set


Increase predictiveness


Preserve scientific and semantic interpretability
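
A minimal sketch of this grow-and-prune loop in Python (grow, score, and model_r2 stand in for the candidate generation, feature scoring, and model fitting described on the following slides; this is an illustration, not the authors' implementation):

    def discover_features(core_features, data, grow, score, model_r2, k=10, max_iters=5):
        """Iteratively grow and prune a feature set (sketch).

        grow(features)           -> new candidate features (predicate + calculator combinations)
        score(feature)           -> blend of predictiveness and interpretability
        model_r2(features, data) -> cross-validated fit of the simple model
        """
        features = list(core_features)
        best = model_r2(features, data)
        for _ in range(max_iters):
            candidates = features + grow(features)
            features = sorted(candidates, key=score, reverse=True)[:k]   # prune to the k best
            current = model_r2(features, data)
            if current <= best:          # stop when held-out performance stops improving
                return features
            best = current
        return features                  # or stop at the hard cap on iterations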

20

Architecture

21

Experiment


Can we predict a student's quiz score using features that are:


Automatically discovered


Complex


Predictive


Interpretable


22

Raw Data

NAME           DESCRIPTION


User_id        (Nominal) Unique user identifier


Module_id      (Nominal) Unique module identifier


Assess_quiz    (Ordinal) Number of self-assessment quizzes taken by this user in this module


Assess_quest   (Ordinal) Number of self-assessment questions taken by this user in this module. Each self-assessment quiz contains multiple self-assessment questions.


Quiz_score     (Ordinal) (Dependent variable) % of quiz questions answered correctly by this student in this module. In each module, students were given the chance to take the quiz up to three times. The max of these trials was taken to be quiz_score.

23

Sample Data

User_id   Module_id   Assess_quiz   Assess_quest   Quiz_score


Alice     module_1    12            27             86


Bob       module_1    14            31             74


Alice     module_2    18            35             92


Bob       module_2    13            25             87

24

Predicates


A logical statement applied to each row of
data


Selects subset of data which satisfies it



Examples:


User_id = Alice


Module_id = 1

25

Calculators


A function applied to a subset of data


Calculated over a certain field in the data


Incorporates bias and prior knowledge:


E.g. Timing effects are on log scale


Examples:


Mean(Assess_quiz)


Log(Quiz_score)
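
To make the two building blocks concrete, a minimal Python sketch over the sample rows from the earlier table (an illustration of the definitions, not the authors' implementation):

    from statistics import mean

    rows = [
        {"User_id": "Alice", "Module_id": "module_1", "Assess_quiz": 12, "Assess_quest": 27, "Quiz_score": 86},
        {"User_id": "Bob",   "Module_id": "module_1", "Assess_quiz": 14, "Assess_quest": 31, "Quiz_score": 74},
        {"User_id": "Alice", "Module_id": "module_2", "Assess_quiz": 18, "Assess_quest": 35, "Quiz_score": 92},
        {"User_id": "Bob",   "Module_id": "module_2", "Assess_quiz": 13, "Assess_quest": 25, "Quiz_score": 87},
    ]

    # A predicate selects the rows that satisfy it ...
    is_alice = lambda row: row["User_id"] == "Alice"

    # ... and a calculator is a function computed over one field of that subset.
    mean_assess_quiz = lambda subset: mean(r["Assess_quiz"] for r in subset)

    alice_rows = [r for r in rows if is_alice(r)]
    print(mean_assess_quiz(alice_rows))   # 15.0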

26

Candidate Features


Predicate + Calculator = New Feature


Predicate: User_id = Alice, User_id = Bob


Calculator: Mean(Assess_quiz)


Feature: Mean assess quizzes for each user


X1: User_id   X2: Module_id   X3: Assess_quiz   F: Mean Assess Quiz   Y: Quiz Score


Alice         module_1        12                13                    86


Bob           module_1        14                13                    74


Alice         module_2        18                15.5                  92


Bob           module_2        13                15.5                  87

27

Models


Complexity is in feature space


Put complicated features in simple model



Linear and logistic regression





28

Scoring & Pruning I


Exhaustive search impractical


Partition predicates and calculators semantically


Allows independent, greedy search


Fast Correlation-Based Filtering [Yu 2003]


Prevents unlikely features:




mean_social_security_number


Select b best from each category, and pool

29

Scoring & Pruning II


Features graded on:


Predictiveness: R²


Interpretability: Heuristics based on experts and literature


Depth of nesting


Assigned “interpretability score” of predicates and calculators


E.g. Sum more interpretable than SquareRoot



Select k best features to continue


k regularizes run-time, memory and depth of search
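
A minimal sketch of the two selection passes just described (the Candidate fields, the equal weighting of predictiveness and interpretability, and the default b and k are assumptions, not the paper's exact heuristics):

    from collections import defaultdict
    from typing import NamedTuple

    class Candidate(NamedTuple):
        name: str
        category: str   # semantic partition of its predicates/calculators
        r2: float       # predictiveness
        interp: float   # interpretability heuristic (higher = easier to read)

    def combined_score(f: Candidate, alpha: float = 0.5) -> float:
        # Blend predictiveness with interpretability.
        return alpha * f.r2 + (1 - alpha) * f.interp

    def prune(candidates: list[Candidate], b: int = 3, k: int = 10) -> list[Candidate]:
        # Pass 1: keep the b best candidates within each semantic category, then pool.
        by_category: dict[str, list[Candidate]] = defaultdict(list)
        for f in candidates:
            by_category[f.category].append(f)
        pooled = [f for group in by_category.values()
                  for f in sorted(group, key=combined_score, reverse=True)[:b]]
        # Pass 2: keep the k best overall to seed the next iteration.
        return sorted(pooled, key=combined_score, reverse=True)[:k]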


30

Iteration & Stopping Criteria


Full model is evaluated after each iteration


Stopping conditions:


Cross-validation performance


Hard cap on processor time or iterations


If met:


Stop and return discovered features and model


If not:


Iterate again, using current features as seeds for next step

31

Results


Two main goals:


Machine Learning:


Discover features predictive of student performance



Scientific Discovery:


Discover interpretable features incorporating prior
scientific knowledge

32

Machine Learning


24 students, 15 modules, 203 quiz scores:


Predict:
quiz_score


Given initial features:


user_id, assess_quiz, assess_quest


Learn features and regression coefficients on
training data, test on held out data


38% improvement in R² of discovered features over baseline regression on initial features


33

Summary of Features

34

Scientific Discovery


Interpretation of features and model


mean_assess_quizzes_per_user


Introspectiveness of student


Intuitively negatively correlated with quiz score


Less mastery → insecurity → self-assessment → poor quiz


Mean_assess_quest_per_user should be similarly correlated


In fact, regression coefficients have opposite signs


Discovered features reaffirm certain intuitions and
contradict others

35

Generality


Applied same framework to entirely
different data and domain:


Effect of tutor interventions on reading
comprehension



Achieved similarly significant results with
no substantial changes to algorithm




36

Limitations


Looked at small number of initial features


To increase feature capacity:


Better partition of features, predicates, calculators


Less greedy search


More expressive, biased interpretability scores


E.g. Time of day and day of week:

Doing homework on Sunday night vs. Friday night


37

Better & Faster Search


Want to discover more complicated features


Search more broadly:


Prune fewer features


Search more deeply:


Run more iterations


Decomposable feature scores:


Reuse computation


Smoother feature space parameterization:


Efficient, gradient-like search

38

Conclusions


Algorithm discovers useful, complex features


Elucidates underlying structure


Hides complexity


Promotes scientific process


Tests hypotheses and suggests new experiments


Incorporates prior scientific knowledge [Pazzani 2001]


Results are interpretable and explainable, and still predictive


Balances between predictiveness and interpretability


Careful definition and partitioning of feature space


Search balances biased, score-based pruning with exploration

39

References


Arnold, A., Beck, J. E., and Scheines, R. (2006). Feature Discovery in the Context of Educational Data Mining: An Inductive Approach. In Proceedings of the AAAI 2006 Workshop on Educational Data Mining, Boston, MA.


Pazzani, M. J., Mani, S., and Shankle, W. R. (2001). Acceptance of Rules Generated by Machine Learning among Medical Experts. Methods of Information in Medicine, 40:380-385.


Yu, L. and Liu, H. (2003). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proceedings of the Twentieth International Conference on Machine Learning, 856-863.




Thank You


Questions?