1

Feature Discovery in the Context of Educational Data Mining: An Inductive Approach

Andrew Arnold, Joseph E. Beck and Richard Scheines

Machine Learning Department

Carnegie Mellon University

July 6, 2006

2

Contributions

Formulation of feature discovery problem in
educational data mining domain

Introduction and evaluation of algorithm that:

Discovers useful, complex features

Incorporates prior knowledge

Promotes the scientific process

Balances between predictiveness and interpretability

3

Outline

The Big Problem

Examples

Why is it hard?

An Investigation

Details of Our Experimental Environment

Lessons Learned from Investigational Experiments

Our Solution

Algorithm

Results

Conclusions and Next Steps

4

Problem: Features (Conceptual)

Many spaces are too complex to deal with

Features are ways of simplifying these spaces

Domain: raw data → features

Vision: pixels → 3-d objects

Speech: frequencies + waves → phonemes

Chess: board layout → king is protected / exposed

5

Problem: Features (Example)

A poker hand consists of 5 cards drawn from a deck of
52 unique cards. This is the raw data.

This yields (52 choose 5) = 2,598,960 unique hands

onePair is a possible feature of this space

There are 1,098,240 different ways to have exactly one pair

Thus, with a single feature, we have reduced the size of our space by over 40%

But onePair is only one of many, many possible features:

twoPair (123,552), fullHouse (3,744), fourOfaKind (624)

Not all features are useful; most are not:

oneAce_and_OneNine, 3primeCards

Need an efficient way to find useful features
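As a sanity check, the counts above follow from basic combinatorics; a minimal Python sketch, using only the standard library:

```python
from math import comb

# Total number of distinct 5-card hands from a 52-card deck
total = comb(52, 5)  # 2,598,960

# Exactly one pair: pick the paired rank and its two suits, then
# three other distinct ranks, each in any of the four suits
one_pair = comb(13, 1) * comb(4, 2) * comb(12, 3) * 4**3  # 1,098,240

print(one_pair / total)  # ~0.42, i.e. the feature covers over 40% of hands
```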

6

Problem: Models (Example)

Given features, still need a model

For poker, the model is simple because it is explicit:

Features are ranked. Better features → better chance of winning

Educational example:

Given features SATmathScore, preTestScore, and curiosity

Want to predict finalExamScore

[Diagram: SATmathScore, preTestScore, curiosity → Model (linear regression, neural net, etc.) → finalExamScore]

7

Problem: Operationalizing Features

But how to operationalize the feature curiosity?

Each possible mapping of raw data into curiosity (e.g., curiosity_1, curiosity_2, curiosity_3) increases the space of models to search.

[Diagram, repeated once per operationalization: SATmathScore, preTestScore, curiosity_i → Model (linear regression, neural net, etc.) → finalExamScore, for curiosity_1, curiosity_2, curiosity_3, etc.]

8

Our Research Problem

[Diagram: raw data feeding the same three model diagrams, one per operationalization of curiosity (curiosity_1, curiosity_2, curiosity_3)]

9

Details of Our Environment I

Data & Course

On-line course teaching causal reasoning skills

Consists of sixteen modules, about an hour per module

The course was tooled to record certain events:

Logins, page requests, self assessments, quiz attempts, logouts

Each event was associated with certain attributes:

time, student-id, session-id

10

What We’d Like to Be Able to Do

Raw data:

At 17:51:23 student jd555 requested page causality_17

At 17:51:41 student rp22 began simulation sim_3

At 17:51:47 student ap29 finished quiz_17 with score 82%

Feature:

Student jd555 is five times as curious as ap29

Model:

For every 5% increase in curiosity, student quiz performance increases by 3%.

11

Details of Our Environment II

Models & Experiments

Wanted to find features associated with
engagement and learning.

For engagement, used the amount of time students
spent looking at pages in a module.

For learning, looked at quiz scores.

For all experiments, only looked at linear
regression models.

12

Lesson 1: Obvious Ideas Don’t
Always Work

To measure engagement, examined the amount of
time a user spent on a page

To predict this time, used three features:

student: mean time this student spent per page

session: mean time spent on pages during this session

page: mean time spent by all students on this page

Which features would you guess would be most
significant for predicting time spent on a page?

13

Lesson 1: Obvious Ideas Don’t
Always Work

To measure engagement, examined the amount of
time a user spent on a page

To predict this time, used three features:

student: mean time this student spent per page

session: mean time spent on pages during this session

page: mean time spent by all students on this page

Which features would you guess would be most
significant for predicting time spent on a page?

Our belief was: page > student >> session

14

Turns Out Session Trumps User

In fact, given session, student adds nothing:

R-squared of a linear model using:

student                  =  4.8%
page                     = 16.6%
session                  = 19.9%
session + student        = 19.9%
session + page           = 31.4%
session + page + student = 31.5%
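A minimal sketch of how such a nested comparison could be run; the synthetic DataFrame below is only a stand-in for the real page-view data (one row per page view, with the three precomputed mean-time features and the observed time on page):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the page-view data (illustrative only)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "student": rng.normal(60, 10, n),
    "session": rng.normal(60, 15, n),
    "page":    rng.normal(60, 20, n),
})
df["time_on_page"] = (0.1 * df["student"] + 0.5 * df["session"]
                      + 0.4 * df["page"] + rng.normal(0, 10, n))

def r_squared(cols):
    # R^2 of an ordinary linear regression on the chosen feature subset
    X, y = df[list(cols)], df["time_on_page"]
    return LinearRegression().fit(X, y).score(X, y)

for k in (1, 2, 3):
    for subset in combinations(["student", "session", "page"], k):
        print(subset, round(r_squared(subset), 3))
```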

15

Lesson 2: Small Differences in
Features Can Have Big Impact

self_assessments measures the number of optional self-assessment questions a student attempted.

How well would this feature predict learning?

To measure this, we needed an outcome feature that
measured performance

Our idea was to look at quiz scores.

But what, exactly, is a quiz score?

Students can take a quiz up to three times in a module.

Should we look at the maximum of these scores? The mean?

16

Only Max Score Mattered

Max is significant (p-value = .036), but mean is not (p-value = .504).

Yet max and mean are both encompassed by the term "quiz score"

Researchers should not be expected to make such fine distinctions

[Two plots of quiz score vs. self_assessments (normed), one per operationalization]

17

Automation

Given these lessons, how can we automate the process?

Enumeration: costly, curse of dimensionality

Principal component analysis, kernels: hard to interpret

18

Challenges

Defining and searching feature space

Expressive enough to discover new features

Constraining and biasing

Avoid nonsensical or hard-to-interpret features

19

Algorithm

Iteratively grow and prune a set of candidate features

Increase predictiveness

Preserve scientific and semantic interpretability

20

Architecture

21

Experiment

Can we predict a student's quiz score using features that are:

Automatically discovered

Complex

Predictive

Interpretable

22

Raw Data

NAME: DESCRIPTION

User_id: (Nominal) Unique user identifier

Module_id: (Nominal) Unique module identifier

Assess_quiz: (Ordinal) Number of self-assessment quizzes taken by this user in this module

Assess_quest: (Ordinal) Number of self-assessment questions taken by this user in this module. Each self-assessment quiz contains multiple self-assessment questions.

Quiz_score: (Ordinal) (Dependent variable) % of quiz questions answered correctly by this student in this module. In each module, students were given the chance to take the quiz up to three times. The max of these trials was taken to be quiz_score.

23

Sample Data

User_id   Module_id   Assess_quiz   Assess_quest   Quiz_score
Alice     module_1    12            27             86
Bob       module_1    14            31             74
Alice     module_2    18            35             92
Bob       module_2    13            25             87

24

Predicates

A logical statement applied to each row of data

Selects subset of data which satisfies it

Examples:

User_id = Alice

Module_id = module_1
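A minimal sketch of predicates as row filters, over hypothetical dict-shaped rows mirroring the sample data:

```python
# One dict per row of the sample data
rows = [
    {"user_id": "Alice", "module_id": "module_1", "assess_quiz": 12},
    {"user_id": "Bob",   "module_id": "module_1", "assess_quiz": 14},
    {"user_id": "Alice", "module_id": "module_2", "assess_quiz": 18},
    {"user_id": "Bob",   "module_id": "module_2", "assess_quiz": 13},
]

def is_alice(row):  # predicate: User_id = Alice
    return row["user_id"] == "Alice"

subset = [r for r in rows if is_alice(r)]  # the rows satisfying it
```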

25

Calculators

A function applied to a subset of data

Calculated over a certain field in the data

Incorporates bias and prior knowledge:

E.g. Timing effects are on log scale

Examples:

Mean(Assess_quiz)

Log(Quiz_score)
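And a matching sketch of calculators as field-wise aggregates, with a log-scale variant as one way prior knowledge about timing could be encoded; the time_on_page field is illustrative, not from the table above:

```python
import math
from statistics import mean

def calculate(rows, field, fn):
    # Apply an aggregate function to one field over a subset of rows
    return fn([row[field] for row in rows])

subset = [  # e.g. the rows selected by the predicate User_id = Alice
    {"assess_quiz": 12, "time_on_page": 45.0},
    {"assess_quiz": 18, "time_on_page": 120.0},
]

print(calculate(subset, "assess_quiz", mean))           # Mean(Assess_quiz)
print(calculate(subset, "time_on_page",                 # log-scale timing
                lambda xs: mean(math.log(x) for x in xs)))
```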

26

Candidate Features

Predicate + Calculator = New Feature

Predicate: User_id = Alice, User_id = Bob

Calculator: Mean(Assess_quiz)

Feature: Mean assess quizzes for each user

X1: User_id   X2: Module_id   X3: Assess_quiz   F: Mean_Assess_quiz   Y: Quiz_score
Alice         module_1        12                15                    86
Bob           module_1        14                13.5                  74
Alice         module_2        18                15                    92
Bob           module_2        13                13.5                  87
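In pandas terms, this predicate + calculator composition is a groupby/transform; a minimal sketch over the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id":     ["Alice", "Bob", "Alice", "Bob"],
    "module_id":   ["module_1", "module_1", "module_2", "module_2"],
    "assess_quiz": [12, 14, 18, 13],
    "quiz_score":  [86, 74, 92, 87],
})

# Predicates: group rows by User_id. Calculator: Mean(Assess_quiz).
# transform() broadcasts each group's mean back onto its rows.
df["mean_assess_quiz_per_user"] = (
    df.groupby("user_id")["assess_quiz"].transform("mean")
)
print(df)  # Alice rows -> 15.0, Bob rows -> 13.5
```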

27

Models

Complexity is in feature space

Put complicated features in simple model

Linear and logistic regression

28

Scoring & Pruning I

Exhaustive search impractical

Partition predicates and calculators semantically

Allows independent, greedy search

Fast Correlation-Based Filtering [Yu 2003]

Prevents unlikely features:

mean_social_security_number

Select b best from each category, and pool

29

Scoring & Pruning II

Predictiveness: R²

Interpretability: Heuristics based on experts and literature

Depth of nesting

Assigned "interpretability score" of predicates and calculators

E.g. Sum more interpretable than SquareRoot

Select k best features to continue

k regularizes run-time, memory and depth of search
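One way such a blended score could look; a sketch under stated assumptions (the interpretability table, the depth penalty, and the weight alpha are all illustrative, not the authors' actual heuristics):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative interpretability scores for calculators
INTERPRETABILITY = {"Sum": 1.0, "Mean": 0.9, "Log": 0.6, "SquareRoot": 0.4}

def feature_score(x, y, calculator, depth, alpha=0.7):
    # Predictiveness: R^2 of a one-feature linear model
    X = np.asarray(x).reshape(-1, 1)
    r2 = LinearRegression().fit(X, y).score(X, y)
    # Interpretability: calculator score, discounted by nesting depth
    interp = INTERPRETABILITY.get(calculator, 0.5) / (1 + depth)
    return alpha * r2 + (1 - alpha) * interp

# Keep the k best candidates for the next iteration, e.g.:
# survivors = sorted(candidates, key=feature_score_of, reverse=True)[:k]
```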

30

Iteration & Stopping Criteria

Full model is evaluated after each iteration

Stopping conditions:

Cross-validation performance

Hard cap on processor time or iterations

If met:

Stop and return discovered features and model

If not:

Iterate again, using current features as seeds for next step
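Putting the loop together as described on these slides, a skeleton; the callables grow, score, and cv_score are placeholders for the steps above, not the authors' code:

```python
def discover_features(data, seed_features, grow, score, cv_score,
                      k=10, max_iters=20):
    """Iteratively grow and prune a candidate feature set (sketch)."""
    features, best_cv = list(seed_features), float("-inf")
    for _ in range(max_iters):
        candidates = features + grow(features)          # grow step
        features = sorted(candidates, key=score)[-k:]   # prune to k best
        cv = cv_score(data, features)                   # evaluate full model
        if cv <= best_cv:                               # no CV gain: stop
            break
        best_cv = cv
    return features                                     # discovered features
```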

31

Results

Two main goals:

Machine Learning:

Discover features predictive of student performance

Scientific Discovery:

Discover interpretable features incorporating prior
scientific knowledge

32

Machine Learning

24 students, 15 modules, 203 quiz scores:

Predict:
quiz_score

Given initial features: user_id, assess_quiz, assess_quest

Learn features and regression coefficients on
training data, test on held out data

38% improvement in R² of discovered features over baseline regression on initial features
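A sketch of that evaluation protocol; df, baseline_cols, and discovered_cols are hypothetical names, and the split ratio is an assumption:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def held_out_r2(df, cols, target="quiz_score", seed=0):
    # Fit on training rows, report R^2 on the held-out rows
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[cols], df[target], test_size=0.3, random_state=seed)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# gain = held_out_r2(df, discovered_cols) / held_out_r2(df, baseline_cols) - 1
# The slides report a 38% improvement for the discovered features.
```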

33

Summary of Features

34

Scientific Discovery

Interpretation of features and model

mean_assess_quizzes_per_user

Measures introspectiveness of student

Intuitively negatively correlated with quiz score:

less mastery → insecurity → self assessment → poor quiz

mean_assess_quest_per_user should be similarly correlated

In fact, regression coefficients have opposite signs

Discovered features reaffirm certain intuitions and challenge others

35

Generality

Applied same framework to entirely
different data and domain:

Effect of tutor interventions on reading
comprehension

Achieved similarly significant results with
no substantial changes to algorithm

36

Limitations

Looked at small number of initial features

To increase feature capacity:

Better partition of features, predicates, calculators

Less greedy search

More expressive, biased interpretability scores

E.g. Time of day and day of week:

Doing homework on Sunday night vs. Friday night

37

Better & Faster Search

Want to discover more complicated features

Prune fewer features

Search more deeply:

Run more iterations

Decomposable feature scores:

Reuse computation

Smoother feature space parameterization: …-like search

38

Conclusions

Algorithm discovers useful, complex features

Elucidates underlying structure

Hides complexity

Promotes scientific process

Tests hypotheses and suggests new experiments

Incorporates prior scientific knowledge [Pazzani 2001]

Results are interpretable and explainable, and still predictive

Balances between predictiveness and interpretability

Careful definition and partitioning of feature space

Search balances biased, score-based pruning with exploration

39

References

Arnold, A., Beck, J. E., and Scheines, R. (2006). Feature Discovery in the Context of Educational Data Mining: An Inductive Approach. In Proceedings of the AAAI2006 Workshop on Educational Data Mining, Boston, MA.

Pazzani, M. J., Mani, S., and Shankle, W. R. (2001). Acceptance of Rules Generated by Machine Learning among Medical Experts. Methods of Information in Medicine, 40:380-385.

Yu, L. and Liu, H. (2003). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proceedings of the Twentieth International Conference on Machine Learning, 856-863.

Thank You

Questions?