Advanced Methods and Analysis for
the Learning and Social Sciences

PSY505

Spring term, 2012

March 16, 2012

Today’s Class


Motif Extraction

Today…


We’re going to discuss a method that I’ve
never used



In fact, to the best of my knowledge it has
only been used once in EDM



It is a key method in bioinformatics, and I
think it has a lot of potential for EDM

Today…


You might ask: why are we discussing it?



It is a key method in bioinformatics, and I
think it has a lot of potential for EDM

Since…


It’s not a well-established method in EDM



We’ll focus on a single paper more than usual



And brainstorm together about how the method
might be applied more broadly to
educational problems


As well as other relevant problems in the social
sciences

Motif


Short, recurring pattern in a sequence of
categories occurring over time

Motif in Music


Short, recurring pattern of notes in a musical
composition


Motif in Music


What’s the motif?



http://www.youtube.com/watch?v=rRgXUFnfKIY



How many times does the motif occur?


Depends on how you define it, right?


And that’s part of the challenge…

Motif in Language


Short, recurring pattern of characters in a
sequence of characters occurring over time


Motif in Genetics


Short, recurring pattern of genes in a
sequence of genes occurring over time



Typically written as letters

Goal of Motif Extraction


Discern a common pattern of characters in a
large corpus of characters


The characters may vary slightly from case to
case

Can you find the motif?

UBSWWDFKLWPRHUC

JBDPXBDVEJVMBKK

VBDWNLROFVUBFFW

OWIFTIENDOXJXIOB

AUAAOOXZAABZSBT

MUAWSNTVZXSFHMI

LFQRKUTFRIENDOV

LOMTPOQHJVYYMFJ

LWGJMVPKYOZNMSA

RUPMFOHPVSPPVPT

BAZXVFTPQFQJVBM

HLPMOKUOXGRIENDO

INUSUNSGDAAICAV

XRZZWCDXOVZZJKQ

VOVCROMCJTOLXYU

HUVRYFREENDOBBGC

AQJBVXJCAJLEMAU

ONJORIFCGAUGIRN

PJGCHBDQIWJJTMQ

IQYQHKKBNBVDFPV

JJLHWPZAYZIGGEH

IGJZRMAAWJBESSS

JXZFRIEMDOVZRBJY

IRPWYIRJISLFVFF

How would you describe the motif?

UBSWWDFKLWPRHUC

JBDPXBDVEJVMBKK

VBDWNLROFVUBFFW

OWI[FTIENDO]XJXIOB

AUAAOOXZAABZSBT

MUAWSNTVZXSFHMI

LFQRKUT[FRIENDO]V

LOMTPOQHJVYYMFJ

LWGJMVPKYOZNMSA

RUPMFOHPVSPPVPT

BAZXVFTPQFQJVBM

HLPMOKUOX[GRIENDO]

INUSUNSGDAAICAV

XRZZWCDXOVZZJKQ

VOVCROMCJTOLXYU

HUVRY[FREENDO]BBGC

AQJBVXJCAJLEMAU

ONJORIFCGAUGIRN

PJGCHBDQIWJJTMQ

IQYQHKKBNBVDFPV

JJLHWPZAYZIGGEH

IGJZRMAAWJBESSS

JXZ[FRIEMDO]VZRBJY

IRPWYIRJISLFVFF
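
One way to describe the motif is as a consensus string plus the amount of variation around it. The sketch below takes the five motif occurrences from the strings above and does a position-by-position majority vote. The helper names `consensus` and `hamming` are illustrative assumptions, not part of any motif-finding library.

```python
from collections import Counter

# The five motif occurrences found in the strings above.
instances = ["FTIENDO", "FRIENDO", "GRIENDO", "FREENDO", "FRIEMDO"]

def consensus(words):
    """Return the most common character at each position (majority vote)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*words))

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

motif = consensus(instances)
distances = [hamming(motif, w) for w in instances]
print(motif, distances)
```

Each occurrence differs from the consensus in at most one position, which is the sense in which the characters "vary slightly from case to case".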

Finding motifs


Several algorithms



Finding motifs


Variant on PROJECTION algorithm (Tompa & Buhler, 2001)
used in (Shanabrook et al., 2010)


Only example of motif extraction in educational
data mining so far




Big idea


For each character string C that could be a
motif example (e.g. all character strings of
desired length)


Create a set of projections, random variations of C
that vary in one or more ways






Big idea


For each pair of strings C1 and C2, see how many
overlaps there are between their projection
matrices



Take the pair with the most matches and combine
into a motif


Creating a multi-example motif if 3+ get added together



Repeat until goal number of motifs is found, or
until new motif is below criterion goodness
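
The pairing step above can be sketched roughly as follows. This is a minimal illustration in the spirit of random projection, not the variant used in the paper: each candidate window is reduced to the characters at a few randomly chosen positions, and two windows "overlap" in a trial when those characters agree. The function names and the parameters `l`, `k`, `trials`, and `seed` are assumptions made for this example.

```python
import random
from collections import defaultdict
from itertools import combinations

def lmers(seqs, l):
    """Yield (sequence index, offset, substring) for every length-l window."""
    for s_idx, s in enumerate(seqs):
        for i in range(len(s) - l + 1):
            yield s_idx, i, s[i:i + l]

def collision_counts(seqs, l, k, trials, seed=0):
    """Count how often two windows from different sequences share a projection."""
    rng = random.Random(seed)
    windows = list(lmers(seqs, l))
    counts = defaultdict(int)
    for _ in range(trials):
        positions = sorted(rng.sample(range(l), k))  # random mask of k positions
        buckets = defaultdict(list)
        for w_idx, (_, _, w) in enumerate(windows):
            key = "".join(w[p] for p in positions)
            buckets[key].append(w_idx)
        for members in buckets.values():
            for a, b in combinations(members, 2):
                if windows[a][0] != windows[b][0]:  # ignore pairs within one sequence
                    counts[a, b] += 1
    return windows, counts

# The five strings from the earlier slide that contain the motif.
seqs = ["OWIFTIENDOXJXIOB", "LFQRKUTFRIENDOV", "HLPMOKUOXGRIENDO",
        "HUVRYFREENDOBBGC", "JXZFRIEMDOVZRBJY"]
windows, counts = collision_counts(seqs, l=7, k=4, trials=50)
(a, b), n = max(counts.items(), key=lambda kv: kv[1])
print(windows[a][2], windows[b][2], n)
```

The pair with the most collisions is the first candidate to combine into a motif. A fuller version would then merge further windows into it and stop when a goodness criterion is no longer met.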





Goodness


Typically, likelihood is used
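
As one possible reading of "likelihood", the sketch below scores a set of aligned motif occurrences by a log-likelihood ratio: how much more probable the characters are under a per-position frequency model than under a uniform background. The scoring function, the pseudocount, and the uniform background are assumptions for illustration; the source does not specify the exact measure.

```python
import math
import string
from collections import Counter

def log_likelihood_ratio(instances, alphabet=string.ascii_uppercase, pseudocount=1.0):
    """Sum of log(p_motif / p_background) over every character of every instance."""
    background = 1.0 / len(alphabet)
    denom = len(instances) + pseudocount * len(alphabet)
    total = 0.0
    for pos in range(len(instances[0])):
        column = Counter(w[pos] for w in instances)
        for w in instances:
            p = (column[w[pos]] + pseudocount) / denom  # smoothed position frequency
            total += math.log(p / background)
    return total

tight = ["FTIENDO", "FRIENDO", "GRIENDO", "FREENDO", "FRIEMDO"]
loose = ["UBSWWDF", "JBDPXBD", "VBDWNLR", "AUAAOOX", "MUAWSNT"]
tight_score = log_likelihood_ratio(tight)
loose_score = log_likelihood_ratio(loose)
print(tight_score > loose_score)
```

A well-conserved set of occurrences scores higher than an arbitrary set of windows, so a threshold on a score of this kind could serve as the stopping criterion.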




Motif in Education


Short, recurring pattern of behaviors in a
sequence of behaviors occurring over time



Written as letters in Shanabrook et al. (2010)

Detail for education


How do you segment student behavior?



Could use student’s interaction on an entire problem, and
compute letters across whole problem


Might make more sense in tutors with shorter problems (e.g.
ASSISTments)



Could use student’s interaction on an entire problem, and
define letters differently for context within
whole problem


Approach used by Shanabrook et al. (2010)



Could use “sliding window” of N actions
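
The sliding-window option can be sketched as below. The action labels are hypothetical; in practice they would come from the tutor's logs.

```python
def sliding_windows(actions, n):
    """Return every run of n consecutive actions, advancing one action at a time."""
    return [actions[i:i + n] for i in range(len(actions) - n + 1)]

# Hypothetical action labels for one student.
actions = ["hint", "attempt", "attempt", "hint", "correct"]
windows = sliding_windows(actions, 3)
for w in windows:
    print(w)
```

Unlike per-problem segmentation, windows overlap, so the same action contributes to up to N segments.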

Behaviors in Shanabrook et al.


“hints (a, b, c)

Hints is a measure of the number of hints viewed for this problem.
Although each problem has a maximum number of hints, the hint count
does not have an upper bound because students can repeat hints and
the count will increase at each repeated view. The three categories
for hints are: (a) no hints, meaning that the student did not use
the hint facility for that problem, (b) meaning the student used the
hint facility, but was not given the solution, and (c) last hint
solved, meaning that the student was given the solution to the
problem by the last hint. As described above, this metric combines
two values logged by the tutor: the count of hints seen, and an
indicator that the final hint giving the answer was seen. The data
could have been simply binned low, medium, high hints; however, this
would have missed the significance of zero hints and using hints to
reveal the problem solution.”
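
The hints binning described in the quote can be sketched as a small function. The argument names `hint_count` and `saw_final_hint` are assumptions standing in for the two logged values the quote mentions.

```python
def hints_letter(hint_count, saw_final_hint):
    """Map the two logged hint values to the letters a, b, c."""
    if saw_final_hint:
        return "c"  # last hint gave the student the solution
    if hint_count > 0:
        return "b"  # used the hint facility, solution not given
    return "a"      # no hints

print(hints_letter(0, False), hints_letter(3, False), hints_letter(5, True))
```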

Behaviors in Shanabrook et al.


“secFirst (d, e, f)

The seconds to first attempt is an important measure as it is during
this time that the student is reading the problem and formulating
their response. In previous research [6], five seconds was determined
to be a threshold for this metric representing gaming: students who
make a first attempt in less than five seconds are considered not
working on-task. We divide secFirst into three bins: (d) less than
5 sec, (e) 5 to 30 sec, (f) greater than 30 sec. (d) represents
students who are gaming the system, (e) represents a moderate time to
the first attempt, (f) represents a long time to the first attempt.
The cut at 30 seconds was chosen because it equalizes the
distribution of bins (e and f), representing a division between a
moderate and a long time to the first attempt.”
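
A minimal sketch of the secFirst binning follows. The quote does not say which bin a time of exactly 30 seconds falls into; placing it in (e) is an assumption here.

```python
def sec_first_letter(seconds):
    """Map seconds to first attempt to the letters d, e, f."""
    if seconds < 5:
        return "d"  # under 5 sec: treated as gaming
    if seconds <= 30:
        return "e"  # 5 to 30 sec (exactly 30 placed here by assumption)
    return "f"      # over 30 sec

print(sec_first_letter(2), sec_first_letter(12), sec_first_letter(45))
```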

Behaviors in Shanabrook et al.


“secOther (g, h, i, j, k)

This variable represents actions related to answering the problem
after the first attempt was made. While the first attempt includes
the problem reading and solution time, subsequent solution attempts
could be much quicker and the student could still be making good
effort. secOther is categorized in five bins: (g) skip, (h) solved on
first, (i) 0 to 1.2 sec, (j) 1.2 to 2.9 sec, (k) greater than 2.9
sec. First, there are two categorical bins, skip and solve on first
attempt. These are each determined from an indicator in the log data
for that problem. Skipping a problem implies only that students never
clicked on a correct answer; they could have worked on the problem
and then given up, or immediately skipped to the next problem with
only a quick look. Solved on first attempt indicates correctly
solving the problem. If neither of the first two bins are indicated
in the logs, then the secOther metric measures the mean time for all
attempts after the first. The divisions of 1.2 sec and 2.9 sec for
the latter three bins were obtained using the mean and one standard
deviation above the mean for all tutor usage; (i) less than 1.2
seconds would indicate guessing, (j) would indicate normal attempts,
and (k) would indicate a long time between attempts.”
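
A minimal sketch of the secOther binning follows. The argument names are assumptions, as is checking the skip indicator before the solved-on-first indicator; the thresholds come from the quote.

```python
def sec_other_letter(skipped, solved_on_first, later_attempt_seconds):
    """Map the log indicators and later-attempt times to the letters g to k."""
    if skipped:
        return "g"
    if solved_on_first:
        return "h"
    if not later_attempt_seconds:
        raise ValueError("expected at least one attempt after the first")
    mean_seconds = sum(later_attempt_seconds) / len(later_attempt_seconds)
    if mean_seconds < 1.2:
        return "i"  # likely guessing
    if mean_seconds <= 2.9:
        return "j"  # normal attempts
    return "k"      # long time between attempts

print(sec_other_letter(False, False, [0.5, 1.0]))
```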

Behaviors in Shanabrook et al.


“numIncorrect (o, p, q)

Each problem has four or five possible answer choices, that we divide
into three groups: (o) zero incorrect attempts, indicates either
solved on first attempt, skipped problem, or last hint solves problem
(defined by the other metrics); (p) indicates choosing the correct
answer in the second or third attempt, and (q) obtaining the answer
by default in a four answer problem or possibly guessing when there
is five answer problem.”
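
A minimal sketch of the numIncorrect binning follows. It assumes that "correct on the second or third attempt" means one or two incorrect attempts, and that three or more incorrect attempts fall into (q).

```python
def num_incorrect_letter(incorrect_attempts):
    """Map the count of incorrect attempts to the letters o, p, q."""
    if incorrect_attempts < 0:
        raise ValueError("count of incorrect attempts cannot be negative")
    if incorrect_attempts == 0:
        return "o"  # solved on first attempt, skipped, or solved by last hint
    if incorrect_attempts <= 2:
        return "p"  # correct on the second or third attempt
    return "q"      # answer reached by default, or possible guessing

print(num_incorrect_letter(0), num_incorrect_letter(2), num_incorrect_letter(3))
```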

What other constructs could be used?


What other kinds of constructs could be used
for the atoms of motif analyses in educational
analyses?


At this grain-size (e.g. specific actions)

What other constructs could be used?


What other kinds of constructs could be used
for the atoms of motif analyses in educational
analyses?


At other grain-sizes?

Common Motifs


{adgo, adip, adiq}

{aeho, afho}

{ceho}

{adgo, aeho}

{aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip}
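
To read these motifs, it can help to decode each four-letter word back into the bins defined on the preceding slides. The lookup table below paraphrases those bin definitions; the name `decode` and the label wording are illustrative assumptions.

```python
# One letter per metric: hints, secFirst, secOther, numIncorrect.
LETTER_MEANINGS = {
    "a": "hints: none",
    "b": "hints: used, solution not given",
    "c": "hints: last hint gave the solution",
    "d": "secFirst: under 5 sec",
    "e": "secFirst: 5 to 30 sec",
    "f": "secFirst: over 30 sec",
    "g": "secOther: skipped",
    "h": "secOther: solved on first attempt",
    "i": "secOther: 0 to 1.2 sec",
    "j": "secOther: 1.2 to 2.9 sec",
    "k": "secOther: over 2.9 sec",
    "o": "numIncorrect: zero incorrect attempts",
    "p": "numIncorrect: correct on second or third attempt",
    "q": "numIncorrect: answer by default or possible guessing",
}

def decode(word):
    """Translate a word such as 'adgo' into one description per letter."""
    return [LETTER_MEANINGS[letter] for letter in word]

for line in decode("adgo"):
    print(line)
```

Decoding `adgo`, for example, gives: no hints, first attempt in under 5 seconds, problem skipped, zero incorrect attempts.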

Interpretations (Shanabrook et al., 2010)


{adgo, adip, adiq}
gaming the system

{aeho, afho}
“This student is using the tutor appropriately, but not being challenged.”

{ceho}
problem is too difficult

{adgo, aeho}
student is skipping problems

{aeiq aeho aeho aekp aeho aeiq aeho aeip aeho aeip}
working on-task

Do you agree with interpretations?



How can researchers form good
interpretations?


What other applications?


What other applications could motif
extraction be used for
in education?

Questions? Comments?


Asgn. 8


Questions?


Comments?

Next Class


Monday, March 19


3pm-5pm


AK232



Association Rule Mining



Readings


Witten, I.H., Frank, E. (2005) Data Mining: Practical Machine
Learning Tools and Techniques. Section 4.5.


Merceron, A., Yacef, K. (2008) Interestingness Measures for
Association Rules in Educational Data. Proceedings of the 1st
International Conference on Educational Data Mining, 57-66.



Assignments Due:

None

The End