Sandeep Jayaprakash, Marist College

voltaireblingΔιαχείριση Δεδομένων

20 Νοε 2013 (πριν από 3 χρόνια και 8 μήνες)

212 εμφανίσεις

June 10
-
15, 2012


Growing Community;
Growing Possibilities

Josh Baron, Marist College

Sandeep

Jayaprakash
, Marist College

Nate Angell,
rSmart



Open Academic Analytics Initiative (OAAI)


Building the Predictive Model


Overview of the process


Data sets used and
d
ata extraction process


Overview of
Pentaho

and training process


Deploying the Predictive Model


Using
Pentaho

to score data


Performance of the predictive model


Producing Academic Alert Reports (AARs)


Overview of Intervention Strategies


Current Outcomes and Next Steps


“Creating an Open Ecosystem

for Learning Analytics”



OAAI is using two primary data sources:


Student Information System (SIS


Banner)


Demographics, Aptitude (SATs, GPA)


Learning Management System (LMS)


Event logs,
Gradebook


Goal: create
open
-
source

“early alert” system


Predict “at risk” students in first 3 weeks of a course


Deploy intervention to ensure student succeeds

Student Attitude Data
(SATs, current GPA, etc.)

Student Demographic
Data (Age, gender, etc.)

Sakai Event Log Data

Sakai
Gradebook

Data

Predictive
Model

Scoring

Identified
Students “at
risk” to not
complete
course

Static data

Dynamic Data

Intervention

Deployed

Model developed
w/ historical data


Purdue University’s Course Signals Project


Built on dissertation research by Dr. John Campbell


Ellucian

product that integrates with Blackboard


Students in courses using Course Signals


scored up to 26% more A or B grades


up to 12% fewer C's; up to 17% fewer D's and F‘s


Positive affect on four year retention rates


No Course Signals courses


69%


Two or more Course Signals courses
-

93%


Interventions that utilize “support groups”


Improved 1st and 2nd semester GPAs


Increase semester persistence rates (79% vs. 39%)


Building “open ecosystem” for Learning analytics


Sakai Collaboration and Learning Environment


Sakai API to automate secure data capture


Will also facilitate use of Course Signals & IBM SPSS


Pentaho

Business Intelligence Suite


OS data mining, integration, analysis and reporting tools


OAAI Predictive Model released under OS license


Predictive Modeling Markup Language (PMML)


Researching critical analytics scaling factors


How “portable” are predictive models?


What intervention strategies are most effective?


Release Sakai Academic Alert System (beta)


Will be included as part of Sakai CLE release


Conducted real world pilots


36 courses at community colleges


36 courses at HBCUs


Research finding related to…


Strategies for effectively “porting”

predictive models


The use of online communities and OER to impact
on course completion, persistence and content
mastery.


Wave I EDUCAUSE

Next Generation Learning

Challenges (NGLC) grant


Funded
by
Bill and Melinda

Gates
and
Hewlett Foundations


$250,000 over a 15 month period


Began May 1, 2011, ends January 2013
(extended)

Overview of the process

Data sets
and
data extraction
Overview
of
Pentaho

and
training process

2012 Jasig Sakai Conference

9

Development and initial deployment of an
“open source” predictive model of academic
risk



Methodological framework for model
development



Empirical analysis of predictive performance
(preliminary results)

2012 Jasig Sakai Conference

10


Data Integration using
Pentaho

Kettle


Data Extraction phase


Transformation phase


Load phase


Predictive Modeling using
Pentaho

WEKA



Training phase



Testing phase

2012
Jasig

Sakai Conference

11

2012 Jasig Sakai Conference

12

Personal (Bio) Data

Course Data

Performance Data

CMS (Sakai) event data

Partial grades


(
g
radebook
) data

ETL Layer


Anonymized


Data by

Institution,

Semester,

Program

(Grad/
Ugrad
),

Course,

Student


Target feature

Partition

Training

Data

Test

Data

Balance

Library of


predictive


models

(Classifiers)


Balanced

Training

Data

Train

Test

Store

New

Data

Predict

(Score)

Results

Software Platform: IBM SPSS Modeler,
Pentaho

Weka
,
Pentaho

Kettle

Predictive Modeling Layer: Training Testing and Scoring

Hardware Platform: IBM x3400
-

Xeon E5410 2.33 GHz, Quad
-
Core, 64 bit, 10GB RAM

OS: Windows Server 2008 Standard Edition

Source Data

Source Data

Algorithms

Software
Platform:


SQL

Server

2008 R2


Pentaho


Kettle

Data

Pre
-
processing
:

missing

values,


outliers,

derived

features


SQL queries to extract grade and user event
data from Sakai CLE (see
Sakai wiki

for
details)


Ensure access to historical data: data
warehouse,
backups etc.


Extract from backup to ensure no impact on
production performance


Encrypting user IDs for user
anonymization


2012 Jasig Sakai Conference

13


Data mining and predictive modeling are affected
by input data of diverse
quality


A predictive model is usually as good as its training
data


Good: lots of data


Not so good: Data Quality Issues


Not
so good: Unbalanced classes (at Marist, 6% of
students at risk. Good for the student body, bad
for training predictive models


)

2012 Jasig Sakai Conference

14


Variability in instructor’s assessment criteria


Variability in workload criteria


Variability in period used for prediction (early
detection)


Variability in Multiple instance data (partial
grades with variable contribution, and
heterogeneous composition)



Solution: Use ratios


Percent of usage over
Avg

percent of usage per
course


Effective Weighted Score /
Avg

Effective Weighted
Score

2012 Jasig Sakai Conference

15


Variability in Sakai tools usage


No uniform criterion in the use of CMS tools
(faculty members are a wild bunch

)


Tools not used, data not entered, too much
missing data

2012 Jasig Sakai Conference

16


Forums Content Lessons Assigns
Assmnts

2012 Jasig Sakai Conference

17

Personal (Bio) Data

Course Data

Performance Data

CMS (Sakai) event data

Partial grades


(
g
radebook
) data

ETL Layer


Anonymized


Data by

Institution,

Semester,

Program

(Grad/
Ugrad
),

Course,

Student


Target feature

Partition

Training

Data

Test

Data

Balance

Library of


predictive


models

(Classifiers)


Balanced

Training

Data

Train

Test

Store

New

Data

Predict

(Score)

Results

Software Platform: IBM SPSS Modeler,
Pentaho

Weka
,
Pentaho

Kettle

Predictive Modeling Layer: Training Testing and Scoring

Hardware Platform: IBM x3400
-

Xeon E5410 2.33 GHz, Quad
-
Core, 64 bit, 10GB RAM

OS: Windows Server 2008 Standard Edition

Source Data

Source Data

Algorithms

Software
Platform
:


SQL

Server

2008 R2


Pentaho


Kettle

Data

Pre
-
processin
g:

missing

values,


outliers,

derived

features


Fall 2010 data sample of undergraduate
students


Datasets were joined and data was cleaned,
recoded, and aggregated to produce an input
data file of 3877 records corresponding to
courses taken by students.



2012 Jasig Sakai Conference

18

Feature Type
Feature Name
Predictors
GENDER, SAT_VERBAL,
SAT_MATH, APTITUDE_SCORE,
FTPT, CLASS, CUM_GPA,
ENROLLMENT, ACADEMIC
_STANDING, RMN_SCORE,
R_SESSIONS, R_CONTENT_READ
Target
ACADEMIC_RISK
(1
=
at
risk;
0
student in good standing)

Use
Weka

3.7 and IBM SPSS Modeler 14.2.


Generate 5 different random partitions (70%
training, 30% testing)


Balance each training dataset


Train a predictive model (Logistic Regression, C4.5
Decision Tree, SVM) for each balanced training
dataset


5 x 3 = 15 models


Measure predictive performance of classifiers


recall, precision, specificity


Produce summary measures (mean and standard
error)


2012
Jasig

Sakai Conference

19


At model training time:

o
For all students in all courses compute their effective weighted score
as
sumproduct
(partial scores, partial weights) * (1 / sum partial
weights)

o
Compute the effective
Avg

Weighted score for the course

o
Calculate the ratio as:

RMN_SCORE =
effectiveWeighted

Score
/ effective
Avgweighted

score





At testing time:

o
For all students in the course tested compute their effective weighted
score

o

as
sumproduct
(partial scores, partial weights) * (1 / sum
partialwieghts
)

o
Compute the effective
Avg

Weighted score for the course

o
Calculate the ratio as:

RMN_SCORE = effective Weighted
score
/ effective
Avgweighted

score

2012 Jasig Sakai Conference

20

2012
Jasig

Sakai Conference

21

Classif.
Perform.
Algorithm
Metrics
Train
Test
Train
Test
Train
Test
Train
Test
Train
Test
Train
Test
Train
Test
5089
1182
5041
1146
4952
1207
5014
1159
5095
1134
Accuracy
93.40%
90.52%
92.66%
90.49%
91.54%
88.73%
92.98%
89.91%
93.27%
90.48%
92.77%
90.03%
0.74%
0.77%
Recall
95.91%
86.15%
93.94%
77.46%
92.64%
86.30%
94.51%
77.78%
95.83%
85.29%
94.57%
82.60%
1.37%
4.56%
Specificity
90.85%
90.78%
91.43%
91.35%
90.47%
88.89%
91.50%
90.71%
90.76%
90.81%
91.00%
90.51%
0.45%
0.94%
Precision
91.42%
35.22%
91.36%
37.16%
90.46%
33.33%
91.46%
35.67%
91.03%
37.18%
91.14%
35.71%
0.42%
1.59%
Accuracy
99.96%
94.59%
99.64%
95.46%
99.94%
94.61%
99.96%
94.39%
99.65%
94.80%
99.83%
94.77%
0.17%
0.41%
Recall
100.00%
66.15%
99.39%
61.97%
100.00%
58.90%
100.00%
55.56%
99.40%
54.41%
99.76%
59.40%
0.33%
4.80%
Specificity
99.92%
96.24%
99.88%
97.67%
99.88%
96.91%
99.92%
96.96%
99.88%
97.37%
99.90%
97.03%
0.02%
0.54%
Precision
99.92%
50.59%
99.88%
63.77%
99.88%
55.13%
99.92%
54.79%
99.88%
56.92%
99.90%
56.24%
0.02%
4.81%
Accuracy
91.26%
89.00%
89.59%
90.23%
90.15%
88.48%
91.12%
88.96%
90.30%
91.01%
90.48%
89.54%
0.70%
1.05%
Recall
92.98%
86.15%
89.09%
88.73%
90.80%
90.41%
92.07%
83.33%
91.07%
89.71%
91.20%
87.67%
1.46%
2.91%
Specificity
89.50%
89.17%
90.06%
90.33%
89.51%
88.36%
90.21%
89.33%
89.55%
91.09%
89.77%
89.65%
0.34%
1.06%
Precision
90.00%
31.64%
89.63%
37.72%
89.41%
33.33%
90.06%
34.09%
89.51%
39.10%
89.72%
35.18%
0.29%
3.12%
SE
SVN
C5.0
Logistic
Reg.
1
2
3
4
5
Mean

Logistic regression and SVM did much better
that C5.0 / J4.8


Detect 82% to 87% of the student population at
risk.


In comparison, recall of C5.0 / J4.8: 59% (why so
low?)



False positives:


10% of false positives over Ok students (C5.0 /
J4.8 does better: 3%)


65% of predictions are false alarms (C5.0 / J4.8
does better: 44%)


2012 Jasig Sakai Conference

22


For logistic regression


RMN_SCORE


ACADEMIC_STANDING CUM_GPA


Then R_SESSIONS and SAT_VERBAL



For the SVM classifier


RMN_SCORE


CUM_GPA, ACADEMIC_STANDING, R_SESSIONS and
SAT_VERBAL



C5.0/J4.8


Minimal difference among predictors

2012 Jasig Sakai Conference

23


Results are encouraging, although the number of
false alarms raises some concern




Differences among classifiers, in particular DTs
(typically very robust classifiers), requires further
investigation.



Data quality (missing values) remains an open
issue with partial remediation



Partial
-
grades
-
derived score (RMN_SCORE)
remains as the best predictor.



CMS generated events appear to be second tier
predictors

2012 Jasig Sakai Conference

24

Using
Pentaho

to score data

Performance of the predictive model

Producing Academic Alert
Reports

2012
Jasig

Sakai Conference

25


ETL phase



Remains similar to the ETL process used for
training the model except for the records have
missing data are also retained



Scoring phase

Utilizes WEK


Scoring

plugin

to

embed

WEKA
predictive

model

into

Pentaho

Kettle



Reporting phase

Pentaho

Report

Designer

tool

to

create

a

template

for

reporting.

2012 Jasig Sakai Conference

26

2012 Jasig Sakai Conference

27

Personal (Bio) Data

Course Data

Performance Data

CMS (Sakai) event data

Partial grades


(
g
radebook
) data

ETL Layer


Anonymized


Data by

Institution,

Semester,

Program

(Grad/
Ugrad
),

Course,

Student


Target feature

Partition

Training

Data

Test

Data

Balance

Library of


predictive


models

(Classifiers)


Balanced

Training

Data

Train

Test

Store

New

Data

Predict

(Score)

Results

Software Platform: IBM SPSS Modeler,
Pentaho

Weka
,
Pentaho

Kettle

Predictive Modeling Layer: Training Testing and Scoring

Hardware Platform: IBM x3400
-

Xeon E5410 2.33 GHz, Quad
-
Core, 64 bit, 10GB RAM

OS: Windows Server 2008 Standard Edition

Source Data

Source Data

Algorithms

Data

Pre
-
processi
ng:

missing

values,


outliers,

derived

features


Software
Platform
:


SQL

Server

2008 R2


Pentaho


Kettle


2012 Jasig Sakai Conference

28

Awareness Messaging

Online Academic Support
Environment (OASE)

2012 Jasig Sakai Conference

29


Researching effectiveness of two strategies


Awareness Messaging


Online Academic Support Environment (OASE)

2012 Jasig Sakai Conference

30


OER Content


Self
-
Assessments


Learning
S
kills
-

Flat World
Knowledge


Learning Support
Facilitation &
Mentoring

Initial research findings

Future efforts

2012 Jasig Sakai Conference

31


Applied
similar analytical techniques
to those used by Campbell at Purdue,
using Fall 2010
data


Marist College and Purdue University


Differences (institutional type and
size)


Similarities: % students receiving
federal Pell Grants, % ethnicity,
ACT composite ACT composite
25th/75th percentile


We found similarities in correlation
values
.


As
in the case of Purdue, all these
metrics are found to be significantly
correlated with course grade, with
rather low correlation values
.

2012
Jasig

Sakai Conference

32


Marist
Fall 2010
N=18968
N=27276
Correlation
0.147
Significance
0.000(**)
N
11195
Correlation
0.098
0.112
Significance
0.000(**)
0.000(**)
N
7651
19205
Correlation
0.133
0.068
Significance
0.000(**)
0.000(**)
N
1552
7667
Correlation
0.233
0.061
Significance
0.000(**)
0.000(**)
N
1507
7292
Correlation
0.146
0.163
Significance
0.000(**)
0.000(**)
N
3245
4309
Correlation
0.161
0.238
Significance
0.000(**)
0.000(**)
N
1423
4085
(**) Significant at the 0.01 level (2-tailed)
Marist data uses ratios over course mean
instead of frequencies
(no values
reported)
Campbell
(2007)
Undergraduate CMS
event frequencies
Assign.
Submitted
Asssmts
Submitted
Course Grade
Sessions
Opened
Content
Viewed
Discussions
Read
Discussions
Posted

Initial review of instructor “research logs”
showed general agreement with predictions


Faculty/student feedback has been positive

2012 Jasig Sakai Conference

33

"Not only did this project directly assist my students by guiding students
to resources to help them succeed, but as an instructor, it changed my
pedagogy;
I became more vigilant about reaching out to individual
students

and providing them with outlets to master necessary skills
.


P.S. I have to say that this semester,
I received the highest volume of
unsolicited positive feedback from students
, who reported that they felt I
provided them exceptional individual attention!


Develop and release “student effort
data”API


Develop a Sakai Academic Alert dashboard


Create customized predictive models for
different academic contexts


Work to facilitate use of SNAPP with Sakai

2012

Jasig

Sakai Conference

34

2012

Jasig

Sakai Conference

35


Sakai Confluence Wiki


Open Academic
Analytics Initiative (OAAI)

https
://
confluence.sakaiproject.org/pages/viewpage.action?pageId=75671025




Contact


Josh Baron


Senior Academic Technology Officer ,
Marist College



josh.baron@marist.edu



Sandeep

M.
Jayaprakash



Technical Consultant
OAAI, Marist College

sandeep.jayaprakash1@marist.edu


2012

Jasig

Sakai Conference

36

2012

Jasig

Sakai Conference

37

2012 Jasig Sakai Conference

38

2012 Jasig Sakai Conference

39

2012 Jasig Sakai Conference

40


2012
Jasig

Sakai Conference

41

Data Sets

Data Extraction

(course event data
aggregated, student data
added, student identity
removed)

Banner

(ERP)





Sakai

Student Data

(Demographics &


Course enrollment)

Course Event Data

Data Pre
-
processing

(missing values,
outliers, incomplete
records, derived
features)

Identifying student information is
removed during the data extraction
process

2012 Jasig Sakai Conference

42


2012 Jasig Sakai Conference

43






SVM/SMO

Logistic

DT J4.8

NBayes

Knowledge Flow

Filter

Balance

Partition

2012 Jasig Sakai Conference

44

Modeler

Logistic

DT C5.0

SVM

Balance

Partition

Filter


Balance the training dataset


Subsample


( Reduce the dataset)


Oversample



(Duplicate records if it’s a minority)


SMOTE


Nitesh

V
Chawla
. Et.al(2002) Synthetic Minority Over
Sampling Technique. Journal of Artificial Intelligence Research.
16.321
-
357.







2012 Jasig Sakai Conference

45


In the case of unbalanced classes, Accuracy is a poor measure


Accuracy

= (TP+TN) / (TP+TN+FP+FN)


The large class overwhelms the metric


Better Metrics:


Recall

= TP / (TP+FN) Ability to detect the class of
interest


Specificity
= TN / (TN+FP) Ability to rule out the
unimportant class


Precision

= TP / (TP+FP) Ability to rule out false
alarms


Confusion Matrix


2012 Jasig Sakai Conference

46

2012 Jasig Sakai Conference

47

Logistic Regression

C4.5/C5.0 Boosted Decision Tree

Support Vector machines

2012 Jasig Sakai Conference

48

2012 Jasig Sakai Conference

49

2012 Jasig Sakai Conference

50