Introduction to Data Science


INFO 480


Drexel University’s iSchool

Sean P. Goggins, PhD

April 2, 2013

Week One

What is “Data Science”?


Data scientists are story tellers.

Variables?

Time

Number of Participants

Time, Number of participants

Number of new participants, Number of sustained participants

Story Telling


What is Data Science?


Storytelling


Database Theory



How you organize your data has a big influence on what you can do with it.


Agile Manifesto



The key thing is iterative development; it’s a technology value system.


Spiral Dynamics



What we view as fact and what we desire emerges from the data presented to us.

Credit: http://www.datascientists.net/what-is-data-science


Database Theory


Relational Algebra & Set Theory


Thinking in relations helps you connect disparate data:


What is the connecting field?


What is the cardinality?


Set theory helps you think about summarizing data:


What time period? Weeks? Months?


By person? By Group? By Geography?
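As a rough illustration of the questions above, the sketch below joins two small tables on a connecting field, checks the cardinality of the relationship, and then summarizes by week and by person. The tables, the field names (`person_id`, `week`, `group`) and the helper functions are hypothetical and are not taken from the course data.

```python
from collections import Counter, defaultdict

# Hypothetical tables: people and their posts, linked by person_id.
people = [
    {"person_id": 1, "name": "Ada", "group": "A"},
    {"person_id": 2, "name": "Ben", "group": "B"},
]
posts = [
    {"post_id": 10, "person_id": 1, "week": 1},
    {"post_id": 11, "person_id": 1, "week": 2},
    {"post_id": 12, "person_id": 2, "week": 2},
]

def cardinality(left, right, key):
    """Describe the relationship between two tables on a connecting field."""
    left_max = max(Counter(row[key] for row in left).values())
    right_max = max(Counter(row[key] for row in right).values())
    side = lambda n: "one" if n == 1 else "many"
    return f"{side(left_max)}-to-{side(right_max)}"

def join(left, right, key):
    """Inner join: pair every right row with each left row sharing the key."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    return [{**l, **r} for r in right for l in index[r[key]]]

def count_by(rows, *fields):
    """Summarize: count rows per combination of the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)

joined = join(people, posts, "person_id")
print(cardinality(people, posts, "person_id"))  # one-to-many
print(count_by(joined, "week"))                 # summary by time period
print(count_by(joined, "name", "week"))         # summary by person and week
```

Changing the fields passed to `count_by` is the code equivalent of asking "by person? by group? by week?" of the same joined data.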

Agile Manifesto


Individuals and interactions over processes and tools


Working software over comprehensive documentation


Customer collaboration over contract negotiation


Responding to change over following a plan

http://www.agilemanifesto.org


Spiral Dynamics

New research unveiled at this year’s AERA conference documents a disturbing trend among the nation’s secondary schools: Between 2001 and 2012, high school graduation rates regularly spiked in late May and early June, ballooning from near zero to a staggering average of 78 percent.

What you’ll need for this course


Interest in learning data analysis tools


R


Python


Curiosity


A laptop to bring to class (see me if this is a problem)


Persistence


A Github Account


Willingness to do weekly homework and participate in online iteration of data products you and your course mates develop


A Dropbox account will be helpful


An Outline of The Class


Two Books


Janert, P. K. (2011). Data Analysis with Open Source Tools. New York: O’Reilly.


Vaidhyanathan, S. (2005). The Anarchist in the Library: How the Clash Between Freedom and Control is Hacking the Real World and Crashing the System. New York: Basic Books.


Ten Weeks



Schedule

Activity One


Academic Integrity Form at the Back of the Syllabus


15 Minutes to Work on Tool Configuration


R Download


Python Download & Config


http://seangoggins.net/info480


Github Account and Client


http://mac.github.com


http://windows.github.com





Basic R Instructions

Installation


Download and install R


Run the “R-Libraries.R” script from the root of the GitHub project directory

Basics

Google Trends: 2005 to Present


Produce a trend graph for Google search phrases.


Identify four search phrases. Describe in two sentences what makes these search phrases a coherent comparison. What do they have in common? How do they provide a useful contrast?


BEFORE you run the search, write down what you expect the trends to look like. Spiky, trending upward, trending downward?


Examine the resulting changes.


What did you find? What are some theories that might explain the similarities or differences you observed?

Network Analysis

Telling Stories: A Visualization of Purely Qualitative Data

You CAN do that without Quantitative Data

You can do it with Qualitative Data

And A LOT of Quantitative Data REQUIRES qualitative Analysis

Idealized Distributed Team

Actual Models Found

The Different ICT Roles

Organizational Evolution

Network Analysis


Github Activities

Underpants Gnomes

With much discourtesy from the US TV Program “South Park”


Addressing the Underpants Gnome Postulate

Discussion Post: Read, Response

Classification: Open Coding, Axial Coding

Identification of Coordination Events: Time proximity, Topical proximity

Aggregation of Posts by Topic

Weighted Network Analysis of Interactions

Methodological Approach

Weight connections based on time distance, grouped by topic and informed by analysis of time distance between posts.


Identify key information brokers.
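One way to picture the weighting step is the sketch below. The slides do not give the actual weighting formula, so the decay function `1 / (1 + gap in hours)`, the post records and the author names are all assumptions made for illustration. Posts are grouped by topic, and every pair of posts by different authors within a topic adds weight to the tie between those authors, with closer posts adding more.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical posts: (author, topic, time in hours since the start of the data).
posts = [
    ("ann", "bug-12", 0.0),
    ("bob", "bug-12", 1.0),
    ("cat", "bug-12", 3.0),
    ("ann", "docs", 5.0),
    ("cat", "docs", 5.0),
]

def weighted_edges(posts):
    """Sum 1 / (1 + gap in hours) over post pairs that share a topic.

    The decay function is an assumption made for this sketch.
    Edges are undirected and keyed by the sorted author pair.
    """
    by_topic = defaultdict(list)
    for author, topic, hour in posts:
        by_topic[topic].append((author, hour))

    weights = defaultdict(float)
    for topic_posts in by_topic.values():
        for (a, t_a), (b, t_b) in combinations(topic_posts, 2):
            if a != b:  # ignore an author's ties to their own posts
                weights[tuple(sorted((a, b)))] += 1.0 / (1.0 + abs(t_a - t_b))
    return dict(weights)

edges = weighted_edges(posts)
for pair, weight in sorted(edges.items()):
    print(pair, round(weight, 3))
```

The resulting weighted edge list is the kind of input a weighted network package expects; which people count as brokers would then depend on the centrality measure chosen.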


Data: Github

Github Network Activity One

Actual R Code


Work through setup


Scripts are ready to run


Talk through them and walk around to help

Further Analysis Tools


Eight Mylyn Releases (Temporal Analysis)


R Packages Used


TNET


iGraph


Statnet


Weighted Network: TNET

The Dense Graph (Work)


Developers create a dense graph. Not a complete graph, but dense.

Work

A Sparser Graph (Talk)


Commenters create a sparse graph.

Talk
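The difference between a dense graph, a complete graph and a sparse graph can be made concrete with graph density: the share of possible ties that are actually present. The sketch below uses two made-up five-person edge lists, not the Mylyn data, and the function name is hypothetical.

```python
def density(num_nodes, edges, directed=False):
    """Share of possible ties that are present (1.0 means a complete graph)."""
    if directed:
        unique = set(edges)
        possible = num_nodes * (num_nodes - 1)
    else:
        # Treat (a, b) and (b, a) as the same undirected tie.
        unique = {tuple(sorted(edge)) for edge in edges}
        possible = num_nodes * (num_nodes - 1) // 2
    return len(unique) / possible if possible else 0.0

# Hypothetical five-person networks.
work = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
talk = [(1, 2), (2, 3), (4, 5)]

print(density(5, work))  # 0.8: dense, but not complete
print(density(5, talk))  # 0.3: sparse
```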

Release One (2.0) Analysis

[Figure: iGraph plots of the Code (Work) and Discussion (Talk) networks.]

STATNET for Discussion

[Figure: Statnet plot of the Talk network. Red = Bug Commenter; Blue = Bug Opener.]

Release One: Work & Talk

Release 1 (2.0): iGraph & Statnet

[Figure: Talk network drawn with iGraph and Statnet, showing clusters, in degree and out degree. Red = Bug Commenter; Blue = Bug Opener. One node is labeled as the Google Summer Coder.]

Release One (2.0): Filtered

[Figure: filtered Code (Work) and Discussion (Talk) networks. Red = Bug Commenter; Blue = Bug Opener. One node is labeled as the Google Summer Coder.]

304, 373, 399 & 143 form the strongest connections in both networks.

457, 391 & 159 comment and open.

Compare Over Time: First & Last Release

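Release 1 (2.0) Compared to Release 8 (3.3): Talk

[Figure: Talk networks for the two releases, drawn with Statnet and ordinary plotting. Nodes labeled in one plot: 304, 399, 143, 159, 173, 373. Nodes labeled in the other: 399, 118, 304, 159, 391, 416.]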

Release 1 (2.0) Compared to Release 8 (3.3): Work

[Figure: iGraph plots of the Work networks for the two releases; 304, 373, 399 & 143 are labeled.]

Two disconnected graphs in release 8.

143 & 304 disengaged or missing entirely.

Release Eight: Work & Talk

Release 8 (3.3): Filtered

[Figure: filtered Code (Work) and Discussion (Talk) networks. Red = Bug Commenter; Blue = Bug Opener.]

Nobody is “Just Blue”.

Notice 416 in Talk and in the second coder graph.

Release 8 (3.3): iGraph & Statnet

[Figure: Talk network drawn with iGraph and Statnet, showing clusters, in degree and out degree. Red = Bug Commenter; Blue = Bug Opener. A blue cluster is labeled.]

399, 118 & 159 are significant, but play with different clusters of other people.

Releases One to Eight: High Level Views Over Time

Discussion, Releases 1 to 8

Where there is no color, there are multiple, incomplete graphs.

Code, Releases 1 to 8

One possible explanation: a few central people who slowly but observably begin to engage other contributors in an open source software development project.

Structure evolves

Key groups evolve

[Figure: iGraph plots of the Code network across releases.]

Github: Longitudinal

Github Statnet Summaries

Built with This R Code

Longitudinal Network Libraries

GGPLOT2 Code

library(ggplot2)

# Plot in-degree minus out-degree per person over time; line size tracks betweenness.
pc <- ggplot(gMerged, aes(timeperiodNum, inDegree - outDegree,
                          group = person, size = betweenness))

# theme(), element_text() and element_rect() replace the older
# opts(), theme_text() and theme_rect() calls from the original slide.
pc <- pc + geom_line(aes(colour = person)) +
  guides(colour = guide_legend(nrow = 7)) +
  theme(legend.text = element_text(colour = "blue", size = 20),
        legend.title = element_text(size = 24),
        legend.position = "top",
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20, angle = 90),
        panel.background = element_rect(fill = "white", colour = NA)) +
  ylab("In Degree MINUS Out Degree") +
  xlab("Time Period Number")
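The quantity on the y axis, in degree minus out degree per person per time period, can be derived from a list of directed ties. The slides do not show how `gMerged` was built, so the following is only a sketch under assumed inputs: the tie records, person labels and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical directed ties: (time period, sender, receiver).
ties = [
    (1, "a", "b"),
    (1, "c", "b"),
    (1, "b", "d"),
    (2, "b", "a"),
    (2, "d", "a"),
]

def degree_balance(ties):
    """Return {(period, person): in-degree minus out-degree}."""
    balance = defaultdict(int)
    for period, sender, receiver in ties:
        balance[(period, sender)] -= 1    # one outgoing tie
        balance[(period, receiver)] += 1  # one incoming tie
    return dict(balance)

balance = degree_balance(ties)
for (period, person), value in sorted(balance.items()):
    print(period, person, value)
```

A positive value means a person receives more ties than they send in that period; a negative value means the reverse.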


Github Longitudinal



GITHUB: Types of Work

Code & Data for Types of Work

# Bar chart of activity per time period, filled by person, one panel per work type.
gg <- ggplot(pullerListTyped, aes(timeperiodNum, fill = person)) +
  geom_bar() +
  facet_grid(codeCode ~ .)

ggsave("fix-feature-pullers.png", dpi = 600, plot = gg, width = 12, height = 12)

Practice Activities


Our First Python Program


Network Analysis in R


https://github.com/The-Art-of-Big-Social-Data/info480


Week 1 folder contains a Python script that can be opened in iPython, following the instructions at http://www.seangoggins.net/info480


Github Networks folder contains two activities

Upcoming


No Lecture next week


Three Readings for next week


I’ll set up an online discussion forum using GitHub to focus a discussion. Participation is required. I don’t count posts, I don’t count words; I evaluate thoughtfulness.


Selecting a Data Set


By this Friday I will post data sets with analysis questions. Your task will be to think through how the data is organized, and describe *in words* how that data needs to be reshaped to answer the analysis questions.

Week Three


Assignment One


Prepared Data #1 (transform an instructor-supplied data set into an analyzable form using open source tools)


Instructor supplies a public data set for analysis, along with a set of questions or insights that are obtainable from the data set.


Included in the assignment is a 500-word explanation of how the preparation of the data alters the original data and a description of how the questions can be answered using the data (reshaping).


Transform the data, using whatever tools you are familiar with, into the “shape” required for analysis. (For Assignment 2 you will have to share a script used for the transformation, so if you use a tool instead of a language the first time, you will need to write a script to do the transformation for Assignment 2.)
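To give a sense of what a scripted reshaping might look like, here is a minimal sketch that pivots "long" rows (one row per person and week) into "wide" rows (one row per person, one column per week). The data, column names and function are hypothetical; the instructor-supplied data sets may need a different transformation.

```python
from collections import defaultdict

# Hypothetical "long" data: one row per person and week.
long_rows = [
    {"person": "ann", "week": 1, "posts": 3},
    {"person": "ann", "week": 2, "posts": 5},
    {"person": "bob", "week": 1, "posts": 2},
]

def long_to_wide(rows, row_key, col_key, value_key, fill=0):
    """Pivot long rows into one dict per row_key with a column per col_key."""
    columns = sorted({row[col_key] for row in rows})
    table = defaultdict(dict)
    for row in rows:
        table[row[row_key]][row[col_key]] = row[value_key]
    return [
        {row_key: key, **{f"{col_key}_{c}": cells.get(c, fill) for c in columns}}
        for key, cells in sorted(table.items())
    ]

wide = long_to_wide(long_rows, "person", "week", "posts")
for row in wide:
    print(row)
```

Note that filling missing cells with 0 is itself a choice that alters the original data, which is the kind of decision the written explanation should describe.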