H-Net and Scholarly Discourse in the Digital Age:
A New Approach to Data Mining Email Discussion Lists

William Punch

Mark Lawrence Kornbluh

Wayne Dyksen

Michigan State University


November 5, 2006

H-Net and Scholarly Discourse in the Digital Age

Opportunity & Challenge
  Searching Large Text Archives

New Approach
  Semantic-Augmented Consensus Clustering

Application
  H-Net Discussion Lists

IT Communication Revolution

People:    Few → Many;  Experts → Everyone
Speed:     Slow → Instant
Quantity:  Small → Vast
Style:     Long → Short
Location:  Limited → Everywhere
Lifetime:  Short → Forever


Impact on Scholarly Communication

New…
  Forms of Interactivity
  Trans-Disciplinary Communities
  Participants
    Producers
    Consumers
  Levels of Democratization of Information

Electronic Archives

  Mostly Text Based
  Exponential Growth
  Not Catalogued or Catalog-able
  Little or No Metadata
  Untapped Value
    Current Users
    Future Scholars




Opportunity & Challenge

Large Text Archive → Information → Knowledge

Typical Document Search

Data
  Words and Phrases
  Boolean Combinations
  Automatic (“Unsupervised”)
  Not Sufficient: Too Little or Too Much

Metadata
  Keywords & Annotations
  Classifications
  By Hand (“Supervised”)
  Not Scalable: 1M Messages, 3GB Text

Our Research

On Large Text Archives
  Organization
  Exploration

Develop and Test
  New Techniques
  New Tools

Interdisciplinary

Large Text Archive → Information → Knowledge

H-Net and Scholarly Discourse in the Digital Age

Opportunity & Challenge
  Searching Large Text Archives

New Approach
  Semantic-Augmented Consensus Clustering

Application
  H-Net Discussion Lists

Two Approaches

Very broadly, there are two approaches we could use to aid a user in finding documents in a large set:



Classification


Clustering


The Two Spiral Problem

Our little example: how to discriminate the two intertwined spirals.

[Figure: the two-spiral data set, shown as the combination of two separate spirals]

Classification

Given k classes, find the best class in
which to place a particular example

Typically two stages:


Train the algorithm on examples from
the k classes


See how well the algorithm does on
placing an unknown into the correct
class




Classification Example

[Diagram: labelled examples from Class 1 and Class 2 train the algorithm; the trained algorithm is then tested on an unknown example (“?”)]

Supervised

Classification is a supervised process. We know the k classes (or have a good idea of them), so we train the algorithm to work properly on labelled examples, then test how well it learned by giving it unknowns.
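To make the two stages concrete, here is a minimal, hypothetical sketch in Python using scikit-learn; the talk does not name a library or data set, so the toy points and the nearest-neighbour classifier are assumptions:

```python
# Hypothetical sketch of supervised classification: train on labelled
# examples from k = 2 known classes, then test on held-out "unknowns".
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],   # class 0 examples
     [1.0, 0.9], [0.9, 1.0], [0.8, 0.9], [0.9, 0.8]]   # class 1 examples
y = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)             # stage 1: train on the known classes
print(clf.score(X_test, y_test))      # stage 2: accuracy on the unknowns
```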

Clustering

Slightly different: given a set of examples, find the “best” partitioning of those examples into k sets.

Also two stages:

1. Cluster the examples (we provide k)

2. Measure somehow how well separated the resulting clusters are

Example

[Diagram: unlabelled example points are fed to the clustering algorithm, which partitions them into k groups]

Unsupervised

There is typically no training in clustering. We choose where to put a point based on some criterion of “closeness”.

As you can see, that can be hard to measure.
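A minimal unsupervised sketch, again assuming Python and scikit-learn: we supply k, cluster by Euclidean “closeness”, and use a silhouette score as one possible separation measure. On the intertwined spirals such a distance-based criterion does poorly, which is exactly the point of the example.

```python
# Hypothetical sketch of clustering: partition unlabelled points into k
# groups by distance, then measure how well separated the groups are.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (20, 2)),    # one blob
                    rng.normal(1.0, 0.1, (20, 2))])   # another blob

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(silhouette_score(points, labels))   # stage 2: separation measure
```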

Document Clustering

Our approach is to cluster documents (instead of points in a spiral), grouping documents that are “close” to each other in meaning.

The result should be sets of documents that have something in common, especially if the process is user-influenced.
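As an illustration only (the talk does not specify a document representation), documents can be clustered by one crude notion of textual closeness: TF-IDF vectors plus k-means. The sections that follow replace this single measure with a consensus of many measures, including a semantic one.

```python
# Hypothetical sketch: cluster documents by one textual "closeness" measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["church reform and secular power in the twelfth century",
        "kings appointed bishops and abbots as public servants",
        "clustering algorithms partition large document sets",
        "consensus clustering combines many cheap partitions"]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors))
```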

Three general problems we will address:

  Consensus clustering

  Semantic distance measure

  Semi-supervised user influence on the clustering process

One: Consensus Clustering

Two basic problems:

1. No single measure of “closeness” is usually sufficient to get good clusters; it should be a combination of many such measures

2. On large document sets, any algorithm is likely to be expensive; run on smaller subsets of the overall set, however, it is much cheaper

Example

Simplest clustering algorithm ever
invented! Draw a random line through
the cluster space. One side is cluster 1,
the other side cluster 2.
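A sketch of that idea, with the details (a random direction plus a random offset) filled in as assumptions:

```python
# Hypothetical sketch of the "random line" clusterer: points on one side
# of a randomly chosen line are cluster 1, the rest are cluster 2.
import numpy as np

def random_line_cluster(points, rng):
    direction = rng.normal(size=points.shape[1])           # random direction
    projection = points @ direction                        # project points onto it
    cut = rng.uniform(projection.min(), projection.max())  # random offset
    return (projection > cut).astype(int)                  # side of the line

rng = np.random.default_rng(0)
points = rng.uniform(size=(100, 2))
print(random_line_cluster(points, rng)[:10])
```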

And the results ….


Um, so why?

1. The algorithm is cheap, very cheap! Draw a line through the “space”. Cheap is good when you are worried about large numbers.

2. It turns out that multiple applications of it, each poor on its own, give very good results when taken together in consensus!

3. Multiple “measures” can be accounted for this way (see the sketch below).
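One common way to take such weak partitions together in consensus is a co-association matrix: count how often each pair of points lands in the same cluster across many cheap runs, turn the agreement into a distance, and cluster that once. This aggregation scheme is an assumption for illustration, not necessarily the exact method used here; it reuses random_line_cluster from the sketch above.

```python
# Hypothetical consensus sketch: many cheap random-line partitions are
# combined through a co-association (pairwise agreement) matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_cluster(points, n_runs=200, k=2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    co = np.zeros((n, n))
    for _ in range(n_runs):
        labels = random_line_cluster(points, rng)   # weak, cheap partition
        co += labels[:, None] == labels[None, :]    # count same-cluster pairs
    distance = 1.0 - co / n_runs                    # agreement -> distance
    condensed = squareform(distance, checks=False)
    return fcluster(linkage(condensed, "average"), k, criterion="maxclust")

rng = np.random.default_rng(1)
print(consensus_cluster(rng.uniform(size=(50, 2)))[:10])
```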


Two: Semantic Distance

One distance measure we would like to
add to the consensus is semantic
distance. How close semantically are two
documents?


How to do this cheaply?

WordNet

  Started by George Miller at Princeton (“The Magical Number Seven, Plus or Minus Two”) in 1985. Funded to study machine translation.

  It is much more than just a dictionary: it is an ontology (in CS, that means a data model) of English.

  It includes relationships such as: hypernym, hyponym, meronym, holonym, synonym, antonym, etc.

Use WordNet to find semantic distance

How close are “dog” and “cat”?

dog:
  sense 1: domestic dog
  sense 2: unattractive girl
  sense 3: lucky man
  sense 4: a cad
  sense 5: hot dog
  sense 6: hinged catch
  sense 7: andiron

hypernym

canine:
  sense 1: tooth
  sense 2: family Canidae

hypernym

carnivore:
  sense 1: meat eater

hyponym

feline:
  sense 1: felid

hyponym

cat:
  sense 1: true cat
  sense 2: guy
  sense 3: spiteful woman
  sense 4: tea
  sense 5: whip
  sense 6: truck
  sense 7: lions
  sense 8: tomography
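The chain above (dog, canine, carnivore, feline, cat) is exactly the shortest hypernym/hyponym path WordNet offers between these senses, so one cheap distance is simply path length. A minimal sketch using NLTK's WordNet interface (the library choice is an assumption):

```python
# Hypothetical sketch: path-based semantic similarity between word senses.
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

dog = wn.synset('dog.n.01')             # sense 1: the domestic dog
cat = wn.synset('cat.n.01')             # sense 1: the true cat

# Similarity is 1 / (shortest hypernym/hyponym path length + 1).
print(dog.path_similarity(cat))           # about 0.2 for the 4-step path
print(dog.lowest_common_hypernyms(cat))   # the shared ancestor, carnivore
```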

Semantic Relationship Graphs

Ultimately we will find graphs of “close word senses” and use them to represent a document.

The Text


Another problem was to make governments strong enough to
prevent internal disorder. In pursuit of this goal, however, rulers
were frustrated by one of the strongest movements of the eleventh
and twelfth centuries: the drive to reform the Church. No
government could operate without the participation of the clergy;
members of the clergy were better educated, more competent as
administrators, and usually more reliable than laymen.
Understandably, the kings and the greater feudal lords wanted to
control the appointment of bishops and abbots in order to create a
corps of capable and loyal public servants. But the reformers
wanted a church that was completely independent of secular
power, a church that would instruct and admonish rulers rather than
serve them. The resulting struggle lasted half a century, from 1075
to 1122.

[6a]
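A hypothetical sketch of how such a graph might be built for a passage like the one above: pick a few content words, look up their noun senses, and connect words whose closest senses fall within some WordNet path-similarity threshold. The word list, the threshold, and the use of networkx are all illustrative assumptions, not the actual system.

```python
# Hypothetical sketch: a small graph of semantically "close" word senses.
import itertools
import networkx as nx
from nltk.corpus import wordnet as wn

words = ["government", "church", "clergy", "bishop", "king", "ruler"]

graph = nx.Graph()
for w1, w2 in itertools.combinations(words, 2):
    senses1, senses2 = wn.synsets(w1, pos='n'), wn.synsets(w2, pos='n')
    if not senses1 or not senses2:
        continue
    best = max((a.path_similarity(b) or 0.0)
               for a in senses1 for b in senses2)
    if best >= 0.2:                       # keep only "close" word pairs
        graph.add_edge(w1, w2, weight=round(best, 2))

print(graph.edges(data=True))             # the passage's sense graph
```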

Three: User Interaction

We want the user to be able to interact with the clustering process in a natural way (that is, without modifying the algorithm).

We do this by allowing the user to establish relationships between documents:

  must-link (these docs go together)

  must-not-link (separate these docs)

Changing the algorithm

By changing the way documents cluster together, the user changes the algorithm (because the constraints he or she establishes must be respected across all the documents), but in a way they can understand.
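One well-known way to respect such constraints without the user touching the algorithm is a COP-KMeans-style check: before a document joins a cluster, every must-link and must-not-link pair is consulted. The function and data names below are illustrative, not the actual system's API.

```python
# Hypothetical sketch of constraint checking during cluster assignment.
def violates(doc, cluster, assignment, must_link, must_not_link):
    """True if putting doc into cluster would break a user constraint."""
    for a, b in must_link:
        other = b if a == doc else a if b == doc else None
        if other in assignment and assignment[other] != cluster:
            return True                   # a linked doc sits in another cluster
    for a, b in must_not_link:
        other = b if a == doc else a if b == doc else None
        if other in assignment and assignment[other] == cluster:
            return True                   # a forbidden doc is already here
    return False

# Example: "d2" may not join cluster 0 because "d1" is already assigned there.
assignment = {"d1": 0}
print(violates("d2", 0, assignment,
               must_link=[], must_not_link=[("d1", "d2")]))   # True
```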

H-Net and Scholarly Discourse in the Digital Age

Opportunity & Challenge
  Searching Large Text Archives

New Approach
  Semantic-Augmented Consensus Clustering

Application
  H-Net Discussion Lists

H-Net

  Humanities and Social Sciences OnLine

  Pioneer, Peer-Edited Discussion Lists

  160 Networks

  600+ Editors

  150,000 Participants

  Global

H-Net Archives

  Scholarly Value
    Current Users
    Future Scholars

  Scale
    1,000,000+ Messages
    3GB of Text

Current Search Capabilities

By
  Date
  Author
  Subject
  Words in Text

What’s missing?
  Multi-Thread
  Multi-List
  Cross-Temporal
  Etc…

Example in H-Net

  Movie Amistad was discussed across H-Net networks
    History, Literature, Film, Teaching, Economics

  Different perspectives

  Over time

Value to H-Net

  Locate related content
    Across time
    Across scholarly communities

  Facilitate interdisciplinary scholarship and teaching

  Synthesize new knowledge in new forms

Unlocking the Potential of Scholarly Communication

  Email and Forums
    Popularity
    Limitations

  Adding depth and breadth while maintaining immediacy


Value of Humanities Technology Research

  Fundamental challenge in computer science

  Humanities research: new insights / new connections

  H-Net provides testbed/testers

  Truly interdisciplinary research



H-Net and Scholarly Discourse in the Digital Age

Contact Information:

  MATRIX: Center for the Humane Arts, Letters, and Social Sciences On-Line

  www.matrix.msu.edu