Artificial Intelligence and the Internet

Edward Brent
University of Missouri-Columbia and Idea Works, Inc.

Theodore Carnahan
Idea Works, Inc.

Overview


Objective: Consider how AI can be (and in many cases is being) used to enhance and transform social research on the Internet.

Framework: The intersection of AI and research issues.

View the Internet as a source of data whose size and rate of growth make it important to automate much of the analysis of data.


Overview (continued)

We discuss a leading AI-based approach, the semantic web, and an alternative paradigmatic approach, along with the strengths and weaknesses of each.

We explore how other AI strategies can be used, including intelligent agents, multi-agent systems, expert systems, semantic networks, natural language understanding, genetic algorithms, neural networks, machine learning, and data mining.

We conclude by considering implications for future research.


Key Features of the Internet



Decentralized


Few or no standards for much of the substantive content


Incredibly diverse information


Massive and growing rapidly


Unstructured data

The Good News About the Internet



A massive flow of data


Digitized


A researcher’s dream

The Bad News



A massive flow of data


Digitized


A researcher’s nightmare

Data Flows


The Internet provides many examples of data flows.


A data flow is an ongoing flux of new information, often from multiple sources, and typically large in volume.

Data flows are the result of ongoing social processes in which information is gathered and/or disseminated by humans for assessment or consumption by others.

Not all data flows are digital, but all flows on the Internet are.

Data flows are increasingly available over the Internet.

Examples of data flows include:

News articles
Published research articles
Email
Medical records
Personnel records
Articles submitted for publication
Research proposals
Arrest records
Birth and death records






Data Flows vs Data Sets


Data flows are fundamentally different from the data sets
with which most social scientists have traditionally worked.






A data set is a collection of data, often collected for a specific purpose and over a specific period of time, then frozen in place.
A data flow is an ongoing flux of new information, with no clear end in sight.

Data sets typically must be created in research projects funded for that purpose, in which relevant data are collected, formatted, cleaned, stored, and analyzed.
Data flows are the result of ongoing social processes in which information is gathered and/or disseminated by humans for assessment or consumption by others.

Data sets are sometimes analyzed only once in the context of the initial study, but are often made available in data archives to other researchers for further analysis.
Data flows often merit continuing analysis, not only of delimited data sets from specific time periods, but as part of ongoing monitoring and control efforts.
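
The contrast between a frozen data set and an ongoing data flow can be sketched in code. The following Python example is illustrative only: the function names, the keyword-counting task, and the sample headlines are assumptions made for this sketch, not part of the original presentation. The data set is analyzed once as a fixed list, while the data flow is consumed one item at a time and yields an updated result after each new item.

```python
from collections import Counter
from typing import Iterable, Iterator


def analyze_data_set(records: list[str]) -> Counter:
    """One-time analysis of a frozen data set: count words across all records."""
    counts: Counter = Counter()
    for text in records:
        counts.update(text.lower().split())
    return counts


def monitor(flow: Iterable[str], keyword: str) -> Iterator[int]:
    """Ongoing analysis of a data flow.

    Yields the running number of records that mention `keyword`,
    updated each time a new record arrives.
    """
    seen = 0
    for text in flow:
        if keyword in text.lower().split():
            seen += 1
        yield seen


# Hypothetical headlines standing in for a news feed.
headlines = [
    "Election results delayed",
    "Markets rally after election",
    "Storm closes schools",
]

snapshot = analyze_data_set(headlines)                 # data set: analyzed once
running = list(monitor(iter(headlines), "election"))   # data flow: updated per item
```

In a real monitoring setting the iterator passed to `monitor` would never be exhausted, which is the sense in which a data flow has "no clear end in sight."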

The Need for Automating Analysis

Together, the tremendous volume and rate of growth of the Internet and the prevalence of ongoing data flows make automating analysis both more important and more cost-effective.

Greater cost savings result from automated analysis with very large data sets.

Ongoing data flows require continuing analysis, and that also makes automation cost-effective.

The Semantic Web


The semantic web is an effort to build into the World Wide Web tags or markers for data along with representations of the semantic meaning of those tags (Berners-Lee and Lassila, 2001; Shadbolt, Hall and Berners-Lee, 2006).

The semantic web will make it possible for computer programs to recognize information of a specific type in any of many different locations on the web and to “understand” the semantic meaning of that information well enough to reason about it.

This will produce interoperability: the ability of different applications and databases to exchange information and to use that information effectively across applications.

Such a web can provide an infrastructure to facilitate and enhance many things, including social science research.


Implementing the Semantic Web

(Each element of contemporary research is followed by its possible implementation in the semantic web.)

Coding scheme
  XML Schema: a standardized set of XML tags used to mark up web pages. For example, research proposals might include tags such as <design>, <sampling_plan>, <hypothesis>, and <findings>.

Coded data
  Web pages marked up with XML (Extensible Markup Language): a general-purpose markup language designed to be readable by humans while at the same time providing metadata tags for various kinds of substantive content that can be easily recognized by computers.

Knowledge representation
  Resource Description Framework (RDF): a general model for expressing knowledge as subject-predicate-object statements about resources. A sampling plan in a research proposal might include these statements:
    Systematic sampling - is a - sampling procedure
    Sampling procedure - is part of - a sampling plan

Theory
  Ontology: a knowledge base of objects, classes of objects, attributes describing those objects, and relationships among objects. An ontology is essentially a formal representation of a theory.

Analysis
  Intelligent agents: software programs capable of navigating to relevant web pages and using information accessible through the semantic web to perform useful functions.
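
To make the markup and knowledge-representation rows concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. Everything specific in it is an assumption for illustration: the `<proposal>` wrapper, the `procedure` attribute, the sample text, and the helper names `extract` and `to_triples` are hypothetical and do not come from any published schema. The tag is written `sampling_plan` because XML element names cannot contain spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical research proposal marked up with the kinds of tags named above.
PROPOSAL = """
<proposal>
  <design>Cross-sectional survey</design>
  <sampling_plan procedure="systematic sampling">Every 10th household</sampling_plan>
  <hypothesis>Internet use is associated with civic participation</hypothesis>
</proposal>
"""


def extract(xml_text: str, tag: str) -> list[str]:
    """Return the text of every element named `tag`, as a simple agent might."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text]


def to_triples(xml_text: str) -> list[tuple[str, str, str]]:
    """Express part of the markup as subject-predicate-object statements."""
    root = ET.fromstring(xml_text)
    triples = []
    for plan in root.iter("sampling_plan"):
        procedure = plan.get("procedure")
        if procedure:
            triples.append((procedure, "is a", "sampling procedure"))
            triples.append(("sampling procedure", "is part of", "sampling plan"))
    return triples


designs = extract(PROPOSAL, "design")
statements = to_triples(PROPOSAL)
```

Because the tags are standardized, the same `extract` call would work on any compliant page, which is the interoperability the semantic web aims for. A real deployment would use RDF tooling rather than plain tuples.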

AI Strategies and the Semantic Web

Several components of the semantic web make use of artificial intelligence (AI) strategies.

(Each semantic web component is followed by the artificial intelligence and related computational strategies it draws on.)

Knowledge representation
  Object-Attribute-Value (O-A-V) triplets commonly used in semantic networks

Theory
  Semantic network

Analysis
  Intelligent agents, expert systems, multi-agent models
  Distributed computing, parallel processing, grid computing
Strengths of the Semantic Web

Fast and efficient to develop
  Most coding is done by web developers one time and used by everyone

Fast and efficient to use
  Intelligent agents can do most of the work with little human intervention
  The structure provided makes it easier for computers to process
  Can take advantage of distributed processing and grid computing

Interoperability
  Many different applications can access and use information from throughout the web

Weaknesses of the Semantic Web (Pragmatic Concerns)

Seeks to impose standardization on a highly decentralized process of web development
  Requires cooperation of many if not all developers
  Imposes the double burden of expressing knowledge for humans and for computers
  How will tens of millions of legacy web sites be retrofitted?
  What alternative procedures will be needed for noncompliant web sites?

Major forms of data on the web are provided by untrained users who are unlikely to be able to mark up content for the semantic web
  E.g., blogs, input to online surveys, emails

Weaknesses of the Semantic Web (Fundamental Concerns)

Assumes there is a single ontology that can be used for all web pages and all users (at least in some domain).

  For example, a standard way to mark up products and prices in commercial web sites could make it possible for intelligent agents to search the Internet for the best price for a particular make and model of car.

This assumption may be inherently flawed for social research for two reasons.

1) Multiple paradigms: What ontology could code web pages from multiple competing paradigms or world views (Kuhn, 1969)?

  If reality is socially constructed, and “beauty is in the eye of the beholder,” how can a single ontology represent such diverse views?

2) Competing interests: What if developers of web pages have political or economic interests at odds with some of the viewers of those web pages?




Paradigmatic Approach


We describe an alternative approach to the semantic web, one that we believe may be more suitable for many social science research applications.

Recognizes there may be multiple incompatible views of data

Data structure must be imposed on data dynamically by the researcher as part of the research process (in contrast to the semantic web, which seeks to build an infrastructure of web pages with data structure pre-coded by web developers)
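
The idea of imposing structure dynamically, once per paradigm, can be sketched as follows. This Python example is a deliberately simple illustration: the two paradigms, their codes, the indicator terms, and the sample sentence are all hypothetical, and keyword matching stands in for the far richer coding a real researcher would use.

```python
import re

# Hypothetical codebooks: each paradigm maps its own codes to indicator terms.
CODEBOOKS = {
    "conflict": {
        "exploitation": {"layoffs", "wage"},
        "resistance": {"strike", "protest"},
    },
    "functionalist": {
        "dysfunction": {"layoffs", "strike"},
        "adaptation": {"retraining", "negotiation"},
    },
}


def code_text(text: str, paradigm: str) -> set[str]:
    """Apply one paradigm's codebook to the text and return the matching codes."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {code for code, terms in CODEBOOKS[paradigm].items() if words & terms}


report = "Workers called a strike after the layoffs were announced."
codes_by_paradigm = {paradigm: code_text(report, paradigm) for paradigm in CODEBOOKS}
```

The same sentence receives different codes under each codebook, and nothing on the web page itself has to change. Adding a third paradigm means adding a codebook, not re-marking the source.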

Paradigmatic Approach (continued)

Relies heavily on natural language processing (NLP) strategies to code data.

NLP capabilities are not already developed for many of these research areas and must be developed.

Those NLP procedures are often developed and refined using machine learning strategies.

We will compare the paradigmatic approach to traditional research strategies and the semantic web for important research tasks.

Example Areas Illustrating the Paradigmatic Approach

Event analysis in international relations

Essay grading

Tracking news reports on social issues or for clients
  E.g., campaigns, corporations, press agents

Each of these areas illustrates significant data flows.

These areas and programs within them illustrate elements of the paradigmatic approach.

Most do not yet employ all the strategies.

Essay Grading


These programs allow students to submit essays by computer; a program then examines each essay and computes a score for the student.

Some of the programs also provide feedback to help students improve.

These programs are becoming more common for standardized assessment tests and classroom applications.

Examples of programs:

SAGrader™
E-rater®
C-rater®
Intelligent Essay Assessor®
Criterion®

These programs illustrate large ongoing data flows and generally reflect the paradigmatic approach.

Digitizing Data

Task: Digitizing
  Traditional Research: Data from the Internet are digitized by web page developers. Other data must be digitized by the researcher or analyzed manually. This can be a huge hurdle.
  Semantic Web: Data are digitized by web page developers.
  Paradigmatic Approach: Data are digitized by web page developers.

The first step in any computer analysis must be converting relevant data to digital form, where it is expressed as a stream of digits that can be transmitted and manipulated by computers.

The semantic web and paradigmatic approaches both rely on web page developers to digitize information. This gives them a distinct advantage over traditional research, where digitizing data can be a major hurdle.

Essay Grading: Digitizing Data

Digitizing
  Papers are replaced with digital submissions
  SAGrader, for example, has students submit their papers over the Internet using standard web browsers.

Digitizing is often still a major hurdle limiting use
  Access issues
  Security concerns

Data Conversions

Task: Converted data
  Traditional Research: Digitized data suitable for web delivery for human interpretation
  Semantic Web: Digitized data suitable for web delivery for human interpretation
  Paradigmatic Approach: Digitized data suitable for web delivery and machine interpretation

Task: Converting
  Traditional Research: No further data conversions required once digitized by web page author
  Semantic Web: No further data conversions required once digitized by web page author
  Paradigmatic Approach: Further conversion sometimes required by researcher (e.g., OCR, speech recognition, handwriting recognition)

Essay Grading: Converting Data

Data conversion
  Where essays are submitted on paper, optical character recognition (OCR) or handwriting recognition programs must be used to convert them to digitized text.
  Standardized testing programs often face this issue.

Encoding Data

Task: Encoding data
  Traditional Research: Encoding done by researcher (often with use of qualitative or quantitative programs)
  Semantic Web: Each web page developer must encode a small or moderate amount of data
  Paradigmatic Approach: Researchers must encode massive amounts of data. Encoding is automated using NLP strategies (including statistical, linguistic, rule-based expert systems, and combined strategies) and machine learning (unsupervised learning, supervised learning, neural networks, genetic algorithms, data mining)

Task: Coded data
  Traditional Research: Coded data based on coding rubric
  Semantic Web: XML markup based on a standard ontology. An XML schema indicates the basic structure expected for a web page
  Paradigmatic Approach: XML markup based on the ontology for that paradigm. An XML schema indicates the basic structure expected for a web page
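
To illustrate what automated encoding through supervised learning might look like at the smallest possible scale, here is a Python sketch of a multinomial naive Bayes classifier with add-one smoothing. It is one of many techniques that fall under the strategies listed above and is chosen here only for brevity. The class name `NaiveBayesCoder`, the two codes, and the training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a real system would do much more."""
    return text.lower().split()


class NaiveBayesCoder:
    """Tiny multinomial naive Bayes classifier with add-one smoothing."""

    def fit(self, texts: list[str], codes: list[str]) -> "NaiveBayesCoder":
        self.code_counts = Counter(codes)
        self.word_counts: dict[str, Counter] = defaultdict(Counter)
        for text, code in zip(texts, codes):
            self.word_counts[code].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.code_counts.values())
        best_code, best_score = "", -math.inf
        for code, n_docs in self.code_counts.items():
            counts = self.word_counts[code]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior for this code
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Hypothetical hand-coded examples a researcher might supply.
train_texts = [
    "police arrest protesters downtown",
    "court sentences man for theft",
    "hospital opens new clinic",
    "doctors warn of flu outbreak",
]
train_codes = ["crime", "crime", "health", "health"]

coder = NaiveBayesCoder().fit(train_texts, train_codes)
predicted = coder.predict("flu clinic opens")
```

The researcher codes a small sample by hand, and the trained model then applies those codes to the rest of the flow. Retraining on a different paradigm's hand-coded sample yields a different coder for the same texts.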

Essay Grading: Coding

Essay grading programs employ a wide array of strategies for recognizing important features in essays.

Intelligent Essay Assessor (IEA) employs a purely statistical approach, latent semantic analysis (LSA).
  This approach treats essays like a “bag of words,” using a matrix of word frequencies by essays and a factor-analytic technique (singular value decomposition) to find an underlying semantic space. It then locates each essay in that space and assesses how closely it matches essays with known scores.

E-rater uses a combination of statistical and linguistic approaches.
  It uses syntactic, discourse structure, and content features to predict scores for essays after the program has been trained to match human coders.

SAGrader uses a strategy that blends linguistic, statistical, and AI approaches.
  It uses fuzzy logic to detect key concepts in student papers and a semantic network to represent the semantic information that should be present in good essays.

All of these programs require learning before they can be used to grade essays in a specific domain.
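
The "bag of words" idea behind the statistical approach can be sketched in a few lines of Python. This is a simplified stand-in and not latent semantic analysis itself: it compares raw word-frequency vectors with cosine similarity and omits the dimensionality-reduction step that LSA uses to find an underlying semantic space. It is also not the method of any named product. The function names and the sample essays are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent an essay as word frequencies, ignoring word order."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_score(essay: str, scored: list[tuple[str, float]]) -> float:
    """Assign the score of the most similar previously scored essay."""
    target = bag_of_words(essay)
    _, best_score = max(scored, key=lambda pair: cosine(target, bag_of_words(pair[0])))
    return best_score


# Hypothetical training essays with human-assigned scores.
scored_essays = [
    ("socialization is the process of learning norms and values", 5.0),
    ("sociology is hard and i did not study", 1.0),
]
new_score = predict_score("socialization means learning the norms of a group", scored_essays)
```

The scored essays play the role of the training data that all of these programs require before they can grade in a specific domain.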



Knowledge

Task: Theory
  Traditional Research and Semantic Web: A single shared world-view or objective reality
  Paradigmatic Approach: Multiple paradigms

Task: Knowledge
  Traditional Research: Coding scheme implemented with a codebook (often imperfect)
  Semantic Web: Ontology (knowledge base developed by web page developers and shared as a standard), implemented with RDF and ontological languages
  Paradigmatic Approach: Multiple ontologies, one for each paradigm (developed by researchers and shared within the paradigm), implemented with RDF and ontological languages

Essay Grading: Knowledge

Most essay grading programs have very little in the way of a representation of theory or knowledge.

This is probably because they are often designed specifically for grading essays and are not meant to be used for other purposes requiring theory, such as social science research.

For example, C-rater, a program that emphasizes semantic content in essays, has no representation of semantic content other than as desirable features for the essay.

The exception is SAGrader.

SAGrader employs technologies developed in a qualitative analysis program, Qualrus. Hence, SAGrader uses a semantic network to explicitly represent and reason about the knowledge or theory.
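
A rough sense of how a semantic network of expected knowledge could drive both scoring and feedback is given by the Python sketch below. It is not SAGrader's actual algorithm: approximate string matching from the standard library's `difflib` stands in for fuzzy concept detection, and the expected relationships, the threshold of 0.8, and the sample essay are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical semantic network for one essay question:
# relationships among concepts that a good answer should mention.
EXPECTED = [
    ("primary group", "example", "family"),
    ("secondary group", "example", "coworkers"),
]


def mentions(concept: str, text: str, threshold: float = 0.8) -> bool:
    """Approximate check: does any run of words in `text` closely match `concept`?"""
    words = text.lower().split()
    size = len(concept.split())
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        if SequenceMatcher(None, concept, window).ratio() >= threshold:
            return True
    return False


def grade(essay: str) -> tuple[int, list[str]]:
    """Score one point per expected relationship whose two concepts both appear.

    Also returns feedback naming each relationship that was not found.
    """
    score, feedback = 0, []
    for subject, relation, obj in EXPECTED:
        if mentions(subject, essay) and mentions(obj, essay):
            score += 1
        else:
            feedback.append(f"Missing: {subject} ({relation}: {obj})")
    return score, feedback


essay = "A primery group includes the family while work ties are weaker"
score, feedback = grade(essay)
```

The misspelled "primery group" is still recognized, and the feedback list names the relationship the student left out. Because the expected knowledge is represented explicitly, the same network can explain a score as well as compute it.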

Analysis

Task: Analysis
  Traditional Research: Analysis by hand, perhaps with the help of qualitative or quantitative programs
  Semantic Web: Intelligent agents
  Paradigmatic Approach: Intelligent agents

The semantic web and paradigmatic approaches can take similar approaches to analysis.

Essay Grading: Analysis

All programs produce scores, though the precision and complexity of the scores vary.

Some produce explanations.

Most of these essay grading programs simply perform a one-time analysis (grading) of papers. However, some of them, such as SAGrader, provide for ongoing monitoring of student performance as students revise and resubmit their papers.

Since essays presented to the programs are already converted into standard formats and are submitted to a central site for processing, there is no need for the search and retrieval capabilities of intelligent agents.




Advantages of Paradigmatic Approach

Suitable for multiple-paradigm fields

Suitable for contested issues

Does not require as much infrastructure development on the web

Can be used for new views requiring different codes with little lag time

Disadvantages of Paradigmatic Approach

Relies heavily on NLP technologies that are still evolving
  May not be feasible in some or all circumstances
  Requires extensive machine learning

Often requires additional data conversion for automated analysis

Requires individual web pages to be coded once for each paradigm rather than a single time, hence increasing costs. (However, automating the coding keeps these costs manageable.)

Current NLP programs are limited to problems of restricted scope. They are better characterized as special-purpose than as general-purpose NLP programs.

Discussion and Conclusions


Both the semantic web and paradigmatic approaches have advantages and disadvantages.

Codes on the semantic web could facilitate coding by paradigmatic-approach programs.

Where there is much consensus, the single coding for the semantic web could be sufficient.

While the infrastructure for the semantic web is still in development, the paradigmatic approach could facilitate analysis of legacy data.

The paradigmatic approach could be used to build out the infrastructure for the semantic web.