Text mining and the Semantic Web



Dr Diana Maynard

NLP Group

Department of Computer Science

University of Sheffield


http://nlp.shef.ac.uk

University of Manchester


15 March 2005



Structure of this lecture


Text Mining and the Semantic Web


Text Mining Components / Methods


Information Extraction


Evaluation


Visualisation


Summary

Introduction to Text Mining and the Semantic Web


What is Text Mining?


Text mining is about knowledge discovery from
large collections of unstructured text.


It’s not the same as data mining, which is more
about discovering patterns in structured data
stored in databases.


Similar techniques are sometimes used; however,
text mining has many additional constraints caused
by the unstructured nature of the text and the use
of natural language.


Information extraction (IE) is a major component
of text mining.


IE is about extracting facts and structured
information from unstructured text.


Challenge of the Semantic Web


The Semantic Web requires machine
processable, repurposable data to complement
hypertext


Such metadata can be divided into two types of
information: explicit and implicit. IE is mainly
concerned with implicit (semantic) metadata.


More on this later…





Text mining components and methods


Text mining stages



Document selection and filtering (IR
techniques)


Document pre-processing (NLP techniques)


Document processing (NLP / ML /
statistical techniques)



Stages of document processing


Document selection involves identification and
retrieval of potentially relevant documents from a
large set (e.g. the web) in order to reduce the
search space. Standard or semantically-enhanced IR
techniques can be used for this.


Document pre-processing involves cleaning and
preparing the documents, e.g. removal of
extraneous information, error correction, spelling
normalisation, tokenisation, POS tagging, etc.


Document processing consists mainly of
information extraction


For the Semantic Web, this is realised in terms
of metadata extraction
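
The three stages above can be sketched as a small pipeline. This is only an illustration, not the lecture's implementation: the function names (select_documents, preprocess, extract_candidates), the keyword filter and the capitalised-word heuristic are all assumptions made for this example. Real systems would use an IR engine for selection and proper NLP components for the later stages.

```python
import re

def select_documents(docs, keywords):
    """Stage 1 (IR): keep only documents that mention at least one keyword."""
    wanted = {k.lower() for k in keywords}
    return [d for d in docs if wanted & set(re.findall(r"\w+", d.lower()))]

def preprocess(doc):
    """Stage 2 (NLP): strip markup remnants, normalise whitespace, tokenise."""
    text = re.sub(r"<[^>]+>", " ", doc)
    text = re.sub(r"\s+", " ", text).strip()
    return re.findall(r"\w+|[^\w\s]", text)

def extract_candidates(tokens):
    """Stage 3 (IE, toy heuristic): capitalised tokens that do not start a sentence."""
    found = []
    for i, tok in enumerate(tokens):
        starts_sentence = i == 0 or tokens[i - 1] in {".", "!", "?"}
        if tok[0].isupper() and not starts_sentence:
            found.append(tok)
    return found

# Hypothetical two-document collection.
docs = [
    "<p>The Queen visited Sheffield in March.</p>",
    "<p>Stock prices fell sharply today.</p>",
]
selected = select_documents(docs, ["queen", "visited"])
tokens = preprocess(selected[0])
candidates = extract_candidates(tokens)
print(candidates)
```

The point of the sketch is the ordering: selection shrinks the search space before the more expensive pre-processing and extraction steps run.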


Metadata extraction


Metadata extraction consists of two types:


Explicit metadata extraction involves
information describing the document, such as
that contained in the header information of
HTML documents (titles, abstracts, authors,
creation date, etc.)


Implicit metadata extraction involves semantic
information deduced from the material itself, i.e.
endogenous information such as names of
entities and relations contained in the text. This
essentially involves Information Extraction
techniques, often with the help of an ontology.
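
Explicit metadata extraction is the simpler of the two and can be sketched with Python's standard html.parser module. The class name HeaderMetadataParser and the sample HTML page are assumptions for illustration; the sketch only reads the title element and name/content meta tags.

```python
from html.parser import HTMLParser

class HeaderMetadataParser(HTMLParser):
    """Collects <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so accumulate it.
            self.metadata["title"] = self.metadata.get("title", "") + data

# Hypothetical page used only for this example.
html = """<html><head>
<title>Text mining and the Semantic Web</title>
<meta name="author" content="Diana Maynard">
<meta name="date" content="2005-03-15">
</head><body><p>Lecture notes.</p></body></html>"""

parser = HeaderMetadataParser()
parser.feed(html)
explicit_metadata = parser.metadata
print(explicit_metadata)
```

Implicit metadata cannot be read off the markup in this way; it has to be deduced from the body text, which is where the IE techniques in the next section come in.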

Information Extraction (IE)



IE is not IR

IE pulls facts and structured information from the
content of large text collections. You analyse the
facts.

IR pulls documents from large text collections
(usually the Web) in response to specific keywords
or queries. You analyse the documents.




IE for Document Access


With traditional query engines, getting the
facts can be hard and slow


Where has the Queen visited in the last
year?


Which places on the East Coast of the
US have had cases of West Nile Virus?


Which search terms would you use to get this
kind of information?


How can you specify you want someone’s
home page?


IE returns information in a structured way


IR returns documents containing the relevant
information somewhere (if you’re lucky)




IE as an alternative to IR


IE returns knowledge at a much deeper
level than traditional IR


Constructing a database through IE and
linking it back to the documents can
provide a valuable alternative search tool.


Even if results are not always accurate,
they can be valuable if linked back to the
original text
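
A minimal sketch of such a database, using Python's built-in sqlite3 module, might look like this. The table layout, the sample document and the "extracted" entity list are assumptions for illustration; a real system would fill the table from an IE pipeline. Each fact stores character offsets, so a query result can always be traced back to the exact span of original text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE facts (
           entity TEXT, entity_type TEXT,
           doc_id TEXT, start_offset INTEGER, end_offset INTEGER)"""
)

documents = {"doc1": "The Queen visited Sheffield in March."}

# Hypothetical extractor output: (entity type, surface string) pairs per document.
extracted = {"doc1": [("Person", "The Queen"), ("Location", "Sheffield")]}

for doc_id, entities in extracted.items():
    text = documents[doc_id]
    for entity_type, surface in entities:
        start = text.find(surface)
        if start == -1:
            continue  # skip anything that cannot be located in the source text
        conn.execute(
            "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
            (surface, entity_type, doc_id, start, start + len(surface)),
        )

def find_locations(connection):
    """Return each Location with the exact text span it came from."""
    rows = connection.execute(
        "SELECT entity, doc_id, start_offset, end_offset FROM facts "
        "WHERE entity_type = 'Location'"
    ).fetchall()
    return [(e, d, documents[d][s:t]) for e, d, s, t in rows]

locations = find_locations(conn)
print(locations)
```

Because every row points back into a document, a user can check a doubtful result against its source, which is why imperfect extraction can still be useful.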


Some example applications


HaSIE


KIM


Threat Trackers


HaSIE


Application developed by University of
Sheffield, which aims to find out how
companies report about health and safety
information


Answers questions such as:

“How many members of staff died or had accidents
in the last year?”

“Is there anyone responsible for health and
safety?”

“What measures have been put in place to
improve health and safety in the workplace?”


HaSIE


Identification of such information is too
time-consuming and arduous to be done manually


IR systems can’t cope with this because
they return whole documents, which could
be hundreds of pages


System identifies relevant sections of each
document, pulls out sentences about
health and safety issues, and populates a
database with relevant information


KIM


KIM is a software platform developed by
Ontotext for semantic annotation of text.


KIM performs automatic ontology
population and semantic annotation for
Semantic Web and KM applications


Indexing and retrieval (an IE-enhanced
search technology)


Query and exploration of formal
knowledge



KIM

Ontotext’s KIM query and results



Threat tracker


Application developed by Alias-I which finds and
relates information in documents


Intended for use by Information Analysts who
use unstructured news feeds and standing
collections as sources


Used by DARPA for tracking possible
information about terrorists etc.


Identification of entities, aliases, relations etc.
enables you to build up chains of related people
and things



What is Named Entity Recognition?


Identification of proper names in texts, and
their classification into a set of predefined
categories of interest


Persons


Organisations (companies, government
organisations, committees, etc)


Locations (cities, countries, rivers, etc)


Date and time expressions


Various other types as appropriate


Why is NE important?


NE provides a foundation from which to
build more complex IE systems


Relations between NEs can provide
tracking, ontological information and
scenario building


Tracking (co-reference) “Dr Head, John, he”


Ontologies “Manchester, CT”


Scenario “Dr Head became the new
director of Shiny Rockets Corp”



Two kinds of approaches

Knowledge Engineering



rule based


developed by experienced
language engineers


make use of human
intuition


require only small amount
of training data


development can be very
time consuming


some changes may be
hard to accommodate

Learning Systems



use statistics or other
machine learning


developers do not need
LE expertise


require large amounts of
annotated training data


some changes may require re-annotation of
the entire training corpus


Typical NE pipeline


Pre-processing (tokenisation, sentence
splitting, morphological analysis, POS
tagging)


Entity finding (gazetteer lookup, NE
grammars)


Coreference (alias finding, orthographic
coreference etc.)


Export to database / XML
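
The entity finding, coreference and export steps of this pipeline can be sketched in a few lines of Python. This is a toy illustration, not ANNIE or any real system: the gazetteer contents, the single title-plus-name grammar rule, the surname-matching alias heuristic and the XML layout are all assumptions, and the pre-processing stage is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

GAZETTEER = {"Sheffield": "Location", "Manchester": "Location"}
TITLES = ("Dr", "Mr", "Mrs", "Prof")

def find_entities(text):
    """Entity finding: gazetteer lookup plus one hand-written grammar rule."""
    entities = []
    for name, ne_type in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append({"text": name, "type": ne_type, "start": m.start()})
    # Grammar rule: a title followed by a capitalised word is a Person.
    pattern = r"\b(?:" + "|".join(TITLES) + r")\.? [A-Z][a-z]+"
    for m in re.finditer(pattern, text):
        entities.append({"text": m.group(), "type": "Person", "start": m.start()})
    return sorted(entities, key=lambda e: e["start"])

def add_aliases(text, entities):
    """Orthographic coreference: a bare surname refers back to the titled name."""
    result = list(entities)
    for person in [e for e in entities if e["type"] == "Person"]:
        surname = person["text"].split()[-1]
        end = person["start"] + len(person["text"])
        for m in re.finditer(r"\b" + re.escape(surname) + r"\b", text):
            if not person["start"] <= m.start() < end:
                result.append({"text": surname, "type": "Person",
                               "start": m.start(), "alias_of": person["text"]})
    return sorted(result, key=lambda e: e["start"])

def to_xml(entities):
    """Export: one <entity> element per annotation."""
    root = ET.Element("entities")
    for e in entities:
        attrs = {k: str(v) for k, v in e.items() if k != "text"}
        ET.SubElement(root, "entity", attrs).text = e["text"]
    return ET.tostring(root, encoding="unicode")

text = "Dr Head moved to Manchester. Head now works in Sheffield."
entities = add_aliases(text, find_entities(text))
xml_output = to_xml(entities)
print(xml_output)
```

Here the second mention "Head" is recognised only because the coreference step links it to "Dr Head"; neither the gazetteer nor the grammar rule would find it alone.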



GATE and ANNIE


GATE (General Architecture for Text
Engineering) is a framework for language
processing


ANNIE (A Nearly New Information Extraction
system) is a suite of language processing tools,
which provides NE recognition

GATE also includes:


plugins for language processing, e.g. parsers,
machine learning tools, stemmers, IR tools, IE
components for various languages etc.


tools for visualising and manipulating ontologies


ontology-based information extraction tools


evaluation and benchmarking tools


Information Extraction for the Semantic Web


Traditional IE is based on a flat structure, e.g.
recognising Person, Location, Organisation,
Date, Time etc.


For the Semantic Web, we need information in a
hierarchical structure


Idea is that we attach semantic metadata to the
documents, pointing to concepts in an ontology


Information can be exported as an ontology
annotated with instances, or as text annotated
with links to the ontology
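
The two export options can be illustrated with a toy ontology. The concept hierarchy, the instance lexicon and the function names below are assumptions for this sketch; a real system would use an ontology language such as RDF or OWL and a proper IE component rather than string lookup.

```python
# Hypothetical mini-ontology: each concept maps to its parent (None marks the root).
ONTOLOGY = {
    "Entity": None,
    "Location": "Entity",
    "City": "Location",
    "Country": "Location",
    "Person": "Entity",
}

# Hypothetical instance lexicon linking surface strings to the most specific concept.
INSTANCES = {"Sheffield": "City", "UK": "Country"}

def ancestors(concept):
    """Return the path from a concept up to the ontology root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = ONTOLOGY[concept]
    return path

def annotate(text):
    """Option 1: text annotated with links to the ontology."""
    annotations = []
    for surface, concept in INSTANCES.items():
        start = text.find(surface)
        if start != -1:
            annotations.append({
                "text": surface, "start": start,
                "concept": concept, "path": ancestors(concept),
            })
    return annotations

def populate(annotations):
    """Option 2: ontology annotated with instances (concept -> instances)."""
    populated = {}
    for a in annotations:
        populated.setdefault(a["concept"], []).append(a["text"])
    return populated

annotations = annotate("The lecture was given by a researcher from Sheffield, UK.")
populated = populate(annotations)
print(populated)
```

Unlike a flat Location tag, the path attached to each annotation records that Sheffield is a City, that a City is a Location, and so on up the hierarchy.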


Richer NE Tagging


Attachment of
instances in the text to
concepts in the
domain ontology


Disambiguation of
instances, e.g.
Cambridge, MA vs
Cambridge, UK
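
One simple way to approach this kind of disambiguation is to compare the words around a mention with cue words for each candidate instance. The sketch below assumes hand-picked cue sets and a plain overlap count; both are illustrative choices, not the method used by any particular system.

```python
import re

# Hypothetical candidate senses with hand-picked context cues.
CANDIDATES = {
    "Cambridge": {
        "Cambridge, MA": {"ma", "massachusetts", "boston", "mit", "harvard"},
        "Cambridge, UK": {"uk", "england", "cambridgeshire", "cam", "london"},
    }
}

def disambiguate(mention, context):
    """Pick the sense whose cue words overlap most with the context.

    Returns None when no cue is present, rather than guessing.
    """
    words = set(re.findall(r"\w+", context.lower()))
    scores = {sense: len(cues & words)
              for sense, cues in CANDIDATES[mention].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

us = disambiguate("Cambridge", "She studied at MIT in Cambridge, near Boston.")
uk = disambiguate("Cambridge", "He took the train from London to Cambridge.")
unknown = disambiguate("Cambridge", "Cambridge is a pleasant place.")
print(us, uk, unknown)
```

Returning None when the context gives no evidence keeps the annotation honest: an unresolved mention can be left attached to the more general concept instead of a wrong instance.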


Magpie


Developed by the Open University


Plugin for standard web browser


Automatically associates an ontology-based
semantic layer to web resources, allowing
relevant services to be linked


Provides means for a structured and informed
exploration of the web resources


e.g. looking at a list of publications, we can find
information about an author such as projects
they work on, other people they work with, etc.


MAGPIE in action


Evaluation


Evaluation metrics and tools


Evaluation metrics mathematically define how to
measure the system’s performance against a
human-annotated gold standard


Scoring program implements the metric and
provides performance measures


for each document and over the entire corpus


for each type of NE


may also evaluate changes over time


A gold standard reference set also needs to be
provided


this may be time-consuming to produce


Visualisation tools show the results graphically
and enable easy comparison




Methods of evaluation


Traditional IE is evaluated in terms of Precision
and Recall


Precision - how accurate were the answers the
system produced?

correct answers / answers produced

Recall - how good was the system at finding
everything it should have found?

correct answers / total possible correct answers



There is usually a tradeoff between precision
and recall, so a weighted harmonic mean of the two
(F-measure) is generally also used.
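
These definitions translate directly into code. The sketch below assumes annotations are represented as (start, end, type) tuples and counts only exact matches as correct; the sample gold and predicted sets are made up for illustration. The F-measure used is the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to the harmonic mean of precision and recall when beta is 1.

```python
def score(gold, predicted, beta=1.0):
    """Precision, recall and F-measure over sets of (start, end, type) annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Hypothetical annotations: the system finds 3 entities, 2 of them correct,
# and misses 2 of the 4 gold-standard entities (one mistyped, one not found).
gold = {(0, 7, "Person"), (17, 27, "Location"), (47, 56, "Location"), (60, 64, "Date")}
predicted = {(0, 7, "Person"), (17, 27, "Location"), (47, 56, "Organisation")}

precision, recall, f1 = score(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Raising beta above 1 weights recall more heavily, and lowering it below 1 favours precision, which is how the tradeoff can be tuned to the application.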



GATE AnnotationDiff Tool


Metrics for Richer IE


Precision and Recall are not sufficient for
ontology-based IE, because the distinction
between right and wrong is less obvious


Recognising a Person as a Location is clearly
wrong, but recognising a Research Assistant as
a Lecturer is not so wrong


Similarity metrics also need to be integrated, so
that a wrong answer scores higher the closer it is
in the hierarchy to the correct concept


Also possible is a cost-based approach, where
different weights can be given to each concept in
the hierarchy, and to different types of error, and
combined to form a single score
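
A hierarchy-aware similarity score of this kind can be sketched as follows. The concept hierarchy and the formula, 1 / (1 + number of edges between the two concepts), are assumptions chosen for illustration; they are not the specific metric used in the lecture's evaluation tools.

```python
# Hypothetical concept hierarchy: child -> parent (None marks the root).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "ResearchAssistant": "Academic",
}

def path_to_root(concept):
    """List the concept and all of its ancestors, most specific first."""
    path = [concept]
    while PARENT[concept] is not None:
        concept = PARENT[concept]
        path.append(concept)
    return path

def similarity(gold, predicted):
    """1.0 for an exact match, falling towards 0 as concepts get further apart."""
    gold_path, pred_path = path_to_root(gold), path_to_root(predicted)
    # Lowest common ancestor: first concept on the gold path shared by both.
    common = next(c for c in gold_path if c in pred_path)
    distance = gold_path.index(common) + pred_path.index(common)
    return 1.0 / (1.0 + distance)

near_miss = similarity("Lecturer", "ResearchAssistant")
far_miss = similarity("Lecturer", "Location")
exact = similarity("Lecturer", "Lecturer")
print(exact, near_miss, far_miss)
```

With this scoring, tagging a Lecturer as a Research Assistant earns partial credit, while tagging a Lecturer as a Location earns less, matching the intuition on the slide.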


Visualisation of Results



Visualisation of Results


Cluster Map example


Traditionally used to show documents classified
according to topic


Here it shows instances classified according to
concept


Enables analysis, comparison and querying of
results


Examples here created by Marta Sabou (Free
University of Amsterdam) using Aduna software



The principle


Venn Diagrams

Documents classified according to topic


Jobs by region

Instances classified by concept


Concept distribution

Shows the relative importance of different concepts


Correct and incorrect instances attached to concepts


Summary


Introduction to text mining and the
semantic web


How traditional information extraction
techniques, including visualisation and
evaluation, can be extended to deal with
complexity of the Semantic Web


How text mining can help the progression
of the Semantic Web



Research questions


Automatic annotation tools are currently
mainly domain and ontology-dependent,
and work best on a small scale


Tools designed for large scale applications
lose out on accuracy


Ontology population works best when the
ontology already exists, but how do we
ensure accurate ontology generation?


Need large scale evaluation programs


Some useful links


NaCTeM (National Centre for Text Mining)


http://www.nactem.ac.uk


GATE


http://gate.ac.uk


KIM

http://www.ontotext.com/kim/


h-TechSight

http://www.h-techsight.org


Magpie

http://www.kmi.open.ac.uk/projects/magpie