Domain-specific Web Corpora and their Applications

12 July 2002

Colloquium on Applications of Natural Language Corpora, Saarland University

Domain-specific Web Corpora and their Applications

Gregor Erbach

Saarland University

Project COLLATE

(funding: BMBF 01 IN A01 B)



Outline

Part I: Web Corpora

Part II: Applications of Web Corpora

Part III: LT-World Web Corpus

Part IV: Research in COLLATE


Part I: Web Corpora

1. Formal Properties of the Web
2. Web Corpus
3. Document and Hyperlink Database
4. TREC web track


Formal Properties of the Web


Hypertext/Hypermedia


Directed graph with cycles


Edges = hyperlinks


Nodes = documents ???


Nodes often have internal tree structure (HTML, XML)
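
The graph view above can be made concrete with a small sketch. It assumes that nodes are identified by URL strings and that hyperlinks arrive as (source, target) pairs; the function names `build_web_graph` and `has_cycle` are illustrative and not part of the talk.

```python
from collections import defaultdict


def build_web_graph(links):
    """Build an adjacency mapping from (source_url, target_url) pairs."""
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)
        graph[target]  # touching the key makes pure link targets appear as nodes
    return dict(graph)


def has_cycle(graph):
    """Detect a directed cycle with an iterative three-colour depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    for start in graph:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if colour[nxt] == GREY:
                    return True  # back edge to a node on the current path
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                colour[node] = BLACK  # all successors explored
                stack.pop()
    return False
```

A page that links to itself, or two pages that link to each other, already make the graph cyclic, which is why tree-oriented traversal code cannot be used on the web unchanged.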





Web Corpus

A web corpus consists of


a database of documents


a database of hyperlinks


Document Database

Information for each document:


URL/URN


Full Text (possibly with linguistic annotation such as POS, named entities, phrases)


Full Text Index


Metadata


Author, Language, Date, MIME type … (Dublin Core)


Category, Abstract, Keywords, Type of Page …
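
One possible in-memory shape for such a document record is sketched below. The class name `DocumentRecord`, its field names and the helper `build_full_text_index` are assumptions made for illustration; the slide only lists the kinds of information to be stored.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    url: str
    full_text: str
    # Metadata as key/value pairs, e.g. Dublin Core style keys such as
    # "creator", "language", "date", "format" (assumed naming).
    metadata: dict = field(default_factory=dict)
    # Optional linguistic annotation as (start, end, label) character spans.
    annotations: list = field(default_factory=list)


def build_full_text_index(documents):
    """Map each lower-cased token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for doc in documents:
        for token in re.findall(r"\w+", doc.full_text.lower()):
            index[token].add(doc.url)
    return dict(index)
```

A real corpus would keep these records in a database and use a proper indexing engine; the sketch only shows how the listed fields relate to each other.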


Fields of Hyperlink Database


source anchor URL


source anchor position on web page (percentage)


source anchor position in document structure (HTML element path)


source anchor type (text or image)


source anchor text and context


target anchor URL


target anchor position on web page


target anchor MIME type
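
Several of these fields can be filled directly while parsing a page. The sketch below uses Python's standard `html.parser` to collect the target URL, the HTML element path, the anchor type and the anchor text. The names `HyperlinkRecord` and `AnchorExtractor` are illustrative assumptions, and the percentage position, the anchor context and the target MIME type are left out because they need page layout or a request to the target.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements without an end tag; they must not be pushed on the element stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class HyperlinkRecord:
    source_url: str
    target_url: str
    element_path: str  # source anchor position in document structure
    anchor_type: str   # "text" or "image"
    anchor_text: str


class AnchorExtractor(HTMLParser):
    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.stack = []       # currently open elements
        self.links = []       # finished HyperlinkRecord objects
        self._current = None  # record under construction while inside <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)
        if tag == "a" and attrs.get("href"):
            self._current = HyperlinkRecord(
                source_url=self.source_url,
                target_url=urljoin(self.source_url, attrs["href"]),
                element_path="/".join(self.stack),
                anchor_type="text",
                anchor_text="",
            )
        elif tag == "img" and self._current is not None:
            self._current.anchor_type = "image"
            self._current.anchor_text += attrs.get("alt") or ""

    def handle_data(self, data):
        if self._current is not None:
            self._current.anchor_text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Normalise whitespace before storing the anchor text.
            self._current.anchor_text = " ".join(self._current.anchor_text.split())
            self.links.append(self._current)
            self._current = None
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed elements.
            while self.stack.pop() != tag:
                pass
```

Usage: create `AnchorExtractor(page_url)`, call `feed(html)` with the page source, and read the records from `links`.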


Derived Properties of Hyperlinks


Same document?


Same server?


Same 2nd/3rd level domain?


Ascending or descending in directory structure?


Source is within a list of links


Navigation link (up, previous, next …)
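
Most of these derived properties follow from the two URLs alone. The sketch below assumes absolute URLs and approximates the 2nd/3rd level domain by the last two or three labels of the host name, which is a simplification (it is wrong for suffixes such as `co.uk`). The function names are illustrative, and the two properties that need page content (list of links, navigation link) are not covered.

```python
import posixpath
from urllib.parse import urldefrag, urlsplit


def last_labels(host, levels):
    """Naive domain approximation: the last `levels` labels of the host name."""
    return ".".join(host.lower().split(".")[-levels:])


def is_ancestor_dir(ancestor, directory):
    """True if `ancestor` is a proper ancestor directory of `directory`."""
    return directory != ancestor and directory.startswith(ancestor.rstrip("/") + "/")


def derived_properties(source_url, target_url):
    src, tgt = urlsplit(source_url), urlsplit(target_url)
    src_host, tgt_host = src.hostname or "", tgt.hostname or ""
    same_server = src_host == tgt_host
    src_dir = posixpath.dirname(src.path) or "/"
    tgt_dir = posixpath.dirname(tgt.path) or "/"

    if not same_server:
        direction = None  # directory comparison is only meaningful on one server
    elif src_dir == tgt_dir:
        direction = "same directory"
    elif is_ancestor_dir(tgt_dir, src_dir):
        direction = "ascending"   # target lies higher up in the directory tree
    elif is_ancestor_dir(src_dir, tgt_dir):
        direction = "descending"  # target lies deeper in the directory tree
    else:
        direction = "other"

    return {
        # Same document: the URLs differ at most in their fragment identifier.
        "same_document": urldefrag(source_url)[0] == urldefrag(target_url)[0],
        "same_server": same_server,
        "same_2nd_level_domain": last_labels(src_host, 2) == last_labels(tgt_host, 2),
        "same_3rd_level_domain": last_labels(src_host, 3) == last_labels(tgt_host, 3),
        "direction": direction,
    }
```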




TREC web track


Construction of a web corpus (WT10g) according to the following criteria:

Broadly representative of web data in general

Many inter-server links

Contains all available pages from a set of servers

Contains an interesting set of meta-data

Contains few binary, non-English or duplicate documents

Size: 10 GB

P. Bailey, N. Craswell and D. Hawking. Engineering a multi-purpose test collection for Web retrieval experiments. IP&M, to appear.



Part II: Applications of Web Corpora

1. Web Mining
2. Information Retrieval
3. Clustering and Categorisation
4. Summarisation
5. Discovery of Relations
6. Terminology Extraction
7. Information Extraction
8. Ontology Learning



Useful Methods


Machine Learning and Data Mining


Natural Language Processing


Information Retrieval


Ontologies and Semantic Web


Bibliometrics (citation analysis ~ link analysis)



Web Mining


Web Content Mining


Discovery of terminology, acronyms, concepts


Web Structure Mining


Discovery of relations, communities …


Web Usage Mining


Discovery of navigation patterns


Information Retrieval


Usage of hyperlinks for determining popularity of web pages


Hub and authority pages


Widely used: Google PageRank


Mixed results in TREC web track



Jon M. Kleinberg (1997) Authoritative Sources in a Hyperlinked Environment. Journal of the ACM

Sergey Brin, Lawrence Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems
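
The hub and authority idea can be illustrated with a short iteration in the spirit of Kleinberg's approach: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. This is a minimal sketch on a toy link graph, with a fixed iteration count chosen as an assumption and without the query-specific subgraph selection used in practice.

```python
import math


def hits(graph, iterations=50):
    """graph: dict mapping each source URL to an iterable of target URLs."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of all pages linking to the node.
        auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum of authority scores of all pages the node links to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth
```

On a graph where three pages link to `a` and two of them also link to `b`, page `a` receives the highest authority score and the pages with two outgoing links receive higher hub scores than the page with one.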




Clustering


Standard clustering algorithms form clusters by iteratively grouping documents/clusters, according to a distance measure

Content-based methods measure distance by counting terms/concepts (often TF/IDF)

Connectivity-based distance measures make use of hyperlinks
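
The iterative grouping described above can be sketched as single-link agglomerative clustering with a pluggable distance function. As a simplifying assumption, each document is represented by a set of features and compared with the Jaccard distance: with term sets the measure is content-based (plain sets instead of TF/IDF weights), with sets of outgoing link targets it is connectivity-based. The function names and the stopping threshold are illustrative.

```python
def jaccard_distance(a, b):
    """1 minus the overlap ratio of two feature sets (terms or link targets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def agglomerative_clustering(features, distance, threshold):
    """Merge the two closest clusters until their distance exceeds `threshold`.

    features: dict mapping a document id to its feature set.
    """
    clusters = [{doc_id} for doc_id in features]

    def cluster_distance(c1, c2):
        # Single link: distance between the closest pair of members.
        return min(distance(features[x], features[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

Because the distance function is a parameter, the same loop can be run once with term sets and once with link sets, which makes the two families of methods directly comparable.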




Categorisation


Categorisation algorithms determine the membership of a document in a pre-defined thematic category

Content-based categorisation methods measure distance from a representative of the category

Connectivity-based distance measures are based on the assumption that certain types of hyperlinks lead to documents of the same category
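
One simple way to act on the connectivity assumption is a majority vote over linked neighbours whose category is already known. This is a sketch of that idea only, not a method described in the talk; the function name and the input mappings are hypothetical, and all link types are treated alike.

```python
from collections import Counter


def categorise_by_links(doc, outlinks, inlinks, known_categories):
    """Assign `doc` the most frequent category among its linked neighbours.

    outlinks / inlinks: dicts mapping a URL to the URLs it links to / is linked from.
    known_categories: dict mapping already categorised URLs to a category name.
    Returns None if no neighbour has a known category; ties are broken arbitrarily.
    """
    neighbours = set(outlinks.get(doc, ())) | set(inlinks.get(doc, ()))
    votes = Counter(known_categories[n] for n in neighbours if n in known_categories)
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

A refinement along the lines of the slide would weight the votes by link type, for example counting same-server links differently from inter-server links.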


Summarisation / Keyword Extraction


Source anchor text has been used to generate short summaries of target web pages.
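
A minimal sketch of this idea groups the anchor texts of all incoming links by target URL and keeps the most frequent ones as a short description. The function name and the choice of simple frequency counting are assumptions for illustration.

```python
from collections import Counter, defaultdict


def anchor_text_summaries(links, top_n=3):
    """links: iterable of (target_url, anchor_text) pairs from the link database.

    Returns a dict mapping each target URL to its most frequent anchor texts.
    """
    texts = defaultdict(Counter)
    for target_url, anchor_text in links:
        cleaned = " ".join(anchor_text.split()).lower()  # normalise case and spacing
        if cleaned:
            texts[target_url][cleaned] += 1
    return {
        url: [text for text, _ in counts.most_common(top_n)]
        for url, counts in texts.items()
    }
```

Uninformative anchors such as "click here" would have to be filtered out in practice, for instance with a stop list.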



Discovery of Relations


Hyperlink structure reflects relations between web resources (e.g. between personal homepage, project page, organisation page)

Relations can be discovered by content-based methods and by connectivity-based methods
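
As one illustration of a connectivity-based approach, pages that link to each other can be proposed as candidates for a relation. The sketch assumes links are given as (source, target) URL pairs; treating mutual links as a relation signal is a heuristic chosen for the example, not a result reported in the talk.

```python
def mutually_linked_pairs(links):
    """Return sorted, de-duplicated pairs of distinct URLs that link to each other."""
    link_set = {(s, t) for s, t in links if s != t}  # ignore self-links
    return sorted(
        {tuple(sorted(pair)) for pair in link_set if (pair[1], pair[0]) in link_set}
    )
```

A content-based check, such as the person's name occurring on the organisation page, could then be used to confirm or reject each candidate pair.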


Terminology Extraction


Content-based: extraction of domain terminology by statistical analysis (TF/IDF …) and/or phrasal chunking

Applicability of connectivity-based methods?
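
The statistical side can be sketched with a plain TF/IDF ranking: a term scores highly in a document if it is frequent there and rare in the rest of the collection. The sketch uses one common weighting variant (raw term frequency times the logarithm of the inverse document frequency), handles single-word terms only, and leaves out phrasal chunking; the function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def tf_idf_terms(documents, top_n=5):
    """documents: dict mapping a document id to its text.

    Returns a dict mapping each document id to its top-ranked terms.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    document_frequency = Counter()
    for tokens in tokenized.values():
        document_frequency.update(set(tokens))  # count each term once per document
    n_docs = len(documents)

    ranked = {}
    for doc_id, tokens in tokenized.items():
        term_frequency = Counter(tokens)
        scores = {
            term: count * math.log(n_docs / document_frequency[term])
            for term, count in term_frequency.items()
        }
        # Highest score first; alphabetical order breaks ties deterministically.
        ranked[doc_id] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return ranked
```

Terms that occur in every document, such as function words, receive a score of zero and therefore sink to the bottom of the ranking.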


Information Extraction



Automatic extraction of meta-data

Extraction of named entities for concept-based indexing

Extraction of templates/relations for relation-based indexing and question answering



Ontology Learning


Extraction of candidates by frequency of occurrence in similar contexts

Usage of textual clues (“such as”, “sogar” …)

Applicability of connectivity-based methods?



Definition and acronym mining
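
The textual clue “such as” can be turned into a simple extraction pattern that proposes (narrower term, broader term) candidates. The regular expression below is a deliberately naive assumption: it takes the single word before “such as” as the broader term and splits the rest of the clause on commas and “and”/“or”, so its output needs filtering before it can feed an ontology.

```python
import re

# Broader term: the word before "such as"; enumeration: everything up to "." or ";".
SUCH_AS = re.compile(r"(\w+),? such as ([^.;]+)")
# Separators inside the enumeration: ", and", " or", plain commas, etc.
SEPARATOR = re.compile(r",?\s+(?:and|or)\s+|,\s*")


def hyponym_candidates(text):
    """Return (narrower_term, broader_term) candidate pairs found in `text`."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        broader = match.group(1).lower()
        for item in SEPARATOR.split(match.group(2)):
            item = item.strip().lower()
            if item:
                pairs.append((item, broader))
    return pairs
```

For the sentence "Systems use resources such as corpora, lexicons and grammars." the sketch proposes corpora, lexicons and grammars as narrower terms of resources.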


Part III: LT-World Web Corpus

1. Content of LT World
2. Ontology
3. Hyperlinking within LT World
4. Construction of the corpus



LT World: Idea and Context


The virtual information center is a comprehensive WWW-based information and knowledge service for the entire area of language technology.

LT World is a “virtual” center in the sense that most information will physically remain with its creators or with other service providers.

The virtual information center has been online since October 2001 under the name “LT World”, short for “Language Technology World” (www.lt-world.org)


Virtual Information Center - LT World


Information and Knowledge


Technical and Scientific Information


Players and Teams


Persons, Projects, Organisations


Resources and Results


Research Systems, Commercial Products


Communication and Events


News, Conferences




LT World Ontology

[Figure: the LT World ontology in three layers. Layer 1: Dublin Core. Layer 2: Specific Ontologies (Publications, Products, Projects, People, Corpora, etc.). Layer 3: Ontology for CL & LT.]


LT World Ontology


Dimensions


Linguality (monolingual, multilingual, cross-language)


Application


Computational/mathematical methods


Linguistic Models / Theories


Level of linguistic description/processing


Technologies


Language(s)


Ontology is modelled in RDF with Protégé 2000


LT World: Coverage


99 topic nodes


300 NLP tools and products


1800 people


850 organisations


500 projects




Data Acquisition Process


Manual collection, categorization and annotation of URLs by students and staff

Sources: conference proceedings and journals, lists of links on the web

Self-registration and correction of data by users of the service

Technical/scientific information in topic nodes has been provided by domain experts


LT World: Topic Nodes


Topic nodes are the main information unit of the area “Knowledge and Information”. They are organized in a shallow, slightly multidimensional hierarchy that follows the chapter plan of the second edition of the Language Technology Survey.


Example of the shallow hierarchy:

Information Extraction


Named Entity Recognition


Terminology Extraction


Relation Extraction


Answer Extraction





Information for each Topic


Name


Acronyms


Alternative names (a.k.a.), Term Translations


Short Definition


Overview Article (from HLT Survey)


Topic Websites


R&D Prototypes/Products


Projects


People


Literature



Hyperlinking between Sections


Corpus Construction


Start from URLs in LT-World collection

Expand document set by recursively following outgoing hyperlinks using a webspider (e.g., GNU wget)

Expand document set by following incoming hyperlinks (“link” query to search engine)

Expand document set by search engine queries with domain terminology


Construct document database and link database


(Filter out irrelevant documents)


Publish Corpus
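
The expansion along outgoing hyperlinks can be sketched as a breadth-first traversal from the seed URLs with a depth limit. To keep the sketch runnable without network access, fetching and link extraction are hidden behind a caller-supplied function `get_outlinks`, which is a hypothetical interface; the expansion via incoming links and search engine queries, as well as relevance filtering, are not covered.

```python
from collections import deque


def expand_corpus(seed_urls, get_outlinks, max_depth=2, max_documents=1000):
    """Collect documents reachable from the seeds and the links between them.

    get_outlinks: function mapping a URL to the URLs that page links to.
    Returns (set of document URLs, list of (source, target) link pairs).
    """
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    links = []
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # document is kept, but its links are not followed
        for target in get_outlinks(url):
            links.append((url, target))
            if target not in seen and len(seen) < max_documents:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen, links
```

The two return values correspond to the raw material for the document database and the link database; the `seen` set also prevents the traversal from looping on cyclic link structures.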




Part IV: Research Directions

Categorisation / Information Extraction

Discovery of Relations for Hyperlinking

Other


Categorisation and Information Extraction


Research objectives


Find a method for categorising documents according to the LT-World ontology

Find a method for extraction of meta-information

Compare and combine content-based and connectivity-based methods

If successful, it will contribute to semi-automatic extension of the coverage of LT-World


Discovery of Relations


Objective: develop a method for finding pairs of related documents, e.g. personal page and organisation page.

Content-based and connectivity-based methods are applicable

If successful, it will enable a significant improvement of LT-World (resource discovery, resource annotation)


Other



Objective: compare and combine content-based and connectivity-based clustering methods

Applications:

1. Information Retrieval
2. Clustering
3. Summarisation
4. Terminology Extraction
5. Ontology Learning



Conclusion


Main research interest: comparison and combination of content-based and connectivity-based methods

Main application impact: going from a set of “seed” web pages to a domain-specific information system