Mitglied der

DGD 2.0: A Web-based Navigation Platform for the Visualization, Presentation and Retrieval of German Speech Corpora

Joachim Gasch

E-mail: gasch@ids-mannheim.de


1. Introduction

1.1 The Collection of German Speech Corpora at the IDS

1.2 The Standardization Approach for cross-Corpus Information Management

2. The Online Navigation Platform

2.1 The Navigation Interface - Design Principles

2.2 The Visualization and Presentation of Speech Corpus Content

2.2.1 Generic Visualization of the XML Meta-Information of Speech Corpora

2.2.2 Transcript Visualization and Presentation

2.2.3 Media Presentation

3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components

3.1 The Full-Text Search Module

3.2 XQuery Information Retrieval in structured XML Documents

4. Summary and Outlook

1. Introduction

1.1 The Collection of German Speech Corpora at the IDS



The IDS hosts a wide range of historical and contemporary German speech corpora

Many historical corpora can be (partially) accessed online via the Database for Spoken German (DGD)

=> Main objectives of the current DGD 2.0 project:

Generic, cross-corpus approach to speech corpus management

Normalized integration of historical and recent speech corpora

Sustainability of speech corpus data components

Object-oriented user interface (based on document structures) for corpus exploration and querying

1.2 The Standardization Approach for cross-Corpus Information Management



The speech corpus system manages meta-information of media source signals

Different corpora: the information structures of data components may vary considerably depending on the linguistic research question, e.g. in the genres represented, the degree of content restriction, the physical data structure and the research field (natural vs. elicited speech)

=> Web-based speech corpus navigation platform:

Standardization concept: cross-corpus solution for large speech corpus collections rather than for particular speech corpus projects

Definition of a generic, system-wide data model containing the following components (systematically interlinked):

+ structured XML documentation instances on corpus, event and speaker level

+ unstructured, semi-structured or structured transcripts (time-aligned, multi-dimensional)

+ media source files

+ optional: unstructured secondary documents
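
The interlinked components listed above could be modelled roughly as follows. This is only a minimal sketch: all class and field names (Corpus, Event, Speaker, Transcript, MediaFile, media_for_speaker) are hypothetical and are not taken from the DGD 2.0 implementation, which stores its documentation instances as XML.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MediaFile:
    path: str  # hypothetical location of a source recording


@dataclass
class Transcript:
    path: str
    structure: str  # "unstructured", "semi-structured" or "structured"


@dataclass
class Speaker:
    speaker_id: str
    documentation_xml: str  # structured XML documentation instance


@dataclass
class Event:
    event_id: str
    documentation_xml: str
    speakers: List[Speaker] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)
    media_files: List[MediaFile] = field(default_factory=list)


@dataclass
class Corpus:
    corpus_id: str
    documentation_xml: str
    events: List[Event] = field(default_factory=list)
    secondary_documents: List[str] = field(default_factory=list)  # optional


def media_for_speaker(corpus: Corpus, speaker_id: str) -> List[str]:
    """Follow the links corpus -> event -> speaker/media to list recordings."""
    return [
        media.path
        for event in corpus.events
        if any(s.speaker_id == speaker_id for s in event.speakers)
        for media in event.media_files
    ]
```

Because every event holds references to its speakers, transcripts and media files, a question such as "which recordings involve this speaker?" becomes a simple traversal of the links.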

Interlinked components of the normalized speech corpus data model

2. The Online Navigation Platform

2.1 The Navigation Interface - Design Principles

Object-oriented, document-centric interaction paradigm: based on the document structures to be managed by the system

Provision of adaptive views of speech corpus data components

=> The application menu:

Flat structure of the navigation menu

Fixed position at the top of the screen

Permanent, homogeneous access to application components

Flat and hierarchically subdivided menu entry points are indicated by two distinct symbols

=> Classifying icons


Intuitive user orientation by marking specific types of corpus data components with their corresponding icons

=> "Breadcrumb" navigation:

Helps users identify their current position in the navigation tree
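
A breadcrumb trail of this kind can be derived by walking from the current node up to the root of the navigation tree. The sketch below assumes a simple child-to-parent mapping; the node labels and the PARENT table are invented for illustration and do not reflect the actual DGD menu structure.

```python
from typing import Dict, List, Optional

# Hypothetical navigation tree: each node maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "DGD": None,
    "Corpus DH": "DGD",
    "Events": "Corpus DH",
    "DH--_E_00167": "Events",
}


def breadcrumb(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the root of the navigation tree down to `node`."""
    trail: List[str] = []
    current: Optional[str] = node
    while current is not None:
        trail.append(current)
        current = parent[current]
    return list(reversed(trail))


print(" > ".join(breadcrumb("DH--_E_00167", PARENT)))
```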





2.2 The Visualization and Presentation of Speech Corpus Content

2.2.1 Generic Visualization of the XML Meta-information



Native XML database storage of documentation instances

Use of a generic XML rendering module to avoid corpus-specific instance visualizations, providing:

+ expandable / collapsible document nodes

+ node-level selection functionality

+ direct access to hyperlinks

=> The cross-corpus (i.e. independent of any single corpus) display method for corpus, event and speaker documentation offers an ergonomic navigation experience (especially for large data-centric XML instances)
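
The idea of schema-independent rendering can be illustrated with a small recursive function that never inspects element names. This is a sketch using Python's standard xml.etree.ElementTree, not the rendering module used in DGD 2.0; the sample document and the "[+]" / "[-]" markers for collapsed and expanded nodes are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List


def render(element: ET.Element, depth: int = 0, max_depth: int = 1) -> List[str]:
    """Render an arbitrary XML element as an indented outline.

    Inner nodes at `max_depth` or deeper are shown collapsed ("[+]").
    The function never looks at element names, so it works for any schema.
    """
    indent = "  " * depth
    text = (element.text or "").strip()
    children = list(element)
    if not children:
        return [f"{indent}{element.tag}: {text}"]
    if depth >= max_depth:
        return [f"{indent}[+] {element.tag} ({len(children)} child nodes)"]
    lines = [f"{indent}[-] {element.tag}"]
    for child in children:
        lines.extend(render(child, depth + 1, max_depth))
    return lines


# Invented sample instance; real documentation instances are far larger.
SAMPLE = (
    "<event><id>E1</id>"
    "<location><city>St. Gallen</city><country>Switzerland</country></location>"
    "</event>"
)
for line in render(ET.fromstring(SAMPLE)):
    print(line)
```

Raising max_depth corresponds to expanding further levels of the document tree.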

Generic XML document rendering

=> Documentation of geocodes:



The geographic coordinates of event locations may be documented in specific speech corpus projects

A geographic map can be displayed on demand: the example shows the map for the event DH--_E_00167 (latitude 47.423336, longitude 9.377225), which took place in St. Gallen (Switzerland)
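
Turning documented geocodes into a map link might look like the sketch below. The slides do not say which map service DGD 2.0 uses; OpenStreetMap and the function name map_url are assumptions made purely for illustration.

```python
from urllib.parse import urlencode


def map_url(lat: float, lon: float, zoom: int = 14) -> str:
    """Build an OpenStreetMap link with a marker at the given coordinates."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    query = urlencode({"mlat": lat, "mlon": lon})
    return f"https://www.openstreetmap.org/?{query}#map={zoom}/{lat}/{lon}"


# Coordinates documented for the event DH--_E_00167 (St. Gallen)
print(map_url(47.423336, 9.377225))
```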

Geographic map (based on documented geocodes showing the event location)

2.2.2 Transcript Visualization and Presentation



For larger speech corpus collections, a common concept of "transcript" becomes fuzzy:

+ Annotation of distinct phenomena

+ Use of heterogeneous (transcript-editor-specific) data formats

Historical speech corpora:

+ Unstructured transcript data formats (only layout-oriented)

Contemporary speech corpora:

+ Use of annotation tools available nowadays: structured data formats but no cross-corpus structure homogeneity

Cross-corpus visualization is possible for the transcript-related part of the event documentation via the menu item "Transkripte" (transcripts), which provides corpus-specific transcript access lists

Corpus-specific transcript list for the speech corpus DS

2.2.3 Media Presentation



Speech corpora may include different types of interdependent media files:

+ One event is related to one or more source files: the raw material recorded for an event (originating directly from an audio device)

+ An event can be composed of several speech events: further segmentation of the source files into speech-event-specific recordings

All relevant information regarding the different media file types is maintained in the meta-documentation of the corresponding event and can be accessed via the list under the menu item "Aufnahmen" (recordings)

Corpus-specific list of source recordings for the speech corpus DH

3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components



Media file content can only be located via descriptive meta-information:

+ meta data (schema-valid XML instances)

+ transcript data (unstructured, semi-structured, structured)

Transcript data in speech corpus collections varies widely in its degree of structuring

Retrieval strategies depend on this degree: from simple full-text search to complex layer-aware query processing

Transcript incompatibilities within a single corpus (worst-case scenario):

+ Signal segmentation without precise segmentation guidelines (e.g. phones, words, phrases or turns)

+ No or insufficient naming conventions for the different transcript layer descriptors (e.g. no unique descriptor used for the orthographic transcription layer)

+ No exact semantic layer definition available, or semantic mix-up of layer content (e.g. orthographic and phonetic markup mixed in one single layer)

+ No exact syntactic definition of layer content available, or syntactic mix-up of layer content (e.g. mixed punctuation or capitalization conventions in the orthographic layer)

+ Violation of cross-layer time relations (e.g. caused by interval changes made with multi-layer transcript editors without layer inheritance control)
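
The last incompatibility in the list, violated cross-layer time relations, can be detected mechanically. The sketch below assumes that each interval of a dependent layer (e.g. words) must lie entirely inside one interval of its parent layer (e.g. turns); the layer names, the tuple representation and the sample values are illustrative assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def violations(parent: List[Interval], child: List[Interval]) -> List[Interval]:
    """Return child intervals that are not fully contained in any parent interval."""
    return [
        (c_start, c_end)
        for c_start, c_end in child
        if not any(
            p_start <= c_start and c_end <= p_end for p_start, p_end in parent
        )
    ]


turns = [(0.0, 2.5), (2.5, 5.0)]
words = [(0.0, 1.0), (1.0, 2.5), (2.4, 3.0), (3.0, 5.0)]
# The word interval (2.4, 3.0) straddles the boundary between the two turns.
print(violations(turns, words))
```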


3.1 The Full-Text Search Module



No structured data is required (but can optionally be included)

Advantages: short query response times, easy user interface handling

The full-text search functionality is implemented using Oracle Text

Examples of the provided full-text query features:

+ The single-character and multi-character wildcards "_" and "%":

    _ind matches e.g. "Kind" and "Wind"

    %wind matches e.g. "Nordwind" or "Südwind"

+ The operators AND and OR build logical relations between search terms:

    Nordwind AND Südwind matches only documents with occurrences of both terms

+ The NOT operator excludes a specific search term:

    Nordwind NOT Südwind matches only documents containing "Nordwind" but not containing "Südwind"

+ The NEAR operator finds documents depending on the word distance of search terms:

    NEAR((Schule, Kirche), 4, TRUE) matches documents where both search terms occur, in the given order, within a (maximum) word distance of 4 words.
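
To make the wildcard semantics concrete, the following sketch emulates "_" (exactly one character) and "%" (any number of characters) with regular expressions. It is not Oracle Text itself; the helper names and the case-insensitive matching are assumptions for illustration.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a "_" / "%" wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")       # exactly one arbitrary character
        elif ch == "%":
            parts.append(".*")      # any number of characters, including none
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern: str, word: str) -> bool:
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


for word in ["Kind", "Wind", "Nordwind", "Südwind"]:
    print(word, matches("_ind", word), matches("%wind", word))
```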

Full-text search in semi-structured transcript data with search results (KWIC list)
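
A KWIC (keyword in context) list shows each hit together with a few words to its left and right. The function below is a minimal sketch of that idea on whitespace-tokenized text; the tokenization, the context width and the sample sentence are assumptions, and the platform's own result list is produced by its search module.

```python
from typing import List, Tuple


def kwic(text: str, keyword: str, width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, hit, right context) for every occurrence of keyword."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Compare without surrounding punctuation and without regard to case.
        if token.strip('.,;:!?"').lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


for left, hit, right in kwic("der Nordwind weht und der Südwind ruht", "Südwind", 2):
    print(f"{left:>20} | {hit} | {right}")
```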

3.2 XQuery Information Retrieval in structured XML Documents


The full-text search option is not sufficient for retrieval in fine-grained XML instances (such as meta data or time-aligned multi-dimensional transcripts)

XQuery allows the implementation of context-sensitive queries on the hierarchically interdependent information units of XML-structured data:

+ criteria-specific information selection and filtering

+ joining of data from document selections

+ sorting, grouping, aggregating, transforming and restructuring of data

+ arithmetic calculations on numbers and dates

Powerful queries can be defined, but detailed knowledge of the underlying information structures is necessary

=> Two different approaches for the implementation of Web-based XQuery retrieval interfaces:

+ HTML form with a graphical representation of the XML tree (easy to use but limited flexibility for query definition)

+ HTML form providing a text area field to enter the XQuery as plain text (intended for system experts only; complex queries on data-centric instances and cross-structural joins are also possible)
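
The kinds of operations listed above (selection, grouping, aggregation, arithmetic) can be illustrated on a toy document. The sketch below is a Python analogue using the standard xml.etree.ElementTree module, not XQuery, and the element and attribute names (corpus, event, location, speaker, age) are invented; they do not reflect the DGD documentation schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample instance for illustration only.
SAMPLE = """
<corpus>
  <event id="E1"><location>Mannheim</location><speaker age="34"/><speaker age="51"/></event>
  <event id="E2"><location>St. Gallen</location><speaker age="27"/></event>
  <event id="E3"><location>Mannheim</location><speaker age="45"/></event>
</corpus>
"""
root = ET.fromstring(SAMPLE)

# Selection and filtering: events recorded in Mannheim
mannheim = [
    e.get("id") for e in root.findall("event") if e.findtext("location") == "Mannheim"
]

# Grouping, aggregating and sorting: number of events per location
per_location = sorted(
    Counter(e.findtext("location") for e in root.findall("event")).items()
)

# Arithmetic: mean speaker age across all events
ages = [int(s.get("age")) for s in root.iter("speaker")]
mean_age = sum(ages) / len(ages)

print(mannheim, per_location, mean_age)
```

Even in this small example the query author has to know where locations and speaker ages sit in the tree, which is the reason the plain-text interface is aimed at system experts.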


HTML form providing a graphical XQuery composition interface

HTML form for XQuery plain text submission

4. Summary and Outlook



Media source files become analyzable via their appropriate meta-information

Contemporary speech corpus systems have to close the gap between the processing of binary media data and related meta-information

The need for standardization of speech corpus components is commonly accepted

But: the identification of all necessary parameters for a cross-corpus standardization remains an outstanding goal

Evolving technologies like the MPEG-7 standard might provide appropriate logic to achieve the standardized integration of the different audiovisual information types (potentially involved in media corpora):


+ Audio


+ Voice


+ Video


+ Images


+ Graphs


+ 3D models









=> Questions? Suggestions?