PANACEA-WP3-Y3-review_v03x


PANACEA WP3

The Platform

WP participants: UPF, ILC, ILSP, LG, DCU, ELDA

Final Annual Review

19th February 2013

Marc Poch, UPF (marc.pochriera@upf.edu)

Summary

Objectives
Platform components / Demo
Achievements
  Functional platform
  Interoperability: Travelling Object, Common Interfaces, format converters, etc.
  Scalability
WP7 Evaluation
Conclusions and future work

Objectives

Development of a platform (a space of interoperability defined by standardized protocols and common interfaces) for the easy integration of a variety of software components, tools and methodologies deployed as web services to configure a factory for the automation of acquisition, processing and annotation of language resources.

WP3.1 (T1-T6): Architecture and design of the platform
WP3.2 (T15-T30): Workflow editor and engine
WP3.3 (T7-T30): Common interfaces, middleware and temporary files, journaling, etc.
WP3.4 (T15-T30): The Registry
WP3.5 (T7-T30): Deployment of web services of the components supplied by WP4 to WP6

From local tools to sharing workflows

Tools to be integrated
Web Service wrapper
The Registry
Common Interfaces
Format Converters
Workflow editor and engine
Sharing workflows
Clients: Java, Python, Perl, etc.
Platform tools and portals


Web Services: share tools (remotely run distributed tools). SOAP or REST. Soaplab; JAX-WS, Axis, CXF, etc.
Registry: share and find Web Services. BioCatalogue. PANACEA Registry: registry.elda.org
Workflows: call / chain Web Services. Taverna: www.taverna.org.uk
Social Network: share and find workflows. myExperiment. PANACEA myExperiment: myexperiment.elda.org

PANACEA Platform: uses, adapts and improves myGrid tools for eScience (used in biology, social science, music, astronomy, multimedia and chemistry).

Technological option: Web Services

Soaplab 2 (SOAP)

Easy deployment of command line tools as WS (Java, Python, C++, UIMA, etc.)
Clients: Java, Python, Perl, Taverna, etc.
No coding needed! Only metadata
"Polling" techniques for long lasting tasks
Web form to run the web services
URL input / output ready
PANACEA improvement for SOAP messaging (network usage and memory)
PANACEA limit for multiple users
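
The limit for multiple users mentioned above can be pictured as a cap on how many tool executions a server accepts at once. The following is only a minimal sketch of that idea, not the PANACEA or Soaplab implementation: `MAX_PARALLEL`, `run_tool` and the simulated work are assumptions made for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 3  # hypothetical cap on simultaneous tool executions

_slots = threading.BoundedSemaphore(MAX_PARALLEL)
_lock = threading.Lock()
_active = 0
peak_active = 0  # highest number of jobs observed running at the same time


def run_tool(job_id):
    """Stand-in for one web service call that runs a command line tool."""
    global _active, peak_active
    with _slots:  # waits here while MAX_PARALLEL jobs are already running
        with _lock:
            _active += 1
            peak_active = max(peak_active, _active)
        time.sleep(0.01)  # simulated processing time
        with _lock:
            _active -= 1
    return job_id


# Ten client threads submit twenty jobs; the semaphore keeps the load bounded.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(run_tool, range(20)))
```

Requests beyond the cap simply wait for a free slot, so many users slow each other down instead of exhausting the machine.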


[Diagram: Web Services; Workflow editor (Taverna); Registry (BioCatalogue); Social network (myExperiment)]

Technological option: Registry

User friendly GUI
Free, open source, continuously maintained
Search function
User ratings (user feedback)
Service annotations and Language Categorization (PANACEA)
Monitoring system (web service status and data results)


Technological option: Taverna

User friendly GUI
Free and open source
Continuously maintained (v. 2.4)
SOAP and REST web services
Credentials manager (passwords, certificates, etc.)
Multiple files processing ("lists")
PANACEA workflows, best practices, videos, etc.:
  Parallelization, error recovery: "retries", polling
PANACEA collaboration: bug fixing and pre-release tests


Demos

Previous Review:
  PANACEA Registry / PANACEA myExperiment
  Run Web Services and Workflows
  Design and merging of workflows in Taverna
Final Review: Specific examples
  Creation of a bilingual dictionary
  Twitter NLP
  Web cleaner and anonymizer
  PANACEA Registry / PANACEA myExperiment

Demos I

Creation of a bilingual dictionary
http://myexperiment.elda.org/workflows/93

Input: Pairs of Basic Xces Documents
  English: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/1.xml
  French: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/191.xml

1. Sentence alignment: Hunalign (3rd party tool) (Interoperability)
2. PoS tagging: Treetagger (3rd party tool) (Interoperability)
3. Build phrase tables: Moses (3rd party tool) (Interoperability)
4. Bilingual dictionary extractor

Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_bilingual_dictionary_extraction_v01.mp4
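
The four steps of this demo form a chain in which each service consumes the output of the previous one. The sketch below shows only that chaining pattern. The step functions are hypothetical local stand-ins that record their name; they do not call the real web services or reproduce the Taverna workflow.

```python
from functools import reduce


def make_step(name):
    """Build a stand-in for one web service; it only records that it ran."""
    def step(trace):
        return trace + [name]
    return step


# Step names follow the slide; the functions themselves are placeholders.
WORKFLOW = [
    make_step("sentence alignment (Hunalign)"),
    make_step("PoS tagging (Treetagger)"),
    make_step("build phrase tables (Moses)"),
    make_step("bilingual dictionary extractor"),
]


def run_workflow(steps, initial):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda data, step: step(data), steps, initial)


trace = run_workflow(WORKFLOW, ["input: pair of Basic Xces documents"])
for line in trace:
    print(line)
```

In the platform itself this chaining is drawn in the workflow editor, and each step would be a remote call rather than a local function.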

Demos II

Twitter NLP + Registry (3rd party tool)

This web service is based on the Twitter NLP tool developed by Noah's ARK group. Noah's ARK group is Noah Smith's research group at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

1. Search the WS in the Registry
2. Check monitoring system
3. Use web client with example data

Demos III

Web cleaner and anonymizer
http://myexperiment.elda.org/workflows/98

Input: a list of URLs to process
Example: a web article from www.fifa.com

1. ILSP Web cleaner and text extractor WS
2. UPF Anonymizer WS
   Internally calls Freeling NER WS (3rd party tool) (Interoperability)

Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_web_cleaner_and_anonymization_v01.mp4


WP3 Achievements

Functional and Operational Platform
  Multiple tools, websites and features
  Ready to use
  Usability
  Real Users
Interoperability
  Common Interfaces
  Travelling Object
  3rd party tools Integration
  Format converters
Scalability
  Web service scalability: long lasting tasks
  Workflow design optimization: robustness
  Machine resources: handling parallel requests

Functional and Operational Platform

PANACEA Registry
  157 web services
    PANACEA WS benefits: WS are easy to deploy (low maintenance cost)
  More than 1300 annotations (Usability / Doc.)
  A cloud of 164 tags
  Monitoring system: WS up and running 94.82% since their deployment (97%) (Availability)
PANACEA myExperiment
  74 shared workflows
  Storage System (Usability)

Functional and Operational Platform: Tutorials and Documentation

Tutorials
  Specific and General tutorials
  More than 12 videos (Usability)
  Frequently Asked Questions
Documentation
  Registry annotations, tags and Categories
  Common Interfaces documentation: xml, web, etc.
  Travelling Objects documentation

Functional and Operational Platform: Users

WP7 Validators
Linguatech (WP8)
Qualia (Business intelligence)
CNGL (Centre for Next Generation Localisation)
INCYTA (Translation)
Master and PhD students make use of the PANACEA platform
http://ws02.iula.upf.edu/panacea/statistics/upf-statistics.html

Interoperability

Three levels of interoperability:
  COMMUNICATION PROTOCOLS: Soap, Rest
  DATA
  PARAMETERS

[Figure, data level: Tool A, Tool B and Tool C chained with formats N, M and L; given format N, Tool B and Tool C produce empty output.]
Tool B does not "understand" format N!
All tools understand the previous format
[Figure, parameter level: Tool A and Tool B sharing parameters A, B, C, D, versus Tool A with parameters Y, T, Q, Z and Tool B with A, B, C, D.]

Common Interface

A Common Interface (CI) defines the mandatory parameters for every functionality:

PoS Tagger A
  MANDATORY: input, language
  OPTIONALS: Param A
PoS Tagger B
  MANDATORY: input, language
  OPTIONALS: PARAM 1, PARAM 2

http://panacea-lr.eu/en/info-for-professionals/documents/
http://registry.elda.org
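
One way to read the Common Interface idea is as a contract: every service of a given functionality accepts the same mandatory parameters, while each tool may add its own optional ones. The sketch below is an illustration under that reading, not the PANACEA CI definition. The class names, the `validate` method and the lower-case parameter names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonInterface:
    """Mandatory parameters shared by every service of one functionality."""
    functionality: str
    mandatory: frozenset


@dataclass
class Service:
    name: str
    interface: CommonInterface
    optional: frozenset = frozenset()

    def validate(self, params):
        """Return a list of problems; an empty list means the call is acceptable."""
        given = set(params)
        allowed = self.interface.mandatory | self.optional
        problems = [f"missing mandatory parameter: {p}"
                    for p in sorted(self.interface.mandatory - given)]
        problems += [f"unknown parameter: {p}" for p in sorted(given - allowed)]
        return problems


# Parameter names mirror the slide: both taggers share "input" and "language".
POS_TAGGING = CommonInterface("PoS tagging", frozenset({"input", "language"}))
tagger_a = Service("PoS Tagger A", POS_TAGGING, frozenset({"param_a"}))
tagger_b = Service("PoS Tagger B", POS_TAGGING, frozenset({"param_1", "param_2"}))

call = {"input": "http://example.org/doc.txt", "language": "en"}
print(tagger_a.validate(call), tagger_b.validate(call))
```

Because the same minimal call is valid for both taggers, one can be swapped for the other in a workflow without touching the rest of the chain.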

Travelling Object

The Travelling Object (TO) is the common data and metadata format used in PANACEA to make components understand each other. (Interoperability)

TO1 is the minimal common vertical in-line format used by the deployed tools since the first version of the platform, using the XCES standard.
TO2: GrAF standard. The Graph Annotation Format (Ide and Suderman, 2007) is the XML serialization of LAF (ISO 24612, 2009).
LMF for lexical resources
CoNLL for parsers
Converters and adapted WS outputs



Format Converters

31 Format converters on the PANACEA Registry

Freeling to TO (CNR): http://registry.elda.org/services/207
KAF to TO (CNR): http://registry.elda.org/services/208
Basic Xces to txt (CNR): http://registry.elda.org/services/209
PoS tag. (Freeling, Treetagger) to GrAF (UPF): http://registry.elda.org/services/142
Dependency parsing (Freeling) to GrAF (UPF): http://registry.elda.org/services/197
Dependency CoNLL to GrAF (CNR): http://registry.elda.org/services/254
Word doc to txt (UPF): http://registry.elda.org/services/112
In-house mwe to LMF (CNR): http://registry.elda.org/services/296
Pdf to text (UPF): http://registry.elda.org/services/116
Multi. encodings converter (ISO, UTF, etc.) (UPF): http://registry.elda.org/services/114
Aligner to TO (DCU): http://registry.elda.org/services/69
Sentence alignment to TMX (DCU): http://registry.elda.org/services/219
Treetagger to MOSES (DCU): http://registry.elda.org/services/275
UIMA to GrAF (ILSP): http://registry.elda.org/services/182

METASHARE metadata generators: http://myexperiment.elda.org/workflows/96
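
When two tools do not share a format, a workflow designer needs to find the converter that bridges them. The sketch below keeps a few of the converters listed above in a lookup table keyed by source and target format. The table layout and the `find_converter` and `targets_from` helpers are illustrative assumptions, not a Registry API.

```python
# (source format, target format) -> Registry entry, copied from the list above.
CONVERTERS = {
    ("Freeling", "TO"): "http://registry.elda.org/services/207",
    ("KAF", "TO"): "http://registry.elda.org/services/208",
    ("Basic Xces", "txt"): "http://registry.elda.org/services/209",
    ("Dependency CoNLL", "GrAF"): "http://registry.elda.org/services/254",
    ("UIMA", "GrAF"): "http://registry.elda.org/services/182",
    ("Treetagger", "MOSES"): "http://registry.elda.org/services/275",
}


def find_converter(source, target):
    """Return the Registry URL of a direct converter, or None if none is listed."""
    return CONVERTERS.get((source, target))


def targets_from(source):
    """List every format that the given source can be converted to directly."""
    return sorted(t for (s, t) in CONVERTERS if s == source)


print(find_converter("UIMA", "GrAF"))
print(targets_from("Freeling"))
```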


3rd party tools integration

PANACEA WS wrapper (Soaplab) and the CI make it easy for WS Providers to integrate 3rd party tools.

ILSP tools are UIMA tools: UIMA
Freeling: UPC
Treetagger: University of Stuttgart
Twitter NLP: Carnegie Mellon University
MALT Parser: Uppsala University
DeSR: Università di Pisa
MOSES / Giza++
DELiC4MT (MT evaluation): DCU
Berkeley tagger, parser, aligner: University of California, Berkeley

Web Services Scalability

Web services are being deployed using Soaplab 2.3.2:
  Service providers only need to use metadata (ACD) files (Usability)
  Web client application to test WSs: Spinet (Usability)
  PANACEA developers have been in contact with Soaplab developers (Collaboration)
  SOAP protocol standard (Interoperability)
    WS can be called from Taverna or other workflow editors
    WS can be called with many programming languages: Python, Perl, Ruby, Java, etc.
  Soaplab polling to avoid client timeouts (Scalability)
PANACEA Improvements (Scalability)
  Parallel request limit system
  SOAP messaging optimization
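
Polling avoids client timeouts because the client starts a job, disconnects, and then asks for its status at intervals instead of holding one long connection open. The sketch below shows the general pattern only. `is_done` stands in for a status request to the service, and the parameter names and millisecond units are assumptions, not Soaplab or Taverna settings.

```python
import time


def poll_until_done(is_done, interval_ms=2000, backoff=1.0,
                    max_interval_ms=10000, sleep=time.sleep):
    """Ask for the job status repeatedly; return the waits used, in ms."""
    waits = []
    interval = interval_ms
    while not is_done():
        sleep(interval / 1000.0)
        waits.append(interval)
        # Grow the interval by the backoff factor, but never past the maximum.
        interval = min(interval * backoff, max_interval_ms)
    return waits


def make_job(checks_needed):
    """Fake job that reports 'done' after a fixed number of status checks."""
    state = {"checks": 0}

    def is_done():
        state["checks"] += 1
        return state["checks"] > checks_needed
    return is_done


no_sleep = lambda seconds: None  # skip real waiting in this demonstration
constant = poll_until_done(make_job(3), sleep=no_sleep)
growing = poll_until_done(make_job(4), backoff=2.0, sleep=no_sleep)
print(constant, growing)
```

With a backoff of 1 the interval stays constant; a larger factor spaces the checks out for very long tasks, which reduces load on the server.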



Workflows design optimization: Robustness

Building workflows with Taverna
  Version 2.4.2 (Scalability)
  Polling (Soaplab) (Scalability)
    long lasting web service calls without timeouts
  Retries (Scalability)
  Parallelization (Scalability)
  Tutorials and videos (Usability)
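
Retries make a long workflow robust: a call that fails once is attempted again after a delay instead of aborting the whole run. The sketch below illustrates that behaviour in plain Python. It is not Taverna's implementation, and the function name, parameters and default values are illustrative assumptions.

```python
import time


def call_with_retries(call, retries=2, initial_delay_ms=5000, factor=2.0,
                      max_delay_ms=150000, sleep=time.sleep):
    """Run call(); on failure wait and try again, up to `retries` extra attempts."""
    delay = initial_delay_ms
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # no attempts left: let the error reach the caller
            sleep(delay / 1000.0)
            # Lengthen the next wait, capped at the maximum delay.
            delay = min(delay * factor, max_delay_ms)


def make_flaky(failures):
    """Fake service call that fails a given number of times, then succeeds."""
    state = {"left": failures}

    def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("temporary service error")
        return "ok"
    return call


delays = []  # seconds the caller would have slept
result = call_with_retries(make_flaky(2), sleep=delays.append)
print(result, delays)
```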


Machine Resources: handling parallel requests

Parallelization level 3 (3 parallel requests per service * 2 services = 6 concurrent requests)

Workflow name: Freeling_tagging_for_crawled_data_with_output_download
Workflow file: massive_freeling_for_crawled_data_v11_download.t2flow
myexp url: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench

VM            Cores  RAM  HD
iula04 (UPF)  4      8    40GB (SAS)

WS   parall.  poll. int.  poll. backoff  poll. max int.  retries  ini. delay  max     factor
WS1  3        2000        1              10000           2        5000        150000  20
WS2  3        2000        1              10000           2        5000        150000  20
WS1: python_preprocess + freeling_tagging + python_postprocessing
WS2: postagger_to_xces_converter

corpus  list file               urls   Tokens  url example
MCv2    LAB_ES_list.sorted.txt  13188  61 M    http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml

Name                                                    Status    Queued it.  It. done  It. w/error  Average time/it.
Freeling_tagging_for_crawled_data_with_output_download  Finished  -           -         -            5.2 h
download_dataUrl                                        Finished  0           13188     0            31 ms
freeling_tagging                                        Finished  0           13188     5            4.2 s
postagger_to_xces_converter                             Finished  0           13188     0            4.1 s

Machine Resources: handling parallel requests

Parallelization level 10 (10 parallel requests per service * 2 services = 20 concurrent requests)

Workflow name: Freeling_tagging_for_crawled_data_with_output_download
Workflow file: massive_freeling_for_crawled_data_v11_download.t2flow
myexp url: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench

VM            Cores  RAM  HD
iula04 (UPF)  4      8    40GB (SAS)

WS   parall.  poll. int.  poll. backoff  poll. max int.  retries  ini. delay  max     factor
WS1  10       2000        1              10000           2        5000        150000  20
WS2  10       2000        1              10000           2        5000        150000  20
WS1: python_preprocess + freeling_tagging + python_postprocessing
WS2: postagger_to_xces_converter

corpus  list file               urls   Tokens  url example
MCv2    LAB_ES_list.sorted.txt  13188  61 M    http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml

Name                                                    Status    Queued it.  It. done  It. w/error  Average time/it.
Freeling_tagging_for_crawled_data_with_output_download  Finished  -           -         -            2.2 h
download_dataUrl                                        Finished  0           13188     0            29 ms
freeling_tagging                                        Finished  0           13188     5            5.9 s
postagger_to_xces_converter                             Finished  0           13188     0            4.8 s
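
The two runs can be compared with a little arithmetic. The sketch below uses only figures reported on these slides (13188 documents, 5.2 h at parallelization level 3, 2.2 h at level 10). The derived quantities and their names are illustrative, not results stated by the authors.

```python
DOCUMENTS = 13188                 # URLs processed in each run
HOURS_BY_LEVEL = {3: 5.2, 10: 2.2}  # parallelization level -> total run time


def throughput(documents, hours):
    """Documents processed per hour."""
    return documents / hours


docs_per_hour_3 = throughput(DOCUMENTS, HOURS_BY_LEVEL[3])
docs_per_hour_10 = throughput(DOCUMENTS, HOURS_BY_LEVEL[10])

# How much faster the level-10 run finished than the level-3 run.
speedup = HOURS_BY_LEVEL[3] / HOURS_BY_LEVEL[10]

# Share of the extra parallelism (10/3) that showed up as actual speedup.
efficiency = speedup / (10 / 3)

print(f"level 3:  {docs_per_hour_3:.0f} documents/hour")
print(f"level 10: {docs_per_hour_10:.0f} documents/hour")
print(f"speedup: {speedup:.2f}x, relative efficiency: {efficiency:.2f}")
```

The run finishes about 2.4 times faster with 3.3 times more parallel requests on the same 4-core machine, while the average time per freeling_tagging call rises from 4.2 s to 5.9 s. This is consistent with the slides' point that further gains depend on machine resources.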

Machine Resources: handling parallel requests

From 1x to 10x experiment
http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_parallelization_scalability_v01.mp4

Two Taverna instances running at the same time
100 documents to be processed
1 workflow with NO parallelization / the other with 10x
The same server: ws04 with 8GB RAM and 4 CPUs
More resources > more parallel requests

Machine Resources: handling parallel requests

Conclusions:
  PANACEA fulfils the large data scalability goal (Scalability)
Requirements:
  Robust WS deployment: Soaplab (with PANACEA improvements) or other robust frameworks.
  Taverna 2.4
  Workflow design must follow the PANACEA massive data tutorial (retries, polling, etc.)
The architecture is highly scalable: growth is just a matter of resources

Statistics

Typical PANACEA server:
  2 - 4 cores
  4 - 8 GB RAM
  30 - 100 GB HDD
  100 Freeling WS parallel requests
EMBL-EBI (European Bioinformatics Institute in Cambridge):
  200 servers
  2000 cores
  Server requests balancing software, etc.
  More than 50000 Freeling WS parallel requests

WP7 Evaluation


Conclusions

Functional platform
  Web services software
  Registry / myExperiment
  Usability for users and providers
Interoperability:
  Data formats
  Common Interfaces
  Tutorials and Documentation
Scalability

The future

Authentication Web Services (Business opportunity)
  Institutions and companies can sell their services and/or machine resources
Automatically build workflows (Usability and interoperability)
  Based on input data and user desired output, etc.
Data Visualization tools / Widgets (Usability)
Improve total throughput (Scalability)
  With more machine resources we can achieve faster experiment results
  Software optimization: task splitting and parallelization
Publications with experiments (Research)
  Researchers could link their publications to real experiments (WS, workflows, data, etc.)
  Fostering research by making experiments easily replicable
  Improved experiments: more data, more machine resources, faster results, etc.

Thank you

Questions?