Language Technologies for Multilingual Societies


META-FORUM 2011, June 27/28, 2011, Budapest, Hungary

Swaran Lata


Director & Head, Technology Development for Indian Languages Programme

& Country Manager, W3C India

Govt. of India

6 CGO Complex, Lodi Road, New Delhi 110 003

Meta forum 2011


Diverse Multilinguality in India and its Complexity


Organization of my talk

Why and How TDIL Programme got initiated
Important Milestones
Technology Development
Multilingual Standards
Proliferations
Lessons Learnt
Problems Arising out of Multilingualism
Funding vs. Long-term Goals
Potential for Collaboration


Multilingual and Multicultural India

Constitution of India (8th Schedule covers 22 Indian languages)
  Emphasis on planned development of Indian languages for use in all spheres of life.
  Development and use of Indian languages in all domains of national life to maintain linguistic and cultural diversity
  Development of sustainable technologies to break linguistic barriers across diverse speech communities
  Provide equal opportunities to citizens through the use of Information Technology

Official Languages Act 1963
  Hindi as Official Language of the Republic of India
  15 Indian Languages (ILs) in 8th Schedule
  3 ILs added in 1992 (Konkani, Manipuri and Nepali)
  4 ILs added in 2003 (Bodo, Maithili, Dogri and Santali)

Why and How TDIL Programme got initiated

DoE (1976); MIT (Year 2000); MCIT = DIT + DoT (Year 2002)
  DoE: Department of Electronics
  MIT: Ministry of Information Technology
  MCIT: Ministry of Communication & Information Technology
  DIT: Department of Information Technology
  DoT: Department of Telecommunications

Technology Development Council (TDC) 1988-1991
  Funded project for development of Devanagari Graphics and Intelligence based Script Technology (GIST) UNIX terminal at IIT Kanpur
    Exploiting phonetic correspondence of Indian languages
    GIST extended to other Indian languages
  GIST Card (PC add-on card) developed at CDAC Pune (Society set up in 1988)
  Indian Standard Code for Information Interchange (ISCII)
    BIS: 13194 (1991)
    8-bit encoding and keyboard layout standard covering 15 languages.
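The phonetic correspondence that GIST and ISCII exploited carries over to Unicode, whose Indic blocks were modelled on the ISCII layout, so corresponding letters mostly sit at the same offset inside each 128-code-point block. The following is a minimal Python sketch under that assumption; the function name and block table are illustrative, and this is a rough script shift, not a full transliterator (letters without a counterpart in the target script are left unchanged).

```python
import unicodedata

# Base code point of each 128-code-point Unicode Indic block.
BLOCK_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def shift_script(text: str, source: str, target: str) -> str:
    """Map each character to the same offset in the target script's block."""
    src, dst = BLOCK_BASE[source], BLOCK_BASE[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            candidate = chr(cp - src + dst)
            # Keep the original character if the target slot is unassigned.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Devanagari "Bharat" shifted into the Bengali block.
    print(shift_script("\u092d\u093e\u0930\u0924", "devanagari", "bengali"))
```

The offset trick only works because the blocks share one layout; a production converter would still need per-script exception tables.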

Technology Development for Indian Languages (TDIL) Programme: Milestones

Timeline markers: 1995, 2000, 2005, 2009, 2011

Seeding Phase
  PoC research in Hindi and monolingual corpora building
Capacity Building Phase
  Set up Resource Centres in each state
  Mentoring through existing projects
Multilingual Technology Development
  Consortium mode
  Multiple institutional projects in MT, OCR, OHWR, CLIA & Speech
  Multilingual resources development based on standards
  Free BIPKs for 22 ILs
Future Roadmap
  Major thrust on research in Speech and Mobile area
  Productization efforts
  Standards for Multilingual Web
  Addressing language-specific bottlenecks
  Localization initiatives

Growth of Language Technology Research Institutions

[Chart; data labels: 8, 85, 13, 31]

Machine Translation System [1995-2010 Consolidation]

Beta Deployments:
  English to Hindi Machine Translation System has been deployed in Parliament for machine translation of the Parliament proceedings.
    Matching efforts in integrating the MT system into organizational workflow & training of the staff
    Improvement in quality and speed of translation service
  English to Indian Languages Machine Translation System in 3 Indian languages (Hindi, Bengali, Malayalam) to translate the voluminous course material of the Vocational Training Programme:
    Reduces cost of translation by 30%
    Saves human effort by more than 50%

Machine Translation Systems: English to Indian Languages, 8 language pairs

The Machine Translation Systems have been made available through the TDIL Data Centre (http://www.tdil-dc.in) for feedback and improvement through crowdsourcing.

Machine Translation Systems: Indian Languages to Indian Languages, 6 language pairs

Cross-lingual Information Access [since 2006]

Across six Indian languages: Hindi, Marathi, Bengali, Punjabi, Tamil and Telugu; Tourism domain
Index-based searching based on pre-processing of Indian language query [precision@5 = 0.4 to 0.5].
UNL-based search tried in Tamil to compare the efficacy [precision: index-based search = 0.42; UNL-based search = 0.59].
Next 3 years target:
  Enhance precision to 0.7
  Addition of 3 languages [Assamese, Odia, Gujarati]
  Beta trial proposed on existing search engine.
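For readers unfamiliar with the precision@5 figures quoted for cross-lingual search, the metric is the fraction of the top five returned documents that are relevant. A minimal Python sketch follows; the document IDs and relevance judgments are made up for illustration and are not TDIL evaluation data.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes result lists shorter than k.
    return hits / k


if __name__ == "__main__":
    ranked = ["d7", "d2", "d9", "d4", "d1"]   # hypothetical system output
    relevant = {"d2", "d4", "d8"}             # hypothetical judgments
    print(precision_at_k(ranked, relevant, k=5))  # 2 hits out of 5
```

On this scale, the stated target of 0.7 means roughly seven relevant documents in every ten shown across the top-five lists.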

Optical Character Recognition [since 2006]

11 Indian scripts
Accuracy: character level 97%; word level 80-85%
Working on printed documents between 1960-2000
Response time: 3-4 minutes
Next 3 years target:
  Word level > 90%
  Handling bilingual documents [IL + English]
  Multi-column layout support
  Post-correction tools
  Braille interface development and deployment for Indian language book publishing
  Online OCR service through TDIL Data Centre
  Deployment at a historical library
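The gap between the character-level and word-level OCR figures is roughly what one would expect if character errors were independent: a word is correct only when every character in it is. The Python sketch below shows that back-of-the-envelope arithmetic; the independence assumption and the word lengths are illustrative, not something the OCR project reported.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that all characters of a word are right, assuming
    independent character errors (a simplifying assumption)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return char_accuracy ** word_length


if __name__ == "__main__":
    for length in (5, 6, 7):
        print(length, round(word_accuracy(0.97, length), 3))
```

Under this model, 97% character accuracy gives about 81-86% word accuracy for words of five to seven characters, and the > 90% word-level target would need character accuracy above roughly 98% for such words.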

Online Handwriting Recognition System [OHWR], since 2006

Across six Indian languages: Devanagari, Kannada, Malayalam, Bengali, Tamil and Telugu
SDK developed
  Stroke level: 95%
  Character level: 84%
Census data collection stored as Unicode database
Next 3 years target:
  Achieve complete coverage of conjuncts & complex characters, Nukta characters
  Integration with TTS and deployment for the speech impaired
  Addition of new languages [Assamese, Urdu, Marathi, Manipuri, Bodo]
  Beta trial proposed on existing search engine.

Text-to-Speech in Indian Languages [since 2006]

Based on Festvox framework
TTS engine integrated with NVDA (Windows) and ORCA (Linux) screen readers
Mean Opinion Score: Hindi 3.2; Bengali, Marathi, Telugu, Tamil, Malayalam: ~3.0
Training of visually challenged persons on screen readers.
Next 3 years target:
  Improvement of MOS of TTS engine up to 3.8-4.0
  TTS engine for Indian languages for mobile Android platforms
  Addition of 5 new Indian languages: Odia, Gujarati, Assamese, Bodo
  Proof of concept for adaptation for one Hindi dialect.
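The Mean Opinion Score figures for the TTS engines are averages of listener ratings, conventionally on a 1 (bad) to 5 (excellent) scale. A minimal Python sketch of the calculation follows; the ratings are invented for illustration and the 1-5 scale is assumed, since the slides do not describe the listening-test protocol.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener rating; each rating is assumed to be on a 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


if __name__ == "__main__":
    hypothetical_ratings = [3, 4, 3, 3, 3]  # five listeners, one utterance
    print(mean_opinion_score(hypothetical_ratings))
```

Moving from about 3.2 to the 3.8-4.0 target means most listeners rating the voice "good" rather than "fair".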

ORCA Screen Reader integrated with IL TTS


Multilingual Standards: Multiple Stakeholders

Encoding: UNICODE, ISO
Web content, architecture and web-based services: W3C
Language tag, reference glyph set, keyboard: ISO
Locale data: UNICODE
Linguistic resources, tools and evaluation: ELRA, NIST, LDC
Internet protocol and domain name: ICANN, IANA, IETF, ISOC

Standardization Activity for Indian Languages

[Chart: Standardization Activity. Y axis: No. of languages/standards components; X axis: Year (1999-2012); series: ISCII, UNICODE, CLDR, W3C, ICANN]

Transition from ISCII to UNICODE
  UNICODE Standard adopted as default encoding standard for E-Gov applications
CLDR for 9 languages in CLDR 2.0
  Work for rest of Indian languages initiated
W3C work initiated in 5 areas: Internationalization, CSS, Mobile Web, E-Gov and Speech

UNICODE

UNICODE completed for 22 official Indian languages and Vedic Sanskrit (Unicode 6.0)
Devanagari, Bengali, Malayalam


Enabling of New Rupee Symbol in ICT environment [Govt. Notification in July 2010]

Encoding
  Included in Unicode 6.0: code point 20B9 [August 2010]
  Included in ISO 10646-1 [Oct 2010]
  Included in ISCII: notification issued by BIS
Keyboard
  Key combination: CTRL + ALT + 4 or AltGr + 4
  Consensus by all stakeholders and major industry players
  ISO-14442: notification issued by BIS [Dec 1, 2010]
  Software patches released by Microsoft, Redhat, C-DAC [April 2011]
Fonts
  Sakal-Bharti font for New Rupee Symbol
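The rupee symbol's code point 20B9 can be checked from any Unicode-aware environment. The short Python sketch below looks up the character's name in the interpreter's bundled Unicode database and shows its UTF-8 byte sequence; the helper name is illustrative.

```python
import unicodedata

RUPEE_SIGN = "\u20B9"  # code point cited on the slide


def describe(ch: str) -> dict:
    """Return the code point, Unicode name and UTF-8 bytes of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "utf8_hex": ch.encode("utf-8").hex(),
    }


if __name__ == "__main__":
    print(describe(RUPEE_SIGN))
```

Whether the glyph actually displays still depends on an installed font that covers U+20B9, which is why the font and patch releases listed above mattered.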

CLDR

Common Locale Data Repository completed in 9 Indian languages (included in CLDR 2.0)
Work for the rest of the Indian languages is in progress for their inclusion in the next version of CLDR
Most of the changes suggested by Govt. of India were accepted by the Unicode Consortium.
Screenshots of CLDR Hindi update
Screenshots of CLDR Bengali update

Web Standards: W3C

Columns: Standards; Work Initiated; Progress So Far

Cascading Style Sheets (CSS)
  Hindi listing submitted to W3C
  Akshara definition for Indic languages: requirements for text segmentation in the CSS specification
  Detailed testing of CSS 2.1 underway

Pronunciation Lexicon Specification (PLS) and Speech Synthesis Markup Language (SSML)
  Work initiated: reference phoneme set development; IPA verification in Indic languages; acoustic-phonetic analysis
  Progress so far: initiated for Hindi, Bengali, Punjabi; IPA verification for Bengali completed

Mobile Web
  Work initiated: gap analysis for Mobile Web in Indian languages; mobile fonts and rasterization engine in Indic languages; Mobile OK Checker
  Progress so far: proposed to work with Telecom Centres of Excellence in India and Mobile Industry Associations

E-Gov Best Practices
  Work initiated: internationalization best practices for Indic languages
  Progress so far: draft developed and under finalization.

Web Accessibility
  Work initiated: adoption of W3C WCAG 2.0 standard in India
  Progress so far: incorporation of WCAG 2.0 into National Electronic Accessibility Policy.
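The akshara requirement raised for CSS text segmentation can be illustrated with a small example: an akshara is an orthographic syllable, typically a consonant cluster joined by virama plus an optional vowel sign and modifier, and it should not be split when text is segmented. The Python sketch below is a simplified approximation for Devanagari only. The pattern is an assumption made for illustration, not the definition submitted to W3C, and it ignores cases such as ZWJ/ZWNJ and Vedic signs.

```python
import re

# Simplified Devanagari character classes (illustrative, not exhaustive).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"   # consonant + optional nukta
VIRAMA = "\u094D"
VOWEL_SIGN = "[\u093E-\u094C]"
MODIFIER = "[\u0900-\u0903]"                         # candrabindu, anusvara, visarga
INDEPENDENT_VOWEL = "[\u0904-\u0914]"

AKSHARA = re.compile(
    "(?:" + CONSONANT + VIRAMA + ")*" + CONSONANT
    + VOWEL_SIGN + "?" + VIRAMA + "?" + MODIFIER + "?"
    + "|" + INDEPENDENT_VOWEL + MODIFIER + "?"
    + "|.",                                          # anything else: one unit
    re.DOTALL,
)


def split_aksharas(text: str) -> list:
    """Split text into approximate aksharas (orthographic syllables)."""
    return AKSHARA.findall(text)


if __name__ == "__main__":
    # "kshatriya": conjuncts stay together with their vowel signs.
    print(split_aksharas("\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"))
```

Segmenting by code point instead would separate a vowel sign or virama from its consonant, which is the kind of break a style rule operating on single characters has to avoid in Indic text.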

Lessons Learnt

Language Resource Development
  Copyright issues
  Standardization of metadata and tag sets
  Language specificities
  Validation vs. time and cost investment
  Investment in semantic and syntactic resources like WordNet and treebanks, respectively
Language Independent Methodologies
  Core technology development engine identification
  Availability of researchers and scientific manpower
  Domain selection
  Limited technology institutions
Leadership Issues
  Computer science experts vs. linguistic experts
  Multi-institutional consortia project leadership
  Development plan vs. budget plan vs. national five-year plan
  Researchers in academics
Language Dependent Planning
  Language selection criteria
  Participation of State Language Departments
  Availability of institutions
  Availability of linguistic and language experts
Standardization Issues
  Development-level standards
  Third-party testing
  Software engineering practices
  Use case scenario
  Integration issues
Other Issues
  User involvement
  Limited deployments
  Models for proliferation
  Lab to pilot to commercial
  Divergent requirements of GenX and non-ICT communities

Problems Arising from Multilingualism

Multiple language speakers (native language, Hindi and English)
  English is the default language of official communication and higher education, and also a spoken language in urban and semi-urban areas
Orthographic complexity
  Tamil language having a smaller alphabet
  Conjunct and agglutination problem
  Reforms in orthography
Spoken language issues:
  Phonetic variation among Indian languages
  Variation of Hindi spoken in 7 to 8 states
  Dialect variation (Awadhi, Bhojpuri, Khadi Boli, Braj Bhasha etc.)
The paradigm shift to statistical approaches:
  Huge amount of speech corpora capturing dialect variation
  Parallel text corpora and other language resources
  Interfacing from multilingual language resources
  Cross-lingual access

Funding vs. Long Term Goals

TDIL Budget Expenditure (Rs. in Lakhs), plan-wise (1991-2012)
  Plan    Budget expenditure
  7th     141.20
  8th     800.00
  9th     2750.00
  10th    6419.71
  11th    7000.00

[Chart: TDIL budget expenditure (Rs. in Lakhs) by period: 1991-1995, 1995-2000, 2001-2005, 2005-2010, 2010-2012]
[Chart: TDIL budget expenditure (Rs. in Lakhs) by year, 1991-92 to 2011-12; annotations: Multilingual consortia, Social impact]
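For readers outside India, the budget figures are in lakhs (1 lakh = 100,000; 1 crore = 10,000,000). The Python sketch below converts the plan-wise amounts and computes plan-over-plan growth; it assumes the five values pair with the 7th to 11th plan labels in the order they are listed, which the extracted chart data suggests but does not state outright.

```python
LAKH = 100_000
CRORE = 10_000_000

# Plan-wise expenditure in Rs. lakhs, paired with labels in listed order
# (the pairing is an assumption based on the order of the chart data).
PLAN_BUDGET_LAKHS = {
    "7th": 141.20,
    "8th": 800.00,
    "9th": 2750.00,
    "10th": 6419.71,
    "11th": 7000.00,
}


def lakhs_to_rupees(amount_lakhs: float) -> float:
    return amount_lakhs * LAKH


def lakhs_to_crores(amount_lakhs: float) -> float:
    return amount_lakhs * LAKH / CRORE


def growth_factors(budget: dict) -> dict:
    """Ratio of each plan's expenditure to the previous plan's."""
    plans = list(budget)
    return {
        plans[i]: budget[plans[i]] / budget[plans[i - 1]]
        for i in range(1, len(plans))
    }


if __name__ == "__main__":
    for plan, lakhs in PLAN_BUDGET_LAKHS.items():
        print(plan, f"{lakhs_to_crores(lakhs):.2f} crore")
    print(growth_factors(PLAN_BUDGET_LAKHS))
```

Read this way, the steepest relative growth came early, while the latest plan period shows a much smaller increase over the one before it.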


The graphs suggest that optimal funding is available
Language activities have crossed the threshold
Next plan (12th): higher allocation of resources targeted
More language groups need to be funded in each state, with special focus on small language resources
  Multiple script issues

Time frame: future challenges for five years
  Replication of successful technology development for newer languages
  Improvement of language technologies:
    Improve accuracy to bring it to a usable level
    Productization efforts
    Porting efforts on mobile platforms
    Providing services on cloud-based platforms
  Strategies for social impact

Potential for Cooperation

Enhancement and adaptation of engines like Sphinx, Festival, HTS, Nutch, HarfBuzz, FreeType etc. to bring a paradigm shift in development from Latin-centric to multilingual-centric.
Pilot projects to try methodology applied for Indian languages to European languages and vice versa
  Angla-Bharati English to Indian languages MT framework may be tried for English to other European languages
Replicating European localization models for taking localization technologies to users in India.
Cross-lingual Information Retrieval between Indian languages and European languages.
Collaborative effort on speech technology development in Indo-EU languages
  New research frontiers in speech modeling, speech recognition grammar, phonetic search.
Speech enabling of mobile devices in Indo-EU languages, involving the mobile manufacturers, and innovative product development for mass market applications
Linguistic resource sharing for research purposes.
Language technology evaluation models in Indian language technology / product / solutions based on successful European models

Thanks & Questions

slata@mit.gov.in
91-11-24363525