Holy Quran-Based Arabic

Oct 20, 2013

Holy Quran-Based Arabic Text to Speech

توليف النص إلى كلام ملفوظ استناداً للقرآن الكريم (Synthesis of text into spoken speech based on the Holy Quran)

Supervisor: Assistant Professor Dr. Radwan Tahboub

Co-supervisor: Associate Professor Dr. Labib Arafeh

Bana Akram Al-Sharif

April, 2013


Holy Quran-Based Arabic Text to Speech

Text to Speech Synthesizer
Arabic Text to Speech
Holy Quran-Based Arabic Text to Speech

Problem Statement

For any diacritized Arabic text, output its spoken rendition with maximal naturalness and intelligibility.

Arabic TTS Introduction

Why and Where TTS?

Why?

Speech is the most natural and widespread form of human communication. A high-quality text-to-speech synthesis system offers several advantages, yet the methods available so far have not given good results [12].


Where?

Language education [1][5][11]
Talking books and toys [1][5][11]
Aid for persons with disabilities [1][5][11]
Telecommunication services, relay services [1][5]
Vocal monitoring [5]
Multimedia, man-machine communication [5]

What is TTS?

[Figure: Text -> Text to Speech System -> Speech]

What is TTS? (cont.)

A text-to-speech synthesizer is a computer-based system that should be able to read any text aloud, whether the text was entered directly into the computer or obtained through an optical character recognition (OCR) system. The speech should be intelligible and natural. [2]

TTS can be defined as the automatic production of speech through a grapheme-to-phoneme transcription of the sentences to utter. [12]

Typical TTS Components

Two components:
Natural Language Processing module (NLP)
Digital Signal Processing module (DSP)

[Figure: Text to Speech Synthesizer. Text -> Natural Language Processing (NLP) -> phonemes and prosody -> Digital Signal Processing (DSP) -> Speech]

Natural Language Processing Module

[Figure: Text -> Text Analysis (Pre-Processor, Morphological Analyzer, Contextual Analyzer, Syntactic-Prosodic Parser) -> Letter-to-Sound Module -> Prosody Generator -> symbolic linguistic units -> to DSP module]

Natural Language Processing Module (cont.)

Grapheme-to-phoneme conversion refers to the task of finding the pronunciation of a word given its written form.

Letter-to-sound (grapheme-to-phoneme) transcription methods:
Dictionary-based methods [1]
Rule-based methods [1]
Data-driven transcription [5]

NLP: Dictionary-based methods

A large dictionary containing all the words of a language and their pronunciations is stored by the program (or only the morphemes, to reduce the size of the dictionary). [5]

Each word is looked up in the dictionary by its spelling and replaced by its pronunciation.

For Arabic, Al-Qamous Al-Muhiet has over 250,000 entries, and most roots have 10-11 inflected forms [5], so the dictionary-based method is not the best choice.

Advantages:
Quick
Accurate
Suitable for languages with irregular spelling (e.g. English)
Improves intonation and naturalness

Drawbacks:
Fails completely if a word is not found in the dictionary
Requires a large amount of memory
Insufficient for inflected languages (e.g. Arabic)

NLP: Rule-based methods

Pronunciation rules are applied to words to determine their pronunciations from their spelling.

These are expert-knowledge-based systems.

Diacritized Arabic is considered to have a relatively simple (regular) spelling system [5], so a rule-based transcription system with a dictionary for exceptional words is a good choice.

Advantages:
Works for any word (no complete failure)
Saves memory
Suitable for languages with regular spelling and pronunciation (e.g. Arabic)

Drawbacks:
Needs well-defined rules that cover all words
Needs a dictionary for irregular words
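The hybrid design described for diacritized Arabic (an exceptions dictionary consulted first, letter-to-sound rules otherwise) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `transcribe`, the tiny consonant table, and the Latin output symbols are assumptions made for the example and are not the rule set of any system cited in these slides. The single exceptions entry (هذا pronounced ha:Da:) is the one the slides themselves use.

```python
# Toy letter-to-sound tables (hypothetical, far from a complete Arabic rule set).
CONSONANTS = {"ك": "k", "ت": "t", "ب": "b", "د": "d", "ر": "r", "س": "s"}
SHORT_VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}  # fatha, damma, kasra
SHADDA = "\u0651"  # gemination mark
SUKUN = "\u0652"   # "no vowel" mark
MARKS = set(SHORT_VOWELS) | {SHADDA, SUKUN}

# Words whose spelling does not follow the rules are looked up directly.
EXCEPTIONS = {"هذا": "ha:Da:"}


def transcribe(word):
    """Return a phoneme string for one diacritized word."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phonemes = []
    i = 0
    while i < len(word):
        letter = word[i]
        if letter not in CONSONANTS:
            raise ValueError(f"no rule for character {letter!r}")
        i += 1
        # Collect the diacritics attached to this letter, in any order.
        marks = []
        while i < len(word) and word[i] in MARKS:
            marks.append(word[i])
            i += 1
        sound = CONSONANTS[letter]
        # Shadda doubles the consonant; sukun adds nothing.
        phonemes.append(sound * 2 if SHADDA in marks else sound)
        for mark in marks:
            if mark in SHORT_VOWELS:
                phonemes.append(SHORT_VOWELS[mark])
    return "".join(phonemes)
```

Because the rules run for any spelling the tables cover, an unknown word still gets a transcription, while the dictionary stays small since it only holds the exceptions.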

Example sentence: هذا من مسؤوليات دافع الضرائب ("This is one of the taxpayer's responsibilities")

Preprocessing
Tokenize = هذا \ من \ مسؤوليات \ دافع \ الضرائب
Sentence type: check for < > ! , ? ; here = 'statement'
Replace abbreviation = none

Text Analysis (exceptional words database)
Replace هذا -> هاذا
Result = هاذا \ من \ مسؤوليات \ دافع \ الضرائب

Letter to Sound Rules (grapheme to phoneme)
- Rule 1, separating alif (الف التفريق) = none
- Rule 2, long vowel followed by ال (حرف مد يليه ال) = none
- Rule 3, taa marbuta (التاء المربوطة) = none
- Rule 4, shadda (الشدة): الضرائب = الضْضَرائب, مّ = مْم, مسؤوليات = مسؤوليْيَات
- Rule 5, alif maqsura (الالف المقصورة) = none
- Rule 6, tanwin (التنوين) = none
- Rule 7, nun with sukun (النون الساكنة): من مسؤوليات = ممْمَسْؤوليات
- Rule 8, iqlab (الاقلاب)
- Rule 9, sukun on the last letter of the sentence (تسكين آخر حرف بالجملة): الضرائبْ
- Rule 10, ضْ followed by ط = none
- Rule 11, دْ followed by ت = none
- Rule 12, two adjacent sukuns or missing diacritics (التقاء ساكنين او عدم التشكيل) = none
- Rule 13, sun-letter ال (ال الشمسية): دافع الضضرائب => دافع ضضرائب (because it is in the middle of the sentence)
- Rule 14, moon-letter ال (ال القمرية) = none
- Rule 15, ضْ followed by ت = none
- Rule 16, آ = none
- Rule 17, hamzat al-wasl (همزة الوصل) = none
- Rule 18, emphatic and non-emphatic sounds (التفخيم و الترقيق) = ض, ر, ا marked with #

Syllabification Process
cvvcvv : cvccvccvvcvccvvcv / cvvcvcv / ccvcvvcvc
هاذا / ممْمَسْؤوليْيَات / دافع / ضْضَرائب

Stress Rule, Prosody Generator
cvv(WS) | cvv(PS) : cvc(PS) : cvc(WS) ...

Synthesized Speech
ha:Da: min mas?u:li:a:ti da:fi?'i d'#d'#a#r#a:#?ib

ATTSIP [11]

NLP: Data-driven transcription

Pronunciation by Analogy (PbA): a "statistical method based on stochastic theory and nearest neighbor, based on Neural Networks" [5]. Its idea is to determine the pronunciation of a novel word from similar parts of known words and their pronunciations.

Trained neural networks: a multilayer perceptron (MLP) trained with backpropagation. The model is language independent, but it requires handling of grapheme clusters and syntactic features. [1][6]

Terms

Word: cat, هذا
Phonemic transcription: /kæt/, /هاذا/
Phonetic transcription: [kʰæt], [ha:Da:]
Phoneme: a distinctive sound in a language.
Allophone: one of a set of possible spoken phones used to pronounce a single phoneme. E.g. for p: [p] in spend, [pʰ] in pin [8]. In Arabic, ر can be emphatic or non-emphatic (ر مفخمة او مرققة).
Phone: a speech sound [11]
Diphone: a unit that begins in the middle of the stable state of a phone and ends in the middle of the following one [1]
Syllable: short [CV], e.g. [ب]; long [CVV], e.g. [با]; [CVVC], e.g. [باب] [11]
Half-syllable
Triphone: a unit with a complete central phone [1]

Digital Signal Processing Module

DSP classification groups: [3]

Articulatory synthesis: a complex method that models the human speech production system.

Concatenative synthesis: uses prerecorded samples of different lengths from one speaker. Memory is an issue.

Formant synthesis: controls all parameters in detail to model the transfer function of the vocal tract, based on the source-filter model.


Speech synthesis systems classification

Concatenation synthesizers: based on the concatenation (stringing together) of segments of recorded speech [4]

Formant synthesizers: controlled by rules; they do not use human sound [1][4]

DSP: Concatenation

Constructing the database of prerecorded segments for Arabic means determining all possible units in the Arabic language. The choice of unit may be [7]:
Words!
Diphones
Di-diphones: taken from the middle of a syllable, from the middle portion of the vowels. [5]
Phones

Record the corpus.

Segmentation and annotation: the recorded database must be prepared for the selection method [7]

Concatenative synthesizers produce more natural speech than formant synthesizers, but need a very large database.

DSP: Concatenation (cont.)

[Figure: the word قال is split into the units قا and ل, which are fetched from the database; integration and smoothing then produce قال]
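The "integration and smoothing" step can be illustrated with a short linear cross-fade between two recorded units. This is a minimal sketch under stated assumptions: the slides do not say which smoothing method is used, so the cross-fade, the function names `crossfade_concat` and `synthesize`, and the database keys used in the usage note are all hypothetical. Signals are plain Python lists of samples.

```python
def crossfade_concat(first, second, overlap):
    """Join two sample lists, blending `overlap` samples at the seam."""
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both segments")
    if overlap == 0:
        return list(first) + list(second)
    head = list(first[:-overlap])   # untouched part of the first unit
    tail = list(second[overlap:])   # untouched part of the second unit
    blended = []
    for k in range(overlap):
        # Weight moves from the first unit toward the second across the seam.
        w = (k + 1) / (overlap + 1)
        a = first[len(first) - overlap + k]
        b = second[k]
        blended.append((1.0 - w) * a + w * b)
    return head + blended + tail


def synthesize(unit_names, database, overlap):
    """Fetch each unit from the database and join them left to right."""
    signal = list(database[unit_names[0]])
    for name in unit_names[1:]:
        signal = crossfade_concat(signal, database[name], overlap)
    return signal
```

For the word in the figure, a call such as `synthesize(["qa:", "l"], database, overlap)` would join the two stored units, assuming the database maps those (made-up) unit labels to sample lists.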

DSP: Formant synthesizers

Also known as rule-based synthesizers.

They need thorough, generative knowledge of the phonation mechanism and of natural speech characteristics [1]

Rule-based synthesizers describe speech as the dynamic evolution of up to 60 parameters [1]

The large number of (coupled) parameters complicates the analysis and may produce analysis errors.

Formant frequencies and bandwidths should therefore be estimated from speech data.

They can switch from one voice to another and can change speaking style (prosodic changes), but rules that introduce a high degree of naturalness are still to be discovered.

Holy Quran-Based Arabic Text to Speech

Challenges

Text normalization
Text-to-phoneme challenges
Prosodic and emotional content
Evaluation: no universally agreed objective evaluation criteria
Limited-resource (embedded device) requirements

Arabic Language

Diacritized Arabic is considered a language with regular spelling in comparison with French and English.

Why Holy Quran?

Taking into account:
Intonation and rhythm
Prosody
Evaluation
Lack of standardization in testing and assessment

The Holy Quran can address these problems: it has rules for reading and pronunciation (أحكام التجويد, the Tajweed rules), no regional intonation, and measures for evaluation, and it is a deterministic, limited standard in itself.

The hypothesis: whoever reads the Holy Quran correctly can read any Arabic text correctly (من يقرأ القران الكريم قراءة سليمة، يستطيع قراءة أي نص عربي بطريقة سليمة).

Thesis Scope

A system trained on the Holy Quran (with Tajweed) may increase the naturalness of Arabic text-to-speech for the Holy Quran. This will be checked using dynamic extraction of features from the voice together with the text. Our main aim is to increase the naturalness of the synthesized speech by increasing the features elicited from a human (Quran reader or Sheikh) voice. This methodology can then be used and tested with Standard Arabic (SA) texts.

Working plan

1- Implement ATTSIP [11] (Arabic Text-To-Speech Including Prosody) using its static rules. The output will include the prosody, the stress level, and the syllables, depending on the sukun mark ( ْ ).

Working plan

2- Implement the same ATTSIP using dynamic feature extraction. The idea is to compare the static rule-based system with the dynamic feature-extraction system.

a. Learning/training stage: we may use neural networks with backpropagation.

b. Application and testing stage: we will test the sentences used in stage 1 at the testing level. In addition, we will test extra examples/sentences, which are also used in stage 1. We will compare the quality and correctness of the output of this stage with the feature-annotated phonetic-level text obtained from stage 1. We predict a decrease in quality for the ATTSIP testing samples and an increase in quality for the other sentences; we will then calculate the overall quality of the system. (contribution 1)

Working plan

3- Design and implement our Holy Quran-based text-to-voice system. The training input is a sample text from the Holy Quran together with an audio file. The voice is then processed to its text and entered into the neural networks with backpropagation to extract features from the text and audio. We will then test the design with all the sentences used in stage 1 again and compare the three designs. (contribution 2)

The quality of TTS output is judged on two parameters: naturalness (how similar the output is to a human voice) and intelligibility (how easy the output is to understand). The parameters we use will be both tangible/measurable, such as the number of faults, and intangible, such as naturalness.

Our Approach (1)

[Figure, training phase: Quran audio sample and Quran text sample -> voice to text (parameter/feature extraction) -> dynamic analysis tool (DNN, HMM, or other)]

Our Approach (2)

[Figure, working phase: diacritized plain text or Quran text -> preprocessing -> text analysis (with database for exceptional words) -> letter-to-sound rules -> syllabification -> stress rule and prosody generator -> phonetic level (symbolic linguistic units) -> dynamic analysis tool (trained engine, natural features) -> digital signal processing -> synthesized speech]

References

[1] "An Introduction to Text to Speech Synthesis", book, Springer, 2001
[2] "An Initial Comparative Study of Arabic Speech Synthesis Engines in iOS and Android", Nora B. Al-Saud and Hind. ACM 411, Saudi Arabia, 2012
[3] "A Prototype of an Arabic Diphone Speech Synthesizer in Festival", Maria Moutran Assaf, Master's thesis, 2005
[4] "An Initial Comparative Study of Arabic Speech Synthesis Engines in iOS and Android", Nora B. Al-Saud and Hind. ACM 411, Saudi Arabia, 2012
[5] "Phonetization of Arabic: rules and algorithms", El-Imam Yousif, ScienceDirect, 2003
[6] "Diphone-Based Arabic Speech Synthesizer for Limited Resources Systems", Nuha Odeh, AlQuds University, Fall 2012, 2013
[7] "Di-Diphone Arabic Speech Synthesis Concatenation", Abdelkader Chabchoub et al., International Journal of Computers & Technology, 2012
[8] http://www.azlifa.com/pp-lecture-8/ accessed on 20/4/2013
[9] "A Text to Speech System for Arabic Using Neural Networks", Sassi S. et al., IEEE, 1999
[10] "Neural Speech Synthesis System for Arabic Language using CELP Algorithm", Sassi S. et al., IEEE, 2001
[11] "Arabic Text-To-Speech Including Prosody (ATTSIP) for Mobile Devices", Ahmed Ismail Elothmany, AlQuds University, 2013
[12] "Neural Speech Synthesis for Arabic Language using CELP Algorithm", Sassi S. et al., IEEE, 2001

Thanks for listening

Questions, feedback, or notes?

Arabic NLP Appendix

Arabic Text-To-Speech Including Prosody (ATTSIP)

Pre-Processing Module

[Figure: Pre-Processing Module]

Pre-Processing Module (cont.)

[Figure: Pre-Processing Module, continued]

Text Analysis Module

This module handles exceptional words, to which the letter-to-sound rules cannot be applied to obtain their phonemes.

Letter-To-Sound Rules (example)

If the definite article "ال" is preceded by a long vowel, this long vowel is replaced by a short vowel.

If the word ends with taa marbuta "ة", there are two situations:
If the word is in the middle of the sentence, the taa marbuta is replaced by the letter taa.
If the word is in the last position of the sentence, the taa marbuta is replaced by the letter haa.
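The taa marbuta rule above can be sketched as a small token-level function. This is an illustrative sketch only: the function name `replace_taa_marbuta` and the choice to keep any trailing diacritics unchanged are assumptions for the example, not details taken from ATTSIP.

```python
TAA_MARBUTA = "\u0629"  # ة
TAA = "\u062A"          # ت
HAA = "\u0647"          # ه
# Tanwin, short vowels, shadda and sukun (U+064B through U+0652).
DIACRITICS = "".join(chr(code) for code in range(0x064B, 0x0653))


def replace_taa_marbuta(tokens):
    """Apply the taa marbuta rule to a tokenized sentence.

    A final taa marbuta becomes taa inside the sentence and haa on the
    last word. Diacritics that follow the letter are kept as they are
    (an assumption of this sketch).
    """
    result = []
    last_index = len(tokens) - 1
    for index, word in enumerate(tokens):
        base = word.rstrip(DIACRITICS)   # word without trailing marks
        marks = word[len(base):]         # the trailing marks themselves
        if base.endswith(TAA_MARBUTA):
            replacement = HAA if index == last_index else TAA
            word = base[:-1] + replacement + marks
        result.append(word)
    return result
```

Words that do not end in taa marbuta pass through untouched, so the function can be run over every sentence as one step of the letter-to-sound stage.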

What is prosody?

Prosody is the study of the tune and rhythm of speech and how these features contribute to meaning.

Prosody may reflect various features of the speaker or the utterance:
Whether the utterance is a statement, a question, or a command.
Whether the speaker is being ironic or sarcastic; emphasis, contrast, and focus.

Syllabification Process

To include prosody, it is not only necessary to convert the letters into phonemes, but also to syllabify the word and to assign word stress.

The allowed syllables in Arabic are presented in the following table, where [V] indicates a short vowel, [VV] a long vowel, and [C] a consonant (Zeki, KhAlifa, Naji, 2010).

Rules for Syllabification

Each geminated letter ends one syllable and starts a new one.

Syllables never begin with vowels.

Each syllable contains exactly one short or one long vowel.

Sukun always signifies the end of a syllable.

The input to the syllabification module is an array of tokens containing the words that form the sentence, and the output is a string of syllables of the form (cv cvv cv cvvc, ... etc.), where "c" refers to a consonant, "v" to a short vowel, and "vv" to a long vowel.
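The syllabification rules can be sketched on a consonant/vowel skeleton such as `cvvcvcv`. This is a minimal sketch, assuming the word has already been reduced to a string of `c` and `v` characters; the function name `syllabify` and the regular expression are illustrative choices, not the ATTSIP implementation.

```python
import re

# One syllable: an onset consonant, one short (v) or long (vv) vowel, then
# any consonants that are NOT followed by a vowel (those belong to the coda).
_SYLLABLE = re.compile(r"cvv?(?:c(?!v))*")


def syllabify(cv_skeleton):
    """Split a c/v skeleton into syllables such as cv, cvv, cvc, cvvc."""
    syllables = _SYLLABLE.findall(cv_skeleton)
    # Reject skeletons that break the rules, e.g. one starting with a vowel.
    if "".join(syllables) != cv_skeleton:
        raise ValueError(f"not a valid skeleton: {cv_skeleton!r}")
    return syllables
```

A consonant followed by a vowel always opens a new syllable, which is how the sketch honours both "syllables never begin with vowels" and "sukun ends a syllable": a vowelless consonant stays with the syllable before it.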

Syllable combinations

If the syllable length is 3, the syllable is one of (cvc, cvv) only; if the syllable length is 4, it can have one of the forms (cvvc, cvcc, cv:cv), and so forth.

Putting stress degree between syllables

The position and distribution of stress depend on the number and types of syllables in the word. The rules that govern its place are defined as follows (Chentir, Guerti, and Hirst, 2009):

If a word consists of a sequence of short CV syllables, the first syllable gets the main stress and the rest of the syllables get weak stress.

If a word has one long syllable and the other syllables are short, the long syllable gets the main stress and the others get weak stress.

If a word contains two or more long syllables, the long syllable nearest to the end gets the main stress, the one in the middle takes the secondary stress, and the first long syllable is the weakest.
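The three stress rules can be sketched as a function over a list of syllable patterns. This is a hedged sketch, not the cited authors' algorithm: the name `assign_stress` and the labels `"primary"`, `"secondary"`, and `"weak"` are assumptions, every syllable other than `cv` is treated as long, and the third rule is read as "last long syllable primary, long syllables between the first and last secondary, first long syllable weak".

```python
def assign_stress(syllables):
    """Label each syllable pattern (e.g. "cv", "cvv") with a stress level."""
    stress = ["weak"] * len(syllables)
    if not syllables:
        return stress
    # Assumption: any syllable that is not a plain short "cv" counts as long.
    long_positions = [i for i, s in enumerate(syllables) if s != "cv"]
    if not long_positions:
        # Rule 1: only short syllables, so the first one carries main stress.
        stress[0] = "primary"
        return stress
    # Rule 3 (middle part): long syllables strictly between first and last.
    for i in long_positions[1:-1]:
        stress[i] = "secondary"
    # Rules 2 and 3: the long syllable nearest the end carries main stress.
    stress[long_positions[-1]] = "primary"
    return stress
```

With two long syllables the first stays weak and the last is primary, which is one reading of the rule as worded on the slide; other readings would only change the handling of `long_positions`.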

The Speech Assessment Methods Phonetic Alphabet (SAMPA) for Arabic

Phonetic transcription for diacritics

The framework

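Text to Speech Synthesizer

[Figure: text-to-speech synthesizers by language (English, French, Mandarin, Japanese, German, Bodo, Arabic). Arabic is divided into MSA (الفصحى) and spoken Arabic, and the synthesizer into Natural Language Processing and Digital Signal Processing; a "We Are Here" marker shows the position of this work.]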
AUTOMATICALLY CLUSTERING SIMILAR UNITS FOR UNIT SELECTION IN SPEECH SYNTHESIS [1997]

Black A. et al. introduce a new method in the TTS domain. They focus on the concatenation method in the Digital Signal Processing (DSP) module. The method automatically clusters sub-words (as they name them), which can be uniform or non-uniform units such as diphones, phones, or any other unit. These sub-words are grouped (clustered) in the database based on their phonetic and prosodic context. With this clustering, the appropriate unit, defined by its label, is selected by finding the optimal path through the database tree and pruning non-promising paths. They argue that applying this method to a full TTS system increases the efficiency of producing natural output speech.

For our model, as far as my limited knowledge of neural networks goes, the generalization ability of a DNN should replace this method. In this thesis we are not concentrating on the algorithm that selects the ".wav" files from the database. We want to produce the phonetic units with as many features as possible to increase naturalness. I wish to discuss this with my supervisor and Dr. Hashem.

AUTOMATICALLY CLUSTERING SIMILAR UNITS FOR UNIT SELECTION IN SPEECH SYNTHESIS

[Figure: Text to Speech Synthesizer. Text -> Natural Language Processing (NLP) -> phonetics and prosody -> Digital Signal Processing (DSP) -> Speech. The word دافِعِ (da:fi?'i) is split into the units دا (da) and فِعِ (fi?'i), whose .wav segments are fetched from the database (da, la, ma, ... etc.) and joined by integration and smoothing. Following Black A. et al., 1997, the clustered database depends on phonetic and prosodic context.]