40th Southeastern Symposium on System Theory, MC2.1
University of New Orleans
New Orleans, LA, USA, March 16-18, 2008

A New SVM-based Mix Audio Classification

Pejman Mowlaee Begzade Mahale1, Mahsa Rashidi2, Karim Faez3, Abolghasem Sayadiyan4
1,2,3,4 PhD student at Department of Electrical Engineering, Amirkabir University of Technology,
Dept. of Electrical Eng, 15875-4413, Hafez, Tehran, Iran
Emails:P Mowla(s kfaez aut.ac.ir;
Abstract - A preprocessing stage is inevitable in every speech/music application, including separation, recognition and transcription tasks, to determine which class each frame belongs to, namely: speech only, music only or speech/music mixture. Such classification can significantly decrease the computational burden due to exhaustive search, commonly introduced as a problem in model-based speech recognition or separation as well as music transcription scenarios. In this paper, we present a new method to separate mixed type audio frames based on Support Vector Machine (SVM). The challenging problem in this work is seeking the most appropriate features to discriminate these classes. As a result, we propose some novel features based on eigen-decomposition which present acceptable classification results. The experimental results show that the proposed system outperforms other classification systems including k Nearest Neighbor (k-NN) and Multi-Layer Perceptron (MLP).

Keywords - SVM, MLP, KNN, RBF, Eigen ratio.

1. INTRODUCTION

As the amount of data increases, efficient management of digital content is becoming more and more important. However, most of the indexing and labeling of the music is currently performed manually, which is obviously time-consuming and expensive. Content-based classification of audio data is an important problem for various applications such as overall analysis of audio-visual streams, boundary detection of video story segments, extraction of speech segments from video, and content-based video retrieval. Though the classification of audio into a single type such as music, speech, environmental sound and silence is well studied, classification of mixed type audio data, such as clips having speech with music as background, is still considered a challenging problem [1-3].

Audio classification is important for the following reasons: (a) different audio types should be processed differently and (b) the search space after classification is significantly reduced to a particular subclass during the retrieval process. Each classified audio piece will be individually processed and indexed to be suitable for efficient comparison and retrieval. For example, if an audio piece is speech, a speech recognition technique will be applied and the recognized spoken words will be indexed using the text information retrieval technique [1].

Many audio classification algorithms have been proposed in the literature based on different features and characteristics to classify most audio into speech, music and noise [2-3]. Many features are used including Cepstrum [4], Support Vector Machine (SVM) [5] and power spectrum and Zero-crossing rate (ZCR) proposed in [6] using SVM. Although different classification methods were compared, it is observed that the choice of features seemed to be more important than the choice of classifiers. The selected features have a much larger effect on the recognition accuracy than the selected classifiers. However, the SVM was observed to be the best classifier for musical genre recognition. Hence, to enhance efficiency, effort goes into finding features that separate the classes well, instead of using a complex classification model.

In this paper, we propose an audio classification scheme which categorizes audio based on a number of audio features. These features include the Eigen ratio, denoting harmonicity as well as gain jointly, silence ratio and zero crossing rate. The remainder of this paper is organized as follows. Section 2 briefly discusses features commonly used for audio classification. Section 3 briefly reviews the state-of-the-art classifiers used in audio classification. Section 4 presents some simulation and experimental results and Section 5 concludes.

2. FEATURES TYPE

The first step in any classification problem is to identify the features that are to be used for classification. Independent of which classifier is used, the choice of feature set plays a key role in classification performance. Hence, it is sometimes left to the brute power of an algorithm to decide which features are the best. In the following sections we review some common features used for audio classification.

2.1. Existing Features

The features typically used in audio classifiers are divided into physical and perceptual categories. Physical features are properties that correspond to physical quantities, such as fundamental frequency (F0), energy (gain in dB), ZCR and modulation rate [1]. Some other previous works have also used Mel Frequency Cepstral Coefficients (MFCC) features for music/speech only classification. However, in our scenario, a 3-class problem is considered instead of a 2-class one. The MFCC averages the spectral energy distribution in each subband and thus may produce similar average spectral characteristics for two different spectra related to the mixed class or to one of the music-only or speech-only classes.
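
As an illustration of one of the physical features listed in Section 2.1, the zero-crossing rate of a frame can be computed as the fraction of adjacent sample pairs whose signs differ. The following is a minimal sketch and not the authors' implementation: the function name `zero_crossing_rate`, the convention that a zero-valued sample counts as non-negative, and the two synthetic test frames are all assumptions made for the example.

```python
import math


def zero_crossing_rate(frame):
    """Return the fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


# Hypothetical test frames: a slow and a fast sinusoid sampled at 8 kHz.
rate = 8000
low = [math.sin(2 * math.pi * 100 * n / rate) for n in range(800)]
high = [math.sin(2 * math.pi * 1500 * n / rate) for n in range(800)]

print(zero_crossing_rate(low), zero_crossing_rate(high))
```

The faster sinusoid changes sign more often per sample, so its ZCR is larger; this is the property that makes ZCR usable as a coarse indicator of high-frequency content.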
978-1-4244-1807-7/08/$25.00 ©2008 IEEE
Authorized licensed use limited to: Aalborg Universitetsbibliotek. Downloaded on January 18, 2010 at 08:36 from IEEE Xplore. Restrictions apply.
2.2. Silence Ratio

As amplitude change with time is the basic audio representation, some related statistics of audio sample amplitudes have been introduced as suitable features for audio classification. One very useful statistic for audio classification is the silence ratio (SR), defined as the ratio between the amount of silence in an audio piece and the length of the piece. Different types of audio have different SRs. For example, speech normally has higher SRs than music. Audio can be classified with an appropriately selected SR threshold.

2.3. Eigen Ratio

A speech signal contains both periodic and non-periodic information due to the impulsive nature of events or "noise-like" processes occurring in unvoiced speech. As a result, we can write a time window segment of the underlying observed audio signal as follows:

$$y(n) = \sum_{i=1}^{L} a_i \cos(2\pi f_i n + \varphi_i) + w(n) \qquad (1)$$

Now consider the time-window vector model consisting of a sum of complex exponentials in noise from (1). The autocorrelation matrix of this model can be written as the sum of signal and noise autocorrelation matrices

$$R_x = E\{x(n)\,x^H(n)\} = R_s + R_w \qquad (2)$$

$$R_x = \sum_{i=1}^{L} |a_i|^2\, v(f_i)\, v^H(f_i) + \sigma_w^2 I = V S V^H + \sigma_w^2 I \qquad (3)$$

where

$$V = [\, v(f_1) \;\; v(f_2) \;\; \cdots \;\; v(f_L) \,] \qquad (4)$$

is an $N \times L$ matrix whose columns are the time-window frequency vectors at $f_i$ with $i = 1, \ldots, L$, and

$$S = \mathrm{diag}\left(|a_1|^2,\; |a_2|^2,\; \ldots,\; |a_L|^2\right) \qquad (5)$$

is a diagonal matrix of the powers of each of the respective complex exponentials. The autocorrelation matrix of the white noise will be

$$R_w = \sigma_w^2 I \qquad (6)$$

which is full rank, as opposed to $R_s$, which is rank-deficient for $L < N$. In general, we will always choose the length of our time window $N$ to be greater than the number of complex exponentials $L$. Using the Multiple Signal Classification (MUSIC) algorithm as an eigen-based subspace decomposition method, the autocorrelation matrix can also be written in terms of its eigendecomposition in order to estimate the frequencies of complex sinusoids observed in additive white noise given in (1) as follows:

$$R_x = \sum_{m=1}^{N} \lambda_m q_m q_m^H = Q D Q^H \qquad (7)$$

where $\lambda_m$ are the eigenvalues in descending order, that is, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N$, and $q_m$ are their corresponding eigenvectors. Here $D$ is a diagonal matrix made up of the eigenvalues in descending order on the diagonal, while the columns of $Q$ are the corresponding eigenvectors. The eigenvalues due to the signals can be written as the sum of the signal power in the time window and the noise as follows:

$$\lambda_m = N |a_m|^2 + \sigma_w^2 \quad \text{for } m \leq L \qquad (8)$$

The remaining eigenvalues are due to the noise only, that is:

$$\lambda_m = \sigma_w^2 \quad \text{for } m > L \qquad (9)$$

Therefore, the $L$ largest eigenvalues correspond to the signal made up of complex exponentials, and the remaining eigenvalues have equal value and correspond to the noise. Based on the abovementioned eigendecomposition, we introduce an appropriate feature which can discriminate audio signals based on their harmonicity and principal component value. This parameter is defined as follows:

$$D = \{\lambda_i\}, \quad i = 1, \ldots, N \qquad (10)$$

$$\lambda_1 = \max_i (D) \qquad (11)$$

$$\lambda_2 = \max_i \left(D \setminus \{\lambda_1\}\right) \qquad (12)$$

$$\text{Eigen Ratio} = \frac{\lambda_2}{\lambda_1} \qquad (13)$$

As seen from (10)-(13), the Eigen ratio is the second largest eigenvalue over $\lambda_{\max}$. It takes a considerable value in the case of music signals, in contrast to negligible values for speech signals. As a result, the Eigen ratio combined with other useful features can be used to improve the performance of state-of-the-art audio classifiers. In this paper, we demonstrate that this feature can be used to discriminate between the two classes of music and mixture. To elaborate, if the sound is periodic, then it is most likely voiced. In contrast, if the sound is non-periodic with strong high-frequency components, it is likely to be unvoiced. In addition, when a piece of sound is
considered rhythmic, it often means that it repeats in some way, on a time scale much longer than that required to generate a frequency. Periodicity in feature values is a good indicator of rhythmicity, and this requires examining the sound at a much longer scale than most feature extractors use. In addition, note that in harmonic sound the spectral components are mostly whole number multiples of the lowest and most often loudest frequency. This loudest frequency corresponds to the largest eigenvalue obtained in (8). Music is normally more harmonic than other sounds. As a result, to distinguish between music and speech, we use the eigen-ratio defined in (13) as a useful feature for representing harmonicity. The one with high harmonicity is music, otherwise it is speech. This can be best observed in

3. Classification and Pattern Learning

As discussed in previous sections, the classification proposed here is mostly based on the following discrimination criteria: (1) Speech signals have higher SRs and low eigen-ratio. (2) Music or mixed audio signals including speech and music have rare periods of silence and long harmonic tracks, denoting higher eigen-ratios. As a result of this feature selection, we observed that we can convert the underlying mixed type audio classification problem into a two-class problem. Each point is mapped into the new feature domain (with gain in dB on the x-axis and the eigen-ratio introduced in 2.3 on the y-axis). All we need is a non-linear classifier to accurately separate the overlapping regions.

Clustering algorithms work by examining a large number of cases and finding groups of cases with similar parameters, called clusters, which are considered to belong to the same category in the classification. Once the clusters have been discovered, a representative case is chosen for each cluster, usually corresponding to the center of each cluster, and new cases are classified depending on the proximity to the representative cases. The problem of audio classification can be divided into the following stages: Feature Extraction (FE), Training, and Classification. Fig. 1 shows the diagram of the proposed audio classification system in this paper.

The feature extraction procedure begins with frame blocking followed by windowing to minimize the edge-effect (spectral leakage) problem. Successive windows overlap each other by 5 ms. In addition, in our silence/active detection stage, the threshold is fixed at about 5% of the maximum amplitude in our implementation. As one sample takes a very short period of time, a silence period is detected only when a number of samples are below the silence threshold. In our experiment, we use 10 ms as the minimum length of a silence period. After the detection of silence periods of an audio file, SR is calculated as the sum of all silence periods divided by the length of the entire audio file. In the following sections, some commonly used classification approaches are briefly reviewed and then employed for the underlying audio classification in simulation results.

[Figure labels: audio, Preprocessing, Feature extraction, Feature vectors, Training data, Test data, Pattern learning, Pattern classes, Recognized class]
Fig. 1. Block diagram of the proposed audio classification system.

3.1. K-Nearest Neighbour Classifier

One of the simplest classifiers is the k nearest neighbour classifier (k-NN), used in [12], [13]. The distance between the tested feature vector and all of the training vectors from different classes is measured. The classification is done according to the k nearest training vectors.

In the k-NN approach, a data vector to be classified is compared to training data vectors from different classes and classification is performed according to the distance to the k nearest neighbouring data points. In this paper, the Euclidean distance is used in determining the nearest neighbours. The distance between the vector x to be classified and a training vector y is measured by:

$$D = (x - y)^T (x - y) \qquad (13)$$

The classification is done by picking the k points nearest to the current test point, and the class most often picked is chosen as the classification result. The neighbours should be close to the test data point x to get an accurate estimate, but still the number of neighbours should be large enough to get a reliable estimate for the a posteriori probability of the data belonging to each class. The implementation of k-NN includes storing all the training data and calculating a distance between every test point and all training data.

3.2. Multi-layer Perceptron

The neural network used in this research is a general Multilayer Perceptron (MLP). Both feed forward and back propagation architectures are used. The network topology consists of 2 input nodes and 6 neurons in the input and hidden layers, respectively. The output node indicates the classification result. The maximum number of training cycles is set to 50. For evaluation purposes, 90% of the whole data were used as the training set and the rest for testing.

3.3. SVM Classifier

In addition to the previous multi-class learning methods, a binary classification approach with SVM is previously
studied in [7-8]. Feature vectors are non-linearly mapped into a new feature space with the help of the support vectors found by the SVM, and a hyperplane is then searched for in the new feature space to separate the data points of the classes with a maximum margin. The SVM is a supervised classification system that minimizes an upper bound on its expected error.

In [7], SVM is extended into multi-class classification with one-versus-the-rest, pairwise comparison, and multi-class objective functions. In the one-versus-the-rest method, binary classifiers are trained to separate one class from the rest of the classes. The multi-class classification is then carried out according to the maximal output of these binary classifiers. In pairwise comparison, a classifier is trained for each possible pair of classes and the unknown observation is assigned to the class getting the highest number of classification "votes" among all the classifiers. In the multi-class objective-function method, the objective function of a binary SVM is directly modified to allow the simultaneous computation of a multi-class classifier.

SVM attempts to find the hyperplane separating two classes of data that will generalize best to future data. Such a hyperplane is the so-called maximum margin hyperplane, which maximizes the distance to the closest point from each class. More concretely, given data points $\{X_0, \ldots, X_N\}$ and class labels $\{y_0, \ldots, y_N\}$, $y_i \in \{-1, 1\}$, any hyperplane separating the two data classes has the form

$$y_i (w^T X_i + b) > 0 \quad \text{for any } i \qquad (14)$$

Let $\{w_k\}$ be the set of all such hyperplanes. The maximum margin hyperplane is given by

$$w = \sum_{i=0}^{N} \alpha_i y_i X_i \qquad (15)$$

where $\{\alpha_0, \alpha_1, \ldots, \alpha_N\}$ maximize

$$L_D = \sum_{i=0}^{N} \alpha_i - \frac{1}{2} \sum_{i=0}^{N} \sum_{j=0}^{N} \alpha_i \alpha_j y_i y_j X_i^T X_j \qquad (16)$$

subject to

$$\sum_{i=0}^{N} \alpha_i y_i = 0, \quad \alpha_i \geq 0 \quad \text{for any } i \qquad (17)$$

For linearly separable data, only a subset of the $\alpha_i$'s will be non-zero. These points are called the support vectors and all classification performed by the SVM depends on only these points and no others. If the SVM system can accurately identify the critical samples that will become the support vectors, training time and labeling effort can, in the best case, be reduced with no impact on classifier accuracy. Since the data points $X$ only enter calculations via dot products, one can transform them to another feature space via a function $\Phi(X)$. The representation of the data in this feature space need never be explicitly calculated if there is an appropriate Mercer kernel operator for which

$$K(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j) \qquad (18)$$

Data not linearly separable in the original space may become separable in this feature space. In our implementation, a radial basis function (RBF) kernel

$$K(X_i, X_j) = e^{-\gamma D^2(X_i, X_j)} \qquad (19)$$

is selected, where $D^2(X_i, X_j)$ could be any distance function. Thus the space of possible classifier functions consists of linear combinations of weighted Gaussians around key training instances [14]. In this paper, we use SVM with a Gaussian kernel, based on the features selected for the discrimination of the original classes. SVMs are used in a "one vs. one" fashion, with probabilistic outputs.

4. Experimental Results

We observed that k-NN reaches its best classification performance, i.e. 82%, for 3 nearest neighbours. As can be seen from Table 1, the classification performance reaches approximately 83% when using MLP (feed forward). Next, employing a back propagation neural network, classification accuracy was 89%. Finally, SVM with RBF kernel function is used and the obtained results are shown in Fig. 2.

[Figure: scatter plot with decision boundaries; x-axis: Gain (dB)]
Fig. 2. Showing (a) the decision boundaries using SVM.

Table 1. Audio classification results.

Classifier   Audio (in %)            Parameter
MLP          P(music/mix)  80.14     2 input nodes, 6 neurons in hidden layer
             P(mix/music)  83.94
             P(mix/music)  81.51     Linear
SVM          P(music/mix)  90.73
             P(mix/music)  89.58     γ = 0.5, RBF
             P(music/mix)  90.97     γ = 0.3, RBF
             P(mix/music)  88.03
NN           P(music/mix)  896.86    2 input nodes / 6 neurons
KNN          P(music/mix)  20        number of neighbors = 2, 3
             P(mix/music)  81.09
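
To make the role of γ in Table 1 concrete, the RBF kernel of (19) can be sketched with the squared Euclidean distance taken as D². This is an illustrative sketch only and not the authors' code: the function name `rbf_kernel`, the choice of squared Euclidean distance for D², and the two sample feature vectors are assumptions made for the example.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * D^2(x, y)).

    Assumption: D^2 is the squared Euclidean distance between x and y.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical 2-D feature vectors (gain in dB, eigen-ratio); values are made up.
a = (-12.0, 0.40)
b = (-11.0, 0.10)

for gamma in (0.3, 0.5):  # the two RBF settings listed in Table 1
    print(gamma, rbf_kernel(a, b, gamma))
```

A larger γ makes the kernel decay faster with distance, so each support vector influences a smaller neighbourhood of the feature space. The kernel of a vector with itself is always 1.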
It is seen from Table 1 that the eigen-ratio feature is quite effective in classifying audio into music and speech. In addition, according to the results in Table 1, using the SVM classifier results in significantly higher classification accuracy, around 90%. Careful investigation is needed while adjusting SVM parameters including γ. In our experiments we found that 0.3 < γ < 0.4 is a favorable choice. Fig. 3 demonstrates the outputs of the SVM and the original speech and music signals. The SVM detects the pre- and post-silence segments as depicted.

[Figure: three panels titled SVM classification, Speech, and Music; x-axis: Samples, 0 to 2000]
Fig. 3. Audio classification results: top panel: audio mixture with SVM boundary results, middle: speech only, and bottom: music.

Real-time processing has been introduced as a requisite in much of the audio classification literature. As a result, we have also conducted an evaluation to compare the elapsed time to implement each one of the classifiers discussed in this paper. The results are summarized in Table 2. It is seen that the SVM approach reaches a trade-off between classification elapsed time and accuracy.

Table 2. Processing elapsed time for different classifiers.

Classifier         Elapsed time
KNN                0.116
MLP                0.507
Back propagation   0.662
SVM                0.228

5. Conclusion

A new audio classification framework was proposed and evaluated based on a new feature called eigen-ratio. Our preliminary experiments using eigen-ratio and SRs show promising results. Using the commonly used classifiers including KNN, MLP and SVM, we observed that such feature selection gives significantly better results in contrast to the features used in previous works. Different simulations demonstrated that the proposed technique gives better results using an SVM classifier with RBF kernel while reaching a trade-off between classification accuracy and elapsed time.

References

[1] Frakes W. B. and Baeza-Yates R. (ed.), Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
[2] Erling Wold et al., "Content-based classification, search, and retrieval of audio", IEEE Multimedia, pp. 27-36, 1996.
[3] Asif Ghias et al., "Query by Humming - Musical Information Retrieval in an Audio Database", Proceedings of ACM Multimedia 95, San Francisco, California, pp. 5-9, Nov. 1995.
[4] H. Soltau, T. Schultz, M. Westphal and A. Waibel, "Recognition of Music Types", Seattle, WA, IEEE, ICASSP, 1998.
[5] T. Li, M. Ogihara and Q. Li, "A comparative study on content-based music genre classification", Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 282-289, July, 2003.
[6] C. Xu, N. C. Maddage, X. Shao, F. Cao and Q. Tian, "Musical Genre Classification Using Support Vector Machines", IEEE ASSP, vol. 5, pp. 429-432, 2003.
[7] J. Saunders, "Real-time discrimination of broadcast speech/music", IEEE ASSP, pp. 993-996, 1996.
[8] M. J. Carey, B. S. Parris, and H. Lloyd-Thomas, "A comparison of features for speech, music discrimination," in ICASSP, April,
[9] K. El-Maleh, M. Klein, G. Petrucci, and P. Kabal, "Speech/music discrimination for multimedia applications," in ICASSP, June, 2000.
[10] S. Z. Li, "Content-based audio classification and retrieval using the nearest feature line method," IEEE Trans. on Speech and Audio Processing, 2000.
[11] L. Lu, H. Jiang, and H. J. Zhang, "A robust audio classification and segmentation method," in Proc. 9th ACM Int. Conf. on Multimedia, 2001.
[12] T. Li, M. Ogihara and Q. Li, "A comparative study on content-based music genre classification", Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 282-289, July, 2003.
[13] C. Xu, N. C. Maddage, X. Shao, F. Cao and Q. Tian, "Musical Genre Classification Using Support Vector Machines", IEEE ASSP, vol. 5, pp. 429-432, 2003.
[14] Cristianini, N., Shawe-Taylor, J., "An introduction to support Vector Machines: and other kernel-based learning methods", Cambridge University Press, New York, 2000.