Predictor Filter Used in Analyzer of LPC Systems


INTRODUCTION

In this research work the author implements three techniques for extracting features of speech signals: linear predictive coding (LPC), Mel frequency cepstral coefficients (MFCC), and wavelet analysis. The extracted features are then applied to a speech recognizer, for which the author implements two techniques, dynamic time warping (DTW) and hidden Markov models (HMM), to recognize Telugu words.

5.1 LINEAR PREDICTIVE CODING

Linear predictive coding is an analysis/synthesis method, introduced in the 1960s, for predicting the present sample of speech from several previous samples. LPC is a powerful way of producing a synthesized speech signal. The strengths of the LPC method are the speed of the analysis algorithm and the low bandwidth required for the encoded signal.

5.1.1 FUNDAMENTALS OF LINEAR PREDICTIVE CODING

There are two ways of estimating the spectral envelope of a sound. The first is the Fast Fourier Transform (FFT), which measures the spectrum of a sound by sampling amplitude values at equally spaced frequency points; this provides an exact estimate of the spectrum. The other is linear predictive coding (LPC), which models the overall spectral envelope to create a smoothed image of the sound's spectrum. Both have advantages and disadvantages, but LPC is particularly effective for manipulating speech: LPC deals with modeling, while the FFT performs spectrum estimation. LPC is one of the most powerful and useful methods for encoding good-quality speech at a low bit rate. It provides accurate estimates of the speech parameters and is computationally efficient. The main assumption of LPC is that the nth sample in a sequence of speech samples can be predicted and represented by a weighted sum of the p previous samples of the target signal:

ŝ(n) = Σ_{k=1}^{p} a_k s(n−k)        (5.1)

where p indicates the order of the LPC model. As p approaches infinity, the nth sample can be predicted exactly.

With practical limits on computation, p is typically of the order of 10-20, which already produces accurate results. Equation (5.2) shows the error signal, which is treated as the LPC residual:

e(n) = s(n) − ŝ(n) = s(n) − Σ_{k=1}^{p} a_k s(n−k)        (5.2)

Taking the Z-transform of equation (5.2),

E(z) = S(z) − Σ_{k=1}^{p} a_k z^{−k} S(z)        (5.3)

E(z) = S(z) [ 1 − Σ_{k=1}^{p} a_k z^{−k} ]        (5.4)

E(z) = S(z) · A(z),    where A(z) = 1 − Σ_{k=1}^{p} a_k z^{−k}        (5.5)

Thus, the error signal can be treated as the product of the original speech signal S(z) and the transfer function A(z). A(z) represents an all-zero digital filter, where the a_k coefficients correspond to zeros in the filter's z-plane. Similarly, we can retrieve the original speech signal S(z) as the product of the error signal E(z) and the transfer function 1/A(z):

S(z) = E(z) · 1/A(z)        (5.6)

The transfer function 1/A(z) represents an all-pole digital filter, where the a_k coefficients correspond to poles in the filter's z-plane. For stability, the roots of the transfer function A(z) must lie inside the unit circle.
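As a minimal illustration of equations (5.1)-(5.6), the following MATLAB sketch estimates a 12th-order A(z) for one frame, inverse-filters the frame to obtain the residual, and recovers the frame through the all-pole filter 1/A(z). It assumes the Signal Processing Toolbox is available and uses 'speech.wav' as a stand-in name for any mono recording.

% Sketch: LPC analysis with A(z) and re-synthesis with 1/A(z) for one frame.
[x, fs] = audioread('speech.wav');      % any mono speech recording (assumed file name)
frame   = x(1:round(0.03*fs));          % one 30 ms analysis frame
p       = 12;                           % prediction order (typically 10-20)
a = lpc(frame, p);                      % a = [1, -a1, ..., -ap], the coefficients of A(z)
residual = filter(a, 1, frame);         % e(n): inverse filtering, near-flat spectrum
rebuilt  = filter(1, a, residual);      % s(n) recovered through the all-pole filter 1/A(z)
fprintf('max reconstruction error: %g\n', max(abs(rebuilt - frame)));

Because the same coefficients are used for analysis and re-synthesis, the reconstruction error here is only numerical round-off; in a real coder the residual would be replaced by a pulse-train or noise excitation.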

The spectrum of the error signal E(z) is different for voiced and unvoiced sounds. Vibrations of the vocal cords produce voiced sounds, while unvoiced sounds have less energy and higher frequencies. The spectrum of a voiced sound is periodic with some fundamental frequency, called the pitch; unvoiced signals do not have a fundamental frequency.

LPC digitally encodes analog signals using a single- or multi-level sampling system, in which the value of the signal at each sample time is predicted to be a linear function of the past values of the quantized signal. LPC is related to adaptive predictive coding (APC), since both use adaptive predictors; however, LPC uses more prediction coefficients to permit a lower information bit rate than APC, and thus requires a more complex processor. Usually, speech is sampled at 8 kHz with a sample size of 8 bits, so the raw data rate is 64,000 bits/second. The LPC algorithm uses compression to reduce this rate to 2,400 bits/second. It does so by breaking the speech into segments and then sending, for each segment, the voiced/unvoiced decision, the pitch period, and the coefficients of the filter that represents the vocal tract.

At a bit rate of 2,400 bits/second, the speech has a distinctly synthetic sound and there is a noticeable loss of quality in the compressed signal; however, the speech is still audible and understandable. Since information is lost, linear predictive coding is a lossy form of compression. LPC analyzes the speech signal by estimating the formants, removing their effect from the speech signal, and estimating the intensity and frequency of the remaining buzz. Formant frequencies are the frequencies at which resonant peaks occur. The numbers that describe the formants and the residue can be stored or transmitted elsewhere.

When the predictor filter has been adjusted, it predicts the input as well as possible from the immediately preceding samples. The difference between the input speech and the predictor output (known as the residual) has a roughly flat spectrum, because the spectral peaks caused by the resonances of speech production have been removed; for the same reason, the complete filtering process is sometimes referred to as inverse filtering. LPC synthesizes the speech signal by reversing the process: use the residue to create a source signal, use the formants to create an all-pole filter (which represents the vocal tract tube), and run the source through the filter, resulting in speech. Since the speech signal varies with time, this process is carried out on short chunks of the speech signal called frames; usually 30-50 frames per second give intelligible speech with good compression.

Figure 5.1 shows how the predictor filter is used in the analyzer of an LPC system; Figure 5.2 shows its use in the synthesizer.

Figure 5.1: Predictor Filter used in Analyzer of LPC Systems (the speech signal drives the analyzer's predictor filter to produce the residual signal; the filter coefficients are calculated by analyzing short sections, called frames, of the speech signal)

Figure 5.2: Predictor Filter used in Synthesizer of LPC Systems (an excitation signal, generally the residual, is passed through a predictor filter controlled by the transmitter's predictor coefficients to produce the output speech)

5.1.2 LPC MODEL

Figure 5.3: LPC Model (analysis/encoding is done by the transmitter; synthesis/decoding is done by the receiver)

5.1.3 GENERALIZED LPC ALGORITHM

Figure 5.4: Generalized LPC Algorithm (analysis: pre-emphasis, cross-correlation and coefficient extraction of the input; the coefficients, pitch information, and voiced/unvoiced samples drive the synthesis stage of all-pole filter design and de-emphasis to produce the output)

In LPC, the input signal is sampled, broken into segments/blocks/frames, analyzed and then transmitted to the receiver.

Pre-emphasis refers to a process designed to increase, within a band of frequencies, the magnitude of some (usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies. This improves the overall signal-to-noise ratio by minimizing the adverse effects of phenomena such as attenuation distortion or saturation of recording media in subsequent parts of the system.

In speech analysis, it is computationally intensive to determine the pitch period for a given segment of speech. There are several algorithms to compute it; one takes advantage of the fact that the autocorrelation of a periodic function, r(k), will have a maximum when k is equal to the pitch period.

From equation (5.2), the prediction error is

e(n) = s(n) − Σ_{k=1}^{M} a_k s(n−k)        (5.7)

The sum of the squared errors to be minimized is expressed as:

E = Σ_n e²(n) = Σ_n [ s(n) − Σ_{k=1}^{M} a_k s(n−k) ]²        (5.8)

By setting the derivative of E with respect to a_k (using the chain rule) to zero, one obtains

Σ_{i=1}^{M} a_i r(|i−k|) = r(k),    for k = 1, 2, 3, …, M        (5.9)

Equation (5.9) can be expressed in matrix form as

R · a = r        (5.10)

where R is the M×M matrix of autocorrelation values with elements R(i,k) = r(|i−k|), a is the column vector of predictor coefficients, and r is the column vector [r(1), r(2), …, r(M)]ᵀ. The coefficients are then obtained as

a = R⁻¹ · r        (5.11)

To solve equations (5.10) and (5.11), any efficient matrix method can be used, such as matrix decomposition, Gaussian elimination, the Levinson-Durbin recursion, or Cholesky decomposition.

Of these methods, the Levinson-Durbin recursion is very efficient because it needs only on the order of M² multiplications to compute the linear prediction coefficients. From equation (5.8), the sum of squared errors can be expressed as:

E = r(0) − Σ_{k=1}^{M} a_k r(k)        (5.12)

The above equation can be re-written for an Mth-order predictor as

E_M = r(0) − Σ_{k=1}^{M} a_{M,k} r(k)        (5.13)

Solving the above equation recursively, at each order m the reflection coefficient is obtained as

K_m = [ r(m) − Σ_{i=1}^{m−1} a_{m−1,i} r(m−i) ] / E_{m−1}        (5.14)

When m = 0, E_0 = r(0).

When m = 1, E_1 = r(0) − a_{11} r(1).

So a_{11} = r(1)/r(0) = K_1, where K_1 is termed the reflection coefficient.

Now |K_1| < 1, since r(1) < r(0), and E_1 = r(0)[1 − K_1²].

From the above equation, it can be concluded that the prediction error E_1 is always less than E_0.

For m = 2, 3, …, M the following recursion is performed:

(i) K_m = [ r(m) − Σ_{i=1}^{m−1} a_{m−1,i} r(m−i) ] / E_{m−1}

(ii) a_{m,m} = K_m, and a_{m,i} = a_{m−1,i} − K_m a_{m−1,m−i} for i = 1, …, m−1

(iii) E_m = (1 − K_m²) E_{m−1}

If m < M, then increase m to m+1 and go to (i). If m = M, then stop; the final LP coefficients are a_k = a_{M,k}.
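As a check on the recursion, the following MATLAB sketch (a windowed speech frame x and order M = 10 are assumed; xcorr, levinson, and toeplitz are Signal Processing Toolbox functions) compares the Levinson-Durbin solution of the normal equations with a direct solve of the Toeplitz system:

% Sketch: solving R*a = r for one frame x, by Levinson-Durbin and by direct inversion.
M = 10;
r = xcorr(x, M, 'biased');       % autocorrelation at lags -M..M
r = r(M+1:end); r = r(:);        % keep lags 0..M as a column: r(0), r(1), ..., r(M)
A   = levinson(r, M);            % Levinson-Durbin: returns [1, -a1, ..., -aM], ~M^2 multiplications
aLD = -A(2:end).';               % predictor coefficients a1..aM
R       = toeplitz(r(1:M));      % M x M Toeplitz matrix built from r(0)..r(M-1)
aDirect = R \ r(2:M+1);          % direct solve of R*a = r (roughly O(M^3))
fprintf('max difference: %g\n', max(abs(aLD - aDirect)));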

Pitch estimation: pitch can be estimated by the autocorrelation method, the average magnitude difference function (AMDF), or the cepstrum. The autocorrelation method is used in this work.

The autocorrelation of a stationary sequence x(n) is defined as

r(τ) = Σ_n x(n) x(n + τ)

where τ is termed the lag. The autocorrelation is the average correlation between a signal and a copy of itself delayed by τ samples. In MATLAB, the built-in function "xcorr" can be used either for cross-correlation between two signals or for autocorrelation of a signal with itself.
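A minimal sketch of autocorrelation pitch estimation with xcorr follows; it assumes a voiced frame x (at least a few hundred samples long) sampled at 8 kHz, and the 50-400 Hz search range is an assumption rather than a value from the text.

% Sketch: pitch estimation by autocorrelation for one voiced frame x at fs = 8000 Hz.
fs = 8000;
[r, lags] = xcorr(x, 'coeff');                        % normalized autocorrelation
keep = lags >= round(fs/400) & lags <= round(fs/50);  % lags for 50-400 Hz (assumed range)
[~, i] = max(r(keep));                                % strongest peak inside the range
candidates = lags(keep);
pitchPeriod = candidates(i);                          % pitch period in samples
f0 = fs / pitchPeriod;                                % estimated fundamental frequency in Hz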

The analysis/encoding part of LPC examines the speech signal by breaking it down into segments or blocks. Each segment is then examined further to find

voiced / unvoiced segment

pitch information important for this particular segment

If both the linear prediction coefficients and the residual error sequence are available, the speech signal can be reconstructed using the synthesis filter. The receiver performs LPC synthesis by using the received parameters to build a filter that, when provided with the correct input source, will accurately reproduce the original speech signal. Essentially, LPC synthesis tries to imitate human speech production.

LPC is a method of separating out the effects of source and filter from a speech signal; it is similar in intention to cepstral analysis but uses quite different methods. One way of thinking about LPC is as a coding method: a way of encoding the information in a speech signal into a smaller space for transmission over a restricted channel. LPC encodes a signal by finding a set of weights on earlier signal values that can predict the next signal value, e.g.

s[n] = a[1]·s[n−1] + a[2]·s[n−2] + a[3]·s[n−3] + e[n]

If values for a[1..3] can be found such that e[n] is very small for a stretch of speech (say, one analysis window), then we can transmit only a[1..3] instead of the signal values in the window. The speech frame can be reconstructed at the other end by using a default e[n] signal and predicting subsequent values from earlier ones. Clearly this relies on being able to find these values of a[1..k], and there are algorithms which can do this. The result of LPC analysis, then, is a set of coefficients a[1..k] and an error signal e[n]; the error signal will be as small as possible and represents the difference between the predicted signal and the original.

There is an obvious parallel between the LPC equation and that of a recursive (IIR) filter,

s[n] = Σ_k a[k]·s[n−k] + e[n]

where the terms have been rearranged: the LPC coefficients correspond to those of a recursive filter and the error signal corresponds to a source signal. Moreover, the conditions under which the error signal is minimized in LPC analysis mean that the error signal will have a flat spectrum, and hence that it will approximate either an impulse train or a white-noise signal. This is a very close match to the source-filter model of speech production, in which a vocal tract filter is excited either by a voiced signal (which looks like a series of impulses) or by a noise source. So LPC analysis has the useful property of finding the coefficients of a filter which will convert either noise or an impulse train into the original frame of speech.

The result isn't quite perfect: the filter coefficients derived by LPC analysis contain information about the glottal source filter, the lip radiation (or pre-emphasis) filter, and the vocal tract itself. However, since the first two are much less variable than the vocal tract filter, we can factor them out in practice (e.g., by pre-emphasis before LPC analysis).

5.1.4 FORMANTS AND SMOOTH SPECTRA

We need to know about z-transforms to cover LPC analysis well; if this were as far as we were going we would not need z, but LPC is really just a way in to some more interesting signal analysis techniques.

The LPC coefficients make up a model of the vocal tract shape that produced the original speech signal. A spectrum generated from these coefficients shows the properties of the vocal tract shape without the interference of the source spectrum. We can take the spectrum of the filter in various ways, for example by passing an impulse through the filter and taking its DFT, or by substituting z = e^{jω} in the z-transform of the filter. Either way, the result can be quite useful in signal analysis.

Looking at an LPC-smoothed spectrum of voiced speech we can clearly see the formant peaks; they tend to be much better defined than in a cepstrally smoothed spectrum. We can use the z-transform notation to find the locations of these formant peaks for a given set of LPC coefficients: they correspond to the points at which A(z) is zero. This is the key to automatic formant tracking of speech signals: derive the LPC coefficients, solve the z-transform equation, and record the resulting formant positions. Unfortunately, since the LPC model is not a perfect fit to real speech production (it assumes a lossless, all-pole model, for example), this method will sometimes derive spurious formants.

LPC coefficients can also be used to derive cepstral coefficients and area functions. LPC is a powerful signal modeling technique and is very important in speech recognition and speech analysis.

One of the most powerful signal analysis techniques is the method of linear prediction. LPC of speech has become the predominant technique for estimating the basic parameters of speech. It provides both an accurate estimate of the speech parameters and an efficient computational model of speech. The basic idea behind LPC is that a speech sample can be approximated as a linear combination of past speech samples. By minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the predicted values, a unique set of parameters, the predictor coefficients, can be determined. These coefficients form the basis for LPC of speech. The analysis provides the capability for computing the linear prediction model of speech over time. The predictor coefficients are often transformed to a more robust set of parameters known as cepstral coefficients.

5.1.5 BASICS OF LP ANALYSIS

The redundancy in the speech signal is exploited in LP analysis. The prediction of the current sample as a linear combination of the past p samples forms the basis of linear prediction analysis, where p is the order of prediction. The predicted sample ŝ(n) can be represented as follows:

ŝ(n) = Σ_{k=1}^{p} a_k s(n−k)        (5.15)

where the a_k are the linear prediction coefficients and s(n) is the windowed speech sequence, obtained by multiplying a short-time speech frame by a Hamming or similar window:

s(n) = x(n) · w(n)        (5.16)

where w(n) is the windowing sequence. The prediction error e(n) is the difference between the actual sample s(n) and the predicted sample ŝ(n), which is given by

e(n) = s(n) − ŝ(n)        (5.17)

e(n) = s(n) − Σ_{k=1}^{p} a_k s(n−k)        (5.18)

(5.19)

The primary objective of LP analysis is to compute the LP coefficients that minimize the prediction error e(n). A popular method for computing the LP coefficients is the least-squares autocorrelation method, which minimizes the total prediction error. The total prediction error E can be represented as follows:

E = Σ_n e²(n)        (5.20)

This can be expanded using equation (5.18) as follows:

E = Σ_n [ s(n) − Σ_{k=1}^{p} a_k s(n−k) ]²        (5.21)

The values of a_k which minimize the total prediction error E can be computed by finding ∂E/∂a_k and equating it to zero for k = 1, 2, …, p. This gives p linear equations in p unknowns, and their solution gives the LP coefficients. This can be represented as follows:

∂E/∂a_k = 0,    k = 1, 2, …, p        (5.22)

The differentiated expression can be written as

Σ_n s(n−i) s(n) = Σ_{k=1}^{p} a_k Σ_n s(n−i) s(n−k)        (5.23)

where i = 1, 2, 3, …, p. Equation (5.23) can be written in terms of the autocorrelation sequence R(i) as follows:

Σ_{k=1}^{p} a_k R(|i−k|) = R(i)        (5.24)

for i=1,2,3...p.

where the autocorrelation sequence used in equation (5.24) can be written as follows:

R(i) = Σ_{n=i}^{N−1} s(n) s(n−i)        (5.25)

 for i= 1,2,3...p and N is the length of the sequence.

This can be represented in matrix form as

R · A = r

where R is the p×p symmetric matrix with elements R(i, k) = R(|i−k|), (1 ≤ i, k ≤ p), r is a column vector with elements (R(1), R(2), …, R(p)), and A is the column vector of LPC coefficients (a(1), a(2), …, a(p)). It can be shown that R is a Toeplitz matrix, which can be represented as

R = [ R(0)     R(1)     R(2)     …   R(p−1)
      R(1)     R(0)     R(1)     …   R(p−2)
      R(2)     R(1)     R(0)     …   R(p−3)
      …        …        …        …   …
      R(p−1)   R(p−2)   R(p−3)   …   R(0)   ]        (5.26)

The LP coefficients can then be computed as

A = R⁻¹ · r

where R⁻¹ is the inverse of the matrix R.

 

5.1.6 IMPLEMENTATION

The basic steps of the LPC processor include the following.

LPC Algorithm

The algorithm can be divided into the following main processing blocks:

Windowing and pre-emphasis filtering;

Autocorrelation computation;

Levinson-Durbin algorithm;

Pitch detection.

The speech signal is sampled at a frequency of 8 kHz and is processed in the LPC algorithm in blocks of 240 samples. The processed blocks overlap by 80 samples. Therefore, the LPC algorithm must extract the required parameters characterizing 240 samples of speech every 20 ms. In the segmentation processing block, the segments of 240 samples are high-pass filtered and windowed using a Hamming window. The resulting data samples are used as the input to the autocorrelation and pitch-detection blocks. The detection of silent speech blocks is also done in this processing unit. The autocorrelation processing block computes the autocorrelation of the 240 windowed samples for 11 different offsets. The Levinson-Durbin algorithm is then used to solve a set of 10 linear equations. These linear equations are functions of the sequence autocorrelation, and the solution is the set of feedback coefficients α_i. The transfer-function gain G is also obtained from the Levinson-Durbin algorithm. The Levinson-Durbin algorithm is a recursive algorithm involving various numerical manipulations, in particular the computation of a division. The 240 windowed samples are also processed to obtain the voiced or unvoiced characteristics of the speech signal. They are first low-pass filtered using a 24-tap FIR filter. The filtered signal is then clipped using a 3-level center-clipping function. The autocorrelation of the clipped signal is then computed for 60 different sequence offsets. From these autocorrelation results the voiced/unvoiced parameters of the speech signal are extracted.
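The following MATLAB sketch mirrors that block structure: 8 kHz input, 240-sample frames overlapping by 80 samples, Hamming windowing, autocorrelation at 11 lags, a 10th-order Levinson-Durbin solve, and a center-clipped autocorrelation for the pitch branch. The file name, the pre-emphasis coefficient, the pitch search range, and the voicing threshold are assumptions, and the high-pass and low-pass filters of the full system are omitted.

% Sketch of the frame-by-frame LPC analyzer described above.
[x, fs] = audioread('speech_8k.wav');          % assumed 8 kHz mono recording
N = 240; shift = N - 80;                       % 240-sample frames, 80-sample overlap
pre = filter([1 -0.9375], 1, x);               % pre-emphasis (coefficient assumed)
win = hamming(N);
nFrames = floor((length(pre) - N)/shift) + 1;
coeffs = zeros(nFrames, 10); gain = zeros(nFrames, 1); pitch = zeros(nFrames, 1);
for k = 1:nFrames
    seg = pre((k-1)*shift + (1:N)) .* win;     % windowed 240-sample segment
    r = xcorr(seg, 10, 'biased'); r = r(11:21);         % autocorrelation at lags 0..10
    [A, Ep] = levinson(r, 10);                          % 10 linear equations solved recursively
    coeffs(k, :) = -A(2:end);                           % feedback coefficients alpha_1..alpha_10
    gain(k) = sqrt(Ep);                                 % transfer-function gain G
    CL = 0.3 * max(abs(seg));                           % clipping level: 30% of the frame peak
    c  = sign(seg) .* max(abs(seg) - CL, 0);            % center-clipped segment
    rc = xcorr(c, 60, 'coeff'); rc = rc(61:end);        % clipped autocorrelation, lags 0..60
    lagRange = 20:60;                                   % pitch lags searched (~133-400 Hz, assumed)
    [pk, i] = max(rc(lagRange + 1));                    % rc(l+1) holds lag l
    if pk > 0.3                                         % assumed voicing threshold
        pitch(k) = fs / lagRange(i);                    % pitch estimate in Hz (0 = unvoiced)
    end
end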

5.1.7 PRE-EMPHASIS

The spectrum of a phoneme or word shows considerable variation, particularly between vowels and consonants. Vowel spectra are mainly higher in amplitude at low frequencies, since during their production the vocal tract is configured to give rise to low-frequency resonances. Consonants, on the other hand, are sudden bursts of air from the mouth or other kinds of turbulent airflow, which result in high-frequency, noise-like, flat spectra.

Conversational speech, being in general a mixture of voiced and unvoiced components, has low-frequency components with high amplitudes and high-frequency components with low amplitudes. To reduce these amplitude differences, the speech signal is spectrally flattened by means of a first-order FIR pre-emphasis filter whose transfer function is given by

H(z) = 1 − a·z⁻¹        (5.27)

In the discrete-time domain, this is equivalent to the difference equation

y(n) = x(n) − a·x(n−1)        (5.28)

where x(n) is the original speech signal and y(n) is the pre-emphasized signal. There are two advantages of using this filter.

First, it counteracts the negative spectral slope of voiced sections of speech, which improves the efficiency of speech analysis.

Second, hearing is more sensitive to frequencies above 1 kHz. Pre-emphasis amplifies this area of the spectrum, thus modeling an important perceptual aspect of the auditory system.
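A minimal MATLAB sketch of the pre-emphasis filter of (5.27)-(5.28), assuming a = 0.95 as a typical value and x as the input speech vector:

a = 0.95;                        % assumed pre-emphasis coefficient, close to 1
y = filter([1 -a], 1, x);        % y(n) = x(n) - a*x(n-1): boosts the high frequencies
xBack = filter(1, [1 -a], y);    % de-emphasis with 1/H(z) recovers the original signal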

5.1.8 FRAME BLOCKING AND WINDOWING

Due to differences in the spectral features of phonemes, changes in prosody, and random variations in the vocal tract, speech is a non-stationary signal. However, over a short time interval (generally 10 to 20 ms) the speech signal is assumed to be stationary, and it is therefore analyzed over these short-time windows. The frame-blocking procedure thus consists essentially of dividing the speech signal into short frames of N samples, in which adjacent frames overlap by M samples.

In order to minimize spectral distortions when blocking the speech signal, each frame is multiplied with a Hamming window of the form

w(n) = 0.54 − 0.46·cos(2πn/(N−1)),    0 ≤ n ≤ N−1        (5.29)

where N is the duration (in samples) of the speech frame. The output y(n) of the windowed signal becomes:

y(n) = x(n) · w(n)        (5.30)

This windowing function acts as a low-pass filter, enhancing the signal at the window center and smoothing it at the edges.

Autocorrelation analysis: the next step is to autocorrelate each frame of the windowed signal in order to give

r(m) = Σ_{n=0}^{N−1−m} x(n) x(n+m),    m = 0, 1, …, p        (5.31)

where the highest autocorrelation lag, p, is the order of the LPC analysis.

5.1.9 LPC ANALYSIS

The next processing step is the LPC analysis, which converts each frame of p + 1 autocorrelation values into an LPC parameter set using Durbin's method. This can formally be given as the following algorithm:

E^(0) = r(0)        (5.32)

k_i = [ r(i) − Σ_{j=1}^{i−1} α_j^{(i−1)} r(i−j) ] / E^{(i−1)},    1 ≤ i ≤ p        (5.33)

α_i^{(i)} = k_i;    α_j^{(i)} = α_j^{(i−1)} − k_i α_{i−j}^{(i−1)}, 1 ≤ j ≤ i−1;    E^{(i)} = (1 − k_i²) E^{(i−1)}        (5.34)

Solving the above recursively for i = 1, 2, …, p, the LPC coefficients a_m are given as

a_m = α_m^{(p)},    1 ≤ m ≤ p        (5.35)

5.1.10 PITCH CALCULATION

One of the major limitations of the autocorrelation representation is that, in a sense, it retains too much of the information in the speech signal. To avoid this problem it is useful to process the speech signal so as to make the periodicity more prominent while suppressing other distracting features of the signal.

This is the approach followed here to permit the use of a very simple pitch detector. Techniques which perform this type of operation on the signal are sometimes called "spectrum flatteners", since their objective is to remove the effects of the vocal tract transfer function, thereby bringing each harmonic to the same amplitude level, as in the case of a periodic impulse train. There are numerous spectrum-flattening techniques; the technique used here is called "center clipping".

In this scheme, the center-clipped speech signal is obtained by the nonlinear transformation

y(n) = x(n) − C_L  if x(n) ≥ C_L;    y(n) = 0  if |x(n)| < C_L;    y(n) = x(n) + C_L  if x(n) ≤ −C_L        (5.36)

Figure 5.5: Center clipped speech signal

The operation of the center clipper is depicted in Figure 5.5. For each segment, the maximum amplitude A_max is found and the clipping level C_L is set equal to a fixed percentage of A_max (30% is used here). From Figure 5.5, it can be seen that for samples above C_L the output of the center clipper is equal to the input minus the clipping level. Consider the effect of the clipping level: for high clipping levels, fewer peaks exceed the clipping level, so fewer pulses appear in the output and therefore fewer extraneous peaks appear in the autocorrelation function. As the clipping level is decreased, more peaks pass through the clipper and the autocorrelation function becomes more complex.
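A small MATLAB sketch of the center clipper of (5.36), with the clipping level set to 30% of the peak amplitude of one frame seg (assumed given):

CL = 0.30 * max(abs(seg));                  % clipping level: 30% of A_max
clipped = zeros(size(seg));
clipped(seg >  CL) = seg(seg >  CL) - CL;   % samples above  C_L: input minus clipping level
clipped(seg < -CL) = seg(seg < -CL) + CL;   % samples below -C_L: input plus clipping level
r = xcorr(clipped, 'coeff');                % autocorrelation of the clipped frame shows cleaner pitch peaks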

5.1.11 LPC ADVANTAGES

LPC provides a good model of the speech signal.

LPC represents the spectral envelope with low-dimensional feature vectors.

It provides linear characteristics.

LPC leads to a reasonable source-vocal tract separation.

LPC is an analytically tractable model.

The method of LPC is mathematically precise and straightforward to implement in either software or hardware.

5.1.12 LPC DISADVANTAGES

LP models the input signal with constant weighting over the whole frequency range; however, human perception does not have constant frequency resolution over the whole frequency range.

A serious problem with LPC features is that they are highly correlated, whereas it is desirable to obtain less correlated features for acoustic modeling.

An inherent drawback of conventional LPC is its inability to include speech-specific a priori information in the modeling process.

5.1.13 SIMULATION RESULTS OF LPC

a) Analysis of LPC coefficients before they are applied to the recognition system

The following results are obtained before the coefficients are applied to the speech recognition system; they are the LPC coefficient graphs for the voiced and unvoiced parts of the input signal.

Figure 5.6: Plot of signal and its predictor coefficients

Figure 5.7: Plot of reflection coefficients and error

b) Results after applying the LPC coefficients to the recognition system

MATLAB simulation results for the recognition of Telugu digits using LPC show that if the LPC order is increased, the execution time of the system decreases.

5.2 MEL FREQUENCY CEPSTRAL COEFFICIENTS

The LPC coefficients are more or less like sufficient statistics of random samples in statistics: they are the least-squares estimators of the regression coefficients, i.e., the minimum-variance linear estimators of the regression coefficients. The large amount of data in a frame is well represented by the LPC coefficients, unless the coefficients are too small, i.e., the estimates of the regression coefficients are not significant compared with the noise. This problem is addressed by MFCC, where more coefficients are available and large data frames are still well represented.

5.2.1 SPEECH RECOGNITION USING MFCC

Generally speaking, a conventional automatic speech recognition (ASR) system can be organized in two blocks: the feature extraction and the modeling stage. In practice, the modeling stage is subdivided into acoustic and language modeling, both based on HMMs, as shown in Figure 5.8.

Figure 5.8 Simple representation of a conventional ASR

Feature extraction is usually a non-invertible (lossy) transformation. Since the MFCC is based on filter banks, the transformation does not allow perfect reconstruction: given only the features, it is not possible to reconstruct the original speech used to generate them.

Computational complexity and robustness are the two primary reasons for allowing this loss of information. Increasing the accuracy of the parametric representation by increasing the number of parameters increases the complexity and eventually does not lead to a better result, due to robustness issues: the greater the number of parameters in a model, the larger the training sequence needs to be.

Speech is usually segmented into frames of 20 to 30 ms, and the analysis window is shifted by 10 ms. Each frame is converted to 12 MFCCs plus a normalized energy parameter. The first and second derivatives of the MFCCs and energy are estimated, resulting in 39 numbers representing each frame. Assuming a sample rate of 8 kHz, every 10 ms the feature extraction module delivers 39 numbers to the modeling stage. This operation with overlapping frames is equivalent to taking 80 speech samples without overlap and representing them by 39 numbers. In fact, assuming each speech sample is represented by one byte and each feature by four bytes (a float), the parametric representation increases the number of bytes needed to represent 80 bytes of speech (to 156 bytes). If a sample rate of 16 kHz is assumed, the 39 parameters represent 160 samples. For higher sample rates, it is intuitive that 39 parameters do not allow the speech samples to be reconstructed. In any case, the goal here is not speech compression but obtaining features suitable for speech recognition.

5.2.2 MFCC AND ITS CALCULATION

The block diagram for calculating MFCCs is given below.

Figure 5.9 MFCC block diagram

There are two ways of looking at MFCCs: (a) as filter-bank processing adapted to the specificities of speech, and (b) as a modification of the conventional cepstrum, a well-known deconvolution technique based on homomorphic processing. These points of view are complementary and help give insight into MFCCs. I will briefly describe each one.

5.2.3 MEL-SCALE: FROM AUDITORY MODELING

Before proceeding, let us take into account some characteristics of the human auditory system. Two famous psychoacoustic experiments generated the Bark and Mel scales, whose filter center frequencies and bandwidths are given in Table 5.1.

Table 5.1: Characteristics of the human auditory system (Bark and Mel scale filters)

Filter   Bark frequency (Hz)   Bark BW (Hz)   Mel frequency (Hz)   Mel BW (Hz)
1        50                    100            100                  100
2        150                   100            200                  100
3        250                   100            300                  100
4        350                   100            400                  100
5        450                   110            500                  100
6        570                   120            600                  100
7        700                   140            700                  100
8        840                   150            800                  100
9        1000                  160            900                  100
10       1170                  190            1000                 124
11       1370                  210            1149                 160
12       1600                  240            1320                 184
13       1850                  280            1516                 211
14       2150                  320            1741                 242
15       2500                  380            2000                 278
16       2900                  450            2297                 320
17       3400                  550            2639                 367
18       4000                  700            3031                 422
19       4800                  900            3482                 484
20       5800                  1100           4000                 556
21       7000                  1300           4595                 639
22       8500                  1800           5278                 734
23       10500                 2500           6063                 843

5.2.4 CEPSTRAL ANALYSIS

Homomorphic processing is discussed in depth by Oppenheim in his textbooks. The cepstrum is perhaps the most popular homomorphic technique because it is useful for deconvolution. To understand it, recall that in speech processing the basic model adopted for human speech production is a source-filter model.

Source: related to the air expelled from the lungs. If the sound is unvoiced, as in "s" and "f", the glottis is open and the vocal cords are relaxed. If the sound is voiced, as in "a" and "e", the vocal cords vibrate, and the frequency of this vibration is related to the pitch.

Filter: responsible for shaping the spectrum of the signal in order to produce different sounds; it is related to the vocal tract organs.

Roughly speaking, a good parametric representation for a speech recognition system tries to eliminate the influence of the source (the system must give the same "answer" for a high-pitched female voice and a low-pitched male voice) and to characterize the filter. The problem is that the source e(n) and the filter impulse response h(n) are convolved, so deconvolution is needed in speech recognition applications. Mathematically:

In the time domain, convolution: source * filter = speech,

s(n) = e(n) * h(n)        (5.37)

In the frequency domain, multiplication: source x filter = speech,

S(z) = E(z) · H(z)        (5.38)

Working in the frequency domain, the logarithm is used to transform the multiplication into a summation (note: log ab = log a + log b). It is not easy to separate (to filter) things that are multiplied as in (5.38), but it is easy to design filters to separate things that are terms of a sum, as below:

C(z) = log S(z) = log E(z) + log H(z)        (5.39)

We hope that H(z) is mainly composed of low frequencies and that E(z) has most of its energy at higher frequencies, so that a simple low-pass filter could separate H(z) from E(z) if we were dealing with E(z) + H(z). In fact, let us suppose for the sake of simplicity that we have, instead of (5.39), the following equation:

C(z) = E(z) + H(z)        (5.40)

We could use a linear filter to eliminate E(z) and then calculate the inverse Z-transform to get a time sequence c₀(n). Notice that in this case c₀(n) would have the dimension of time (seconds, for example).

Having said that, let us now face our problem: the log operation in (5.39). The log is a non-linear operation and it can "create" new frequencies; for example, expanding the log of a cosine in a Taylor series shows that harmonics are created. So, even if E(z) and H(z) are well separated in the frequency domain, log E(z) and log H(z) could in principle have considerable overlap. Fortunately, that is not the case in practice for speech processing. The other point is that, because of the log operation, the inverse Z-transform of C(z) does not have the dimension of time as in (5.40). The inverse Z-transform of C(z) is called the cepstrum, and its dimension is quefrency (a time-domain-like parameter).

There are two basic types of cepstrum: the complex cepstrum and the real cepstrum. In addition, there are two ways of calculating the real cepstrum (used in speech processing because phase is not important): the LPC cepstrum and the FFT cepstrum.

LPC cepstrum: the cepstral coefficients are obtained from the LPC coefficients. FFT cepstrum: the cepstral coefficients are obtained from an FFT.

The most widely used parametric representation for speech recognition is the FFT cepstrum derived on a Mel scale.
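As a minimal sketch, the real (FFT) cepstrum of one windowed frame seg (assumed given) can be computed directly from the definitions above; the Signal Processing Toolbox function rceps computes the same quantity.

frame = seg .* hamming(length(seg));         % windowed short-time frame
c = real(ifft(log(abs(fft(frame)) + eps)));  % real cepstrum: low quefrencies ~ vocal tract H,
                                             % a peak near the pitch period ~ excitation E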

5.2.5 FILTER-BANK INTERPRETATION

We go to the frequency domain and disregard phase, working only with the power spectrum. Then we take the log, because our ears perceive loudness roughly logarithmically (in decibels). To reduce dimensionality, we use a filter bank with around 20 filters whose spacing follows the Mel scale. Finally, we take the DCT-II, which decorrelates the filter-bank outputs and compacts their energy into a few coefficients.

The examples below show cases where the MFCC did not capture the formant structure, i.e., where it did not perform a good job.

The MFCC is a representation defined as the real cepstrum of a windowed short-time signal derived from the fast Fourier transform of the speech signal. In the MFCC, a nonlinear frequency scale is used, which approximates the behavior of the auditory system. The discrete cosine transform of the real logarithm of the short-time energy spectrum expressed on this nonlinear frequency scale is called the MFCC.

In other words, to produce MFCCs one obtains the DFT of a frame of the data and, after the Mel filter bank smooths the spectrum, performs the inverse transform (in practice, a DCT) of the logarithm of the magnitude of the filter-bank output.

5.2.6 IMPLEMENTATION

The extraction of the best parametric representation of acoustic signals is an important task for producing good recognition performance. The efficiency of this phase is important for the next phase, since it affects its behavior. MFCC is based on human hearing perception, which does not perceive frequencies above about 1 kHz on a linear scale. In other words, MFCC is based on the known variation of the human ear's critical bandwidth with frequency. MFCC uses two types of filter spacing: linear below 1000 Hz and logarithmic above 1000 Hz. A subjective pitch scale, the Mel frequency scale, is used to capture the important phonetic characteristics of speech. The overall process of the MFCC is shown in Figure 5.13.

Figure 5.13: Block diagram of MFCC (speech signal in, MFCC features out)

5.2.7 PRE-EMPHASIS

Pre-emphasis refers to a process designed to increase, within a band of frequencies, the magnitude of some (usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies in order to improve the overall SNR. In this step the signal is passed through a filter which emphasizes higher frequencies, increasing the energy of the signal at high frequencies.

5.2.8 FRAMING

Framing is the process of segmenting the speech samples obtained from an ADC into small frames with lengths in the range of 20 to 40 ms. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N). Typical values are M = 100 and N = 256.

5.2.9 HAMMING WINDOWING

A Hamming window is used as the window shape, considering the next block in the feature-extraction processing chain, since it integrates the closest frequency lines. Let the window be defined as w(n), 0 ≤ n ≤ N−1, where

N = number of samples in each frame
Y(n) = output signal
X(n) = input signal
W(n) = Hamming window

Then the result of windowing the signal is:

Y(n) = X(n) × W(n)        (5.41)

5.2.10 FAST FOURIER TRANSFORM

The FFT is used to convert each frame of N samples from the time domain into the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal tract impulse response H[n] in the time domain into a multiplication in the frequency domain:

Y(w) = FFT[ h(t) * x(t) ] = H(w) · X(w)        (5.42)

where X(w), H(w), and Y(w) are the Fourier transforms of x(t), h(t), and y(t), respectively.

5.2.11 MEL FILTER BANK PROCESSING

The frequency range of the FFT spectrum is very wide, and the voice signal does not follow a linear scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at the centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters. Each filter output is then the sum of its filtered spectral components. The following equation is used to compute the Mel value for a given frequency f in Hz:

Mel(f) = 2595 · log₁₀(1 + f/700)        (5.43)

5.2.12 DISCRETE COSINE TRANSFORM

This is the process of converting the log Mel spectrum back into a time-like (quefrency) domain using the DCT. The results of the conversion are called Mel frequency cepstral coefficients, and the set of coefficients for a frame is called an acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic vectors.
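The following MATLAB sketch computes MFCCs for one frame seg (assumed given) following the chain described above: pre-emphasis, Hamming window, FFT, triangular Mel filter bank, log, and DCT. The filter count (20), FFT size (256), number of retained coefficients (13), and the 0.95 pre-emphasis coefficient are assumptions; dct requires the Signal Processing Toolbox.

fs = 8000; Nfft = 256; nFilt = 20; nCep = 13;
s = filter([1 -0.95], 1, seg) .* hamming(length(seg));  % pre-emphasis + Hamming window
P = abs(fft(s, Nfft)).^2;  P = P(1:Nfft/2+1);           % one-sided power spectrum
mel  = @(f) 2595*log10(1 + f/700);                      % Hz -> Mel, equation (5.43)
imel = @(m) 700*(10.^(m/2595) - 1);                     % Mel -> Hz
edges = imel(linspace(mel(0), mel(fs/2), nFilt + 2));   % filter edge frequencies in Hz
bins  = floor(Nfft * edges / fs) + 1;                   % FFT bin index of each edge
fb = zeros(nFilt, Nfft/2 + 1);                          % triangular Mel filters
for m = 1:nFilt
    lo = bins(m); c = bins(m+1); hi = bins(m+2);
    fb(m, lo:c) = linspace(0, 1, c - lo + 1);           % rise to unity at the centre frequency
    fb(m, c:hi) = linspace(1, 0, hi - c + 1);           % fall to zero at the next centre
end
mfccs = dct(log(fb * P + eps));                         % DCT of the log filter-bank energies
mfccs = mfccs(1:nCep);                                  % keep the first 13 coefficients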

5.2.13 DELTA ENERGY AND DELTA SPECTRUM

The voice signal changes from frame to frame, for example in the slope of a formant at its transitions. Therefore, features related to the change in cepstral features over time need to be added: 13 delta (velocity) features (for the 12 cepstral features plus energy) and 13 double-delta (acceleration) features, giving 39 features per frame in total. The energy in a frame, for a signal x in a window from time sample t1 to time sample t2, is given by

Energy = Σ_{t=t1}^{t2} x²(t)        (5.44)

where x(t) is the signal.

Each of the 13 delta features represents the change between frames of the corresponding cepstral or energy feature, while each of the 13 double-delta features represents the change between frames of the corresponding delta feature.
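A minimal sketch of the delta and double-delta computation, assuming C is a matrix with one row per frame and 13 columns (12 MFCCs plus energy); a plain first difference is used here, whereas production systems usually use a short regression over neighbouring frames.

d1 = [zeros(1, size(C, 2)); diff(C)];    % delta: change of each feature from the previous frame
d2 = [zeros(1, size(C, 2)); diff(d1)];   % double delta: change of the delta
feat = [C d1 d2];                        % 13 + 13 + 13 = 39 features per frame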

5.2.14 MFCC ADVANTAGES

The characteristics of the slowly varying part of the spectrum are concentrated in the low cepstral coefficients.

Individual MFCC features are only weakly correlated, which turns out to be an advantage for the creation of statistical acoustic models.

It does not have linear characteristics.

Mel scaling has been shown to offer better discrimination between phones, which is an obvious help in recognition.

It has good discriminating properties. MFCCs are derived from the power spectrum of the speech signal, while the phase spectrum is ignored.

MFCC features are advantageous because they mimic some of the human processing of the signal.

5.2.15 DISADVANTAGES OF MFCC

A small drawback is that MFCCs are more computationally expensive than LPC due to the Fast Fourier Transform (FFT) at the early stages to convert speech from the time to the frequency domain.

First, they do not lie in the frequency domain.

However, it is well-known that MFCC is not robust enough in noisy environments, which suggests that the MFCC still has insufficient sound representation capability, especially at low SNR.

Though Mel Frequency Cepstral Coefficients (MFCCs) have been very successful in speech recognition, they have the following two problems:

(1) They do not have any physical interpretation, and

(2) Liftering of cepstral coefficients, found to be highly useful in the earlier dynamic time warping-based speech recognition systems, has no effect on the recognition process when used with continuous-density HMMs.

Features derived from either the power spectrum or the phase spectrum alone have limitations in representing the signal.

5.3 WAVELET ANALYSIS

5.3.1 INTRODUCTION

There is no theoretical guarantee that the MFCC represents a syllable without loss of information. With LPC and MFCC, noise is more of a problem in a real-time recognition system, and a speech-enhancement stage has to be cascaded with the system to avoid it; with wavelets, the filtering inherent in the analysis removes much of the noise without a separate enhancement stage, so better recognition accuracy is possible compared with the other two techniques.

A wavelet is a wave-like oscillation with amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" like one might see recorded by a seismograph or heart monitor. Generally, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets can be combined, using a "reverse, shift, multiply and sum" technique called convolution, with portions of an unknown signal to extract information from the unknown signal.

The fundamental idea behind wavelets is to analyze according to scale. The wavelet analysis procedure adopts a wavelet prototype function called an analyzing wavelet or mother wavelet. Any speech signal can then be represented by translated and scaled versions of the mother wavelet. Wavelet analysis is capable of revealing aspects of the data that other speech-signal analysis techniques miss; the extracted features are then passed to a classifier for the recognition of isolated words.

The integral wavelet transform is the integral transform defined as:

W(a, b) = (1/√a) ∫ x(t) ψ*((t − b)/a) dt        (5.45)

Where a is positive and defines the scale and b is any real number and defines the shift.

For decomposition of the speech signal, we can use different techniques such as Fourier analysis, the short-time Fourier transform (STFT), and wavelet transforms.

5.3.2 WAVELET ANALYSIS

A wavelet is a waveform of effectively limited duration that has an average value of zero. Compare wavelets with sine waves, which are the basis of Fourier analysis: sinusoids do not have limited duration; they extend from minus to plus infinity. And where sinusoids are smooth and predictable, wavelets tend to be irregular and asymmetric.

Fourier analysis consists of breaking up a signal into sine waves of various frequencies. Similarly, wavelet analysis is the breaking up of a signal into shifted and scaled versions of the original (or mother) wavelet.

Mathematically, the process of Fourier analysis is represented by the Fourier transform:

F(ω) = ∫ f(t) e^{−jωt} dt        (5.46)

which is the sum over all time of the signal f(t) multiplied by a complex exponential. (Recall that a complex exponential can be broken down into real and imaginary sinusoidal components.)

The results of the transform are the Fourier coefficients, which, when multiplied by a sinusoid of frequency ω, yield the constituent sinusoidal components of the original signal.

Similarly, the continuous wavelet transform (CWT) is defined as the sum over all time of the signal multiplied by scaled, shifted versions of the wavelet function:

C(scale, position) = ∫ f(t) ψ(scale, position, t) dt        (5.47)

The results of the CWT are many wavelet coefficients C, which are a function of scale and position. Multiplying each coefficient by the appropriately scaled and shifted wavelet yields the constituent wavelets of the original signal.

5.3.3 WAVELET TRANSFORM

A signal can be represented in two domains: (i) the time domain and (ii) the frequency domain. In the time domain we know the time information of the signal, whereas in the frequency domain we know its spectral or frequency information. Transforming a signal means representing it in another form so that hidden information, which is not available in the normal form, can be extracted; the transform does not change the information content of the signal. For example, a signal is usually represented in the time domain, and its frequency information cannot be seen in that domain, so to find the frequency information we transform the signal into the frequency domain by taking its Fourier transform. The Fourier transform of a signal gives the frequency information, or spectrum, and nothing else; time and frequency information cannot be obtained from it simultaneously.

The wavelet transform provides this information. It represents a signal in the time as well as the frequency domain; in other words, we can know the frequency information corresponding to each time. Wavelet analysis of a signal is done by multiplying a wavelet function with the signal to be analyzed, and the transform is then computed for each segment generated.

The transform of a signal can be found by using a wavelet basis function, known as the mother wavelet; all other wavelet functions used in the transformation are derived from this basis function by translation and scaling. The continuous wavelet transform of a signal can be represented by the following equation:

CWT(τ, s) = (1/√|s|) ∫ x(t) ψ*((t − τ)/s) dt        (5.48)

where x(t) is the signal to be analyzed and ψ(t) is the mother wavelet or basis function. τ and s are the translation and scale parameters, respectively. The scaling parameter s is related to frequency; the translation parameter τ is related to the location of the window as the window is shifted through the signal, and it represents the time information in the transform.

Computation of the CWT proceeds in the following steps (a code sketch follows the list):

Take a wavelet and compare it to a section at the start of the original signal.

Calculate a number, C, that represents how closely correlated the wavelet is with this section of the signal. The higher C is, the more the similarity. More precisely, if the signal energy and the wavelet energy are equal to one, C may be interpreted as a correlation coefficient. Note that the results will depend on the shape of the wavelet you choose.

Figure 5.11 . Correlated value of Wavelet

Shift the wavelet to the right and repeat steps 1 and 2 until you’ve covered the whole signal.

Figure 5.12 Correlated value of shifted wavelet

Scale (stretch) the wavelet and repeat steps 1 through 3.

Repeat steps 1 through 4 for all scales.
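As a minimal sketch of these steps (the wavelet shape, the set of scales, and the use of plain convolution are assumptions; the Wavelet Toolbox provides optimized routines), the following MATLAB code correlates scaled and shifted copies of a Morlet-like mother wavelet with a signal x:

x = x(:).';                                  % signal to analyze (assumed given), as a row vector
psi = @(t) exp(-t.^2/2) .* cos(5*t);         % mother wavelet (real Morlet-like, assumed shape)
scales = 2.^(1:0.5:6);                       % assumed set of scales, in samples
C = zeros(numel(scales), numel(x));
for k = 1:numel(scales)
    s = scales(k);
    n = -ceil(4*s):ceil(4*s);                % wavelet support in samples, grows with the scale
    w = psi(n/s) / sqrt(s);                  % stretched, energy-normalised wavelet (steps 4-5)
    C(k, :) = conv(x, fliplr(w), 'same');    % compare and shift over the whole signal (steps 1-3)
end
% each row of C holds the coefficients C(scale, position); large values mark strong similarity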

5.3.4 WAVELET FAMILY

The wavelet family includes the following types of wavelets; their use depends on the type of application. Some of the most widely used wavelet functions are shown in Figure 5.15. The Haar wavelet is the oldest, whereas the Daubechies wavelets are the most popular and are used in many applications, including signal and image processing. The Haar, Daubechies, Symlet, and Coiflet wavelets are compactly supported orthogonal wavelets.

Figure 5.15: Wavelet Families

(a) Haar (b) Daubechies 4 (c) Coiflet 1 (d) Symlet 2 (e) Meyer (f) Morlet (g) Mexican Hat


