1. Introduction

In real life, speech is usually subject to noise and distortion, which reduce the intelligibility of the spoken message. Therefore, the enhancement of speech corrupted by noise is an important problem with numerous applications, such as suppressing environmental noise in communication systems and hearing aids and improving the quality of old recordings.

The purpose of speech enhancement is to improve some perceptual aspects of speech for the human listener or to improve the speech signal so that it may be better exploited by other speech processing algorithms.

Speech enhancement depends on signal processing and on human perceptual factors. Speech quality and intelligibility depend mainly on the short-term spectral amplitude and are relatively insensitive to spectral phase. Speech is usually treated as stationary over a short period of time (10 ms to 20 ms), so it is processed frame by frame. Some widely used processing methods are spectral subtraction, Wiener filtering and iterative Wiener filtering.

In this project, we focus on spectral subtraction. The data is segmented and windowed with 50% overlap and is processed frame by frame. The original phase is preserved while the spectral magnitude components are modified.


2. Statement of the Problem

The task of the project is to design an enhancement algorithm using spectral subtraction to get an estimate of the uncorrupted speech while achieving the highest possible intelligibility. Since intelligibility is hard to describe mathematically, we instead use the mean squared value of the estimation error e, together with the SNR improvement, as measures of quality.
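As an illustration of these two measures, the following Python sketch computes the mean squared error and the SNR improvement from sample sequences. The SNR definition used here (clean-signal energy over error energy, in dB) and all function names are assumptions made for this sketch, since the report does not state them explicitly.

```python
import math

def mean_squared_error(clean, estimate):
    # Mean squared value of the estimation error e = clean - estimate.
    return sum((s - e) ** 2 for s, e in zip(clean, estimate)) / len(clean)

def snr_db(clean, observed):
    # Assumed definition: clean-signal energy over error energy, in dB.
    signal_energy = sum(s * s for s in clean)
    error_energy = sum((s - o) ** 2 for s, o in zip(clean, observed))
    return 10.0 * math.log10(signal_energy / error_energy)

def snr_improvement_db(clean, noisy, enhanced):
    # Output SNR minus input SNR.
    return snr_db(clean, enhanced) - snr_db(clean, noisy)

clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.5, -0.5, 1.5, -0.5]
enhanced = [1.1, -0.9, 1.1, -0.9]
improvement = snr_improvement_db(clean, noisy, enhanced)
```

Both measures need the clean signal as a reference, so they can only be evaluated on test material where the uncorrupted speech is available.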

The properties of the noise and the nature of the corruption vary from one scenario to another, so we make the following assumptions in our project. The background noise is slowly varying, additive and independent of the speech. It is stationary over a relatively long time compared with the speech, so that the expected values of its spectral magnitude during speech activity remain almost the same as before the speech activity.

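The noise model is given as follows:

$$y(n) = s(n) + d(n)$$

where $s(n)$ is the original speech, $d(n)$ is the additive noise and $y(n)$ is the corrupted speech. Taking the Fourier transform, we get

$$Y(e^{j\omega}) = S(e^{j\omega}) + D(e^{j\omega})$$

where $Y(e^{j\omega})$, $S(e^{j\omega})$ and $D(e^{j\omega})$ are the Fourier transforms of $y(n)$, $s(n)$ and $d(n)$, respectively.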

The structure of the problem is presented in Figure 1.

Figure 1 Structure of the problem

3. Spectral Subtraction Approach

In this project, we have implemented two spectral subtraction approaches. In both algorithms, the input noisy speech data sequence is segmented and windowed into 50% overlapped frames. After spectral subtraction and all other processing, the estimated speech signal segments are overlapped and added together to form the enhanced speech sequence.
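The segmentation and overlap-add steps can be sketched in Python as follows. This is a minimal sketch, not the project code: the function names, the frame length of 16 samples and the use of a periodic Hann window (chosen here because its 50% overlapped copies sum to one) are assumptions.

```python
import math

def hann(frame_len):
    # Periodic Hann window; with 50% overlap, shifted copies sum to 1.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
            for n in range(frame_len)]

def split_frames(x, frame_len):
    # Segment x into windowed frames with a hop of half a frame.
    # frame_len is assumed to be even.
    hop = frame_len // 2
    w = hann(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return [[x[s + n] * w[n] for n in range(frame_len)] for s in starts]

def overlap_add(frames, frame_len):
    # Add each frame back at its original position.
    hop = frame_len // 2
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[k * hop + n] += value
    return out

signal = [math.sin(0.05 * n) for n in range(64)]
frames = split_frames(signal, 16)
rebuilt = overlap_add(frames, 16)
```

In this sketch the frames are added back unmodified, so away from the first and last half frame `rebuilt` matches `signal`. In the enhancement algorithms, the spectral modification of each frame would take place between the two steps.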


3.1 Spectral Subtraction with Voice Activity Detection (VAD)

In this algorithm, we use VAD to determine whether a frame should be considered speech or non-speech. The noise spectrum estimate is calculated from the frames classified as non-speech. After spectral subtraction, we apply residual noise reduction and additional attenuation during non-speech activity to reduce the well-known “musical noise” and to further lower the noise level. Figure 2 shows the structure of this algorithm.

Figure 2 Block diagram of spectral subtraction using VAD


In each speech frame, the energy in the frame, E, the linear prediction error normalized with respect to the energy of the signal, LPE, and the zero-crossing rate, ZCR, are calculated. Using all these three parameters, a compound parameter, D, is calculated as D=E(1-ZCR)(1-LPE). From all the frames of the signal, is computed. Then the value of is used to determine whether a signal has speech activity or not. The threshold values have to be obtained empirically. The frames are classified as speech and non-speech frames. The non-speech frames will be used to obtain the noise magnitude spectrum estimate.
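A possible computation of the compound parameter D for one frame is sketched below in Python. The details are assumptions made for illustration: ZCR is taken as the fraction of adjacent sample pairs that change sign, LPE is obtained from a Levinson-Durbin recursion of order 4 on the frame autocorrelation, and all function names are hypothetical.

```python
import math

def frame_energy(frame):
    return sum(v * v for v in frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ, in [0, 1].
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def normalized_lpe(frame, order=4):
    # Linear prediction error energy divided by the frame energy, in [0, 1],
    # computed with the Levinson-Durbin recursion (order is an assumption).
    r = [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
         for lag in range(order + 1)]
    if r[0] == 0.0:
        return 1.0  # silent frame: nothing to predict
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 1e-12 * r[0]:
            return 0.0  # frame is (numerically) perfectly predictable
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return min(1.0, max(0.0, err / r[0]))

def compound_parameter(frame):
    # D = E (1 - ZCR) (1 - LPE), as defined in the text.
    return (frame_energy(frame)
            * (1.0 - zero_crossing_rate(frame))
            * (1.0 - normalized_lpe(frame)))

voiced_like = [math.sin(0.2 * n) for n in range(160)]
hiss_like = [0.01 * (-1) ** n for n in range(160)]
d_voiced = compound_parameter(voiced_like)
d_hiss = compound_parameter(hiss_like)
```

A strong, slowly oscillating frame gives a large D, while a weak frame with many zero crossings gives a D close to zero, which is the behavior the speech/non-speech decision relies on.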


After the FFT, we use the average magnitude of three frames as the spectral estimate. Averaging over three frames reduces the variance of the noise spectral estimate without violating the non-stationarity of speech too much.
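The three-frame magnitude averaging might look like the following Python sketch. It assumes that the three frames are the previous, current and next frame and that fewer frames are averaged at the two ends of the signal; the report does not specify either point, and the function name is hypothetical.

```python
def average_three_frames(magnitudes):
    # magnitudes[k][i] is the magnitude of frequency bin i in frame k.
    # Each frame is replaced by the mean of itself and its neighbours.
    smoothed = []
    for k in range(len(magnitudes)):
        group = magnitudes[max(0, k - 1):min(len(magnitudes), k + 2)]
        n_bins = len(magnitudes[k])
        smoothed.append([sum(f[i] for f in group) / len(group)
                         for i in range(n_bins)])
    return smoothed

example = average_three_frames([[0.0], [3.0], [6.0], [3.0]])
```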

Noise Estimation:

The noise estimate for each frequency bin is obtained by averaging the signal magnitude spectrum over the non-speech frames. This is valid because we assume the noise is long-time stationary. Let $\mu_i$ be the estimate of the noise spectral magnitude in frequency bin $i$.
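A minimal Python sketch of this averaging is given below. It assumes that the magnitude spectra and the VAD decisions are already available as lists; the function and argument names are hypothetical.

```python
def estimate_noise_magnitude(magnitudes, is_speech):
    # magnitudes[k][i]: magnitude of bin i in frame k.
    # is_speech[k]: VAD decision for frame k (True means speech).
    noise_frames = [m for m, speech in zip(magnitudes, is_speech) if not speech]
    if not noise_frames:
        raise ValueError("no non-speech frames available")
    n_bins = len(noise_frames[0])
    # Per-bin average over all non-speech frames.
    return [sum(f[i] for f in noise_frames) / len(noise_frames)
            for i in range(n_bins)]

noise = estimate_noise_magnitude(
    [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]],
    [False, False, True],
)
```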

Spectral Subtraction and Half-Wave Rectification:

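The spectral subtraction estimate $|\hat{S}_i(k)|$ is obtained by subtracting the expected noise magnitude spectrum $\mu_i$ from the magnitude spectrum of the noisy signal $|Y_i(k)|$. Thus,

$$|\hat{S}_i(k)| = |Y_i(k)| - \mu_i, \quad i = 0, 1, \ldots, L-1 \quad \text{and} \quad k = 0, 1, \ldots, M-1$$

where $i$ is the frequency bin index, $k$ is the frame index, $L$ is the FFT length and $M$ is the number of frames.

After subtraction, any difference with a negative value is set to zero.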
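For one frame, the subtraction, the half-wave rectification and the reuse of the noisy phase can be sketched in Python as follows. The function takes the complex spectrum of the noisy frame and the noise magnitude estimate per bin; the names are hypothetical and the FFT itself is left out of the sketch.

```python
import cmath

def spectral_subtract(noisy_spectrum, noise_magnitude):
    # noisy_spectrum[i]: complex FFT value of bin i of the noisy frame.
    # noise_magnitude[i]: noise magnitude estimate for bin i.
    enhanced = []
    for y, mu in zip(noisy_spectrum, noise_magnitude):
        magnitude = max(abs(y) - mu, 0.0)  # half-wave rectification
        # Keep the phase of the noisy spectrum, change only the magnitude.
        enhanced.append(cmath.rect(magnitude, cmath.phase(y)))
    return enhanced

out = spectral_subtract([3 + 4j, 1j, -2 + 0j], [1.0, 2.0, 0.5])
```

In the second bin the noise estimate exceeds the noisy magnitude, so the rectification sets that bin to zero.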

Residual Noise Reduction:

For each frequency bin i  

where = maximum value of noise residual (after spectral subtraction) measured during non-speech activity. This process is to reduce the commonly known “musical noise”.


Additional Signal Attenuation During Non-Speech Activity:

During non-speech activities, we would like to further attenuate the noise for more noise reduction. Let , where is the noise spectral estimate.

Then, the output spectral estimate including additional attenuation during non-speech activity is given by

for T

  , otherwise.

where . The threshold of T is determined empirically.


3.2 Spectral Subtraction with Time-Frequency Filtering

The second approach is a modified version of the first one. In this algorithm, we do not use VAD to determine whether a frame should be considered speech or non-speech. Instead, we use a recursive system to estimate the noise spectrum. If the noisy speech spectral magnitude is much larger than the estimated noise magnitude, we consider this a rough detection of speech. In that case, the recursive accumulation stops and the prior noise estimate is used as the noise estimate for the current frame, since we assume the noise is long-time stationary compared with the speech.

We use time-frequency filtering instead of residual noise reduction to reduce the “musical noise”. Time-frequency filtering is performed using several frames preceding and several frames following the frame of interest. The filtering locates isolated spectral peaks and then attenuates them.

Figure 3 shows the structure of this algorithm.

Figure 3 Block diagram of spectral subtraction with time-frequency filtering

The first algorithm does not work well when the background noise is not stationary or the SNR is low. The second method does not need speech pause detection and hence should work better at very low SNR.

This method differs from the first method in several aspects.

1.     There is no longer any VAD in this algorithm. Instead, we use a recursive system to estimate the noise spectral magnitude.

The input to the recursion is the spectral magnitude of the noisy speech sequence at frame k in subband (frequency bin) i. Since considerably higher values occur at the onset of speech, a threshold is introduced as a multiple of the noise estimate, with a factor in the range of about 1.5 to 2.5. When the actual spectral component exceeds this threshold, this is considered a rough detection of speech and the recursive accumulation is stopped. Because of the assumption of long-term stationarity, we can use the noise estimate obtained before such frames as the noise estimate for them.
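One way such a recursive estimator could be written is sketched below in Python. The first-order update with forgetting factor `lam`, the initialization from the first frame and the per-bin hold rule are assumptions made for this sketch; only the idea of stopping the accumulation above a threshold of about 1.5 to 2.5 times the noise estimate comes from the text.

```python
def track_noise(magnitudes, beta=2.0, lam=0.9):
    # magnitudes[k][i]: noisy-speech magnitude of bin i in frame k.
    # beta: threshold factor (assumed to multiply the current estimate).
    # lam: hypothetical forgetting factor of the recursive average.
    noise = list(magnitudes[0])  # assumption: the first frame is noise only
    history = [list(noise)]
    for frame in magnitudes[1:]:
        for i, y in enumerate(frame):
            if y <= beta * noise[i]:
                # No speech detected: continue the recursive accumulation.
                noise[i] = lam * noise[i] + (1.0 - lam) * y
            # Otherwise hold the previous estimate for this bin.
        history.append(list(noise))
    return history

history = track_noise([[1.0], [1.0], [5.0], [1.2]])
```

In the example, the third frame exceeds twice the current estimate, so the estimate is held, and the update resumes with the fourth frame.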


2.     Time-Frequency filtering

We use time-frequency filtering to reduce the musical noise. It is performed using several frames preceding and several frames following the frame of interest. The area of analysis is defined by the two regions shown below, region A and region B.

Figure 4 Analysis regions



The decision as to whether or not region B contains an isolated peak is made by

, then region B contains an isolated peak

Otherwise, region B does not contain an isolated peak

where the threshold parameter specifies the ratio of energies in regions A and B. Peaks that appear in several successive frames, occupy a wide band, or lie close to other spectral peaks are likely to be due to speech and are left unmodified by this procedure. In contrast, isolated peaks are typically caused by noise and are hence removed by this step.

If region B is considered to contain an isolated peak, then

 for  and .

Otherwise, we keep the original spectral components unchanged.


3.     Noise estimation

After we obtain the recursive noise estimate, we multiply it by a factor to estimate the noise spectrum. The usual range of this factor is 1.5 to 2.5. A large factor tends to remove more noise, but also introduces more speech distortion.

4.     Modified half-wave rectification

After subtraction, negative difference values are not set to zero. Instead, we take their absolute value and attenuate it by 50 dB.
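The scaled subtraction and the modified rectification of items 3 and 4 can be sketched in Python as follows. The sketch assumes that the 50 dB attenuation applies to the amplitude, that is, a gain of 10^(-50/20); the function and parameter names are hypothetical, and the default factor of 1.5 is only an example from the usual range.

```python
def subtract_modified(noisy_magnitude, noise_magnitude, alpha=1.5,
                      attenuation_db=50.0):
    # alpha: factor applied to the noise estimate before subtraction.
    # attenuation_db: attenuation applied to rectified negative differences.
    gain = 10.0 ** (-attenuation_db / 20.0)
    result = []
    for y, n in zip(noisy_magnitude, noise_magnitude):
        diff = y - alpha * n
        # Negative differences are not zeroed: rectify and attenuate them.
        result.append(diff if diff >= 0.0 else abs(diff) * gain)
    return result

modified = subtract_modified([4.0, 1.0], [2.0, 2.0])
```

Compared with setting negative values to zero, this keeps a very low but non-zero spectral floor in the bins where the scaled noise estimate exceeds the noisy magnitude.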

4. Results

For both methods, a Hanning window with 50% overlap is used to segment and window the input data. After spectral modification, the segments are overlapped and added together to obtain the enhanced speech.

Three noisy speech samples have been enhanced by the above two algorithms.

1.     Speech sample recorded in an F16 cockpit voice recorder.

2.     Speech sample recorded in a factory.

3.     Speech sample recorded in a car.

Before converting spectral magnitude back to time domain, a smoothing of the PSD is carried out for every frame to avoid large variation. The equation to smooth the mth frame is given by:

where  ranges from 0.5 to 0.9. In the project, we use . This smoothing is done for both methods.  


Method One:

In the project, we set the threshold of  as 0.05 and as -12dB.

Since the VAD threshold is determined empirically, the VAD works well and detects most of the speech activity. The following figure gives a typical example of the VAD output with the threshold set at 0.05.

 Figure 5 Typical VAD results

In this algorithm, the noise estimate is the average spectral magnitude of all frames classified as non-speech. In a real-time system, the noise estimate would instead be updated with every new input frame.

The following three figures show the clean, noisy and enhanced signals in the time and frequency domains for the speech recorded in an F16 cockpit voice recorder, in a factory and in a car, respectively.

Figure 6(a) Time domain waveforms of clean, noisy and enhanced speech (F16)

Figure 6(b) Spectrogram of clean, noisy and enhanced speech (F16)

Figure 7(a) Time domain waveforms of clean, noisy and enhanced speech (factory)

Figure 7(b) Spectrogram of clean, noisy and enhanced speech (factory)

Figure 8(a) Time domain waveforms of clean, noisy and enhanced speech (car)

Figure 8(b) Spectrogram of clean, noisy and enhanced speech (car)

From Figures 6-8, we can see that the proposed algorithm does make some improvement. However, it does not work very well, especially for the speech recorded in a factory. In addition, there is still some distortion in the enhanced speech. Moreover, the “musical noise” is present in all three enhanced speech samples, and it is very difficult to reduce. Therefore, the intelligibility of the so-called enhanced speech may be even worse than that of the noisy speech.


Method Two:

In the project, we use ,,,,,and .

Figures 9-11 show the clean, noisy and enhanced signals in the time and frequency domains for the speech recorded in an F16 cockpit voice recorder, in a factory and in a car, respectively.

Figure 9(a) Time domain waveforms of clean, noisy and enhanced speech (F16)

Figure 9(b) Spectrogram of clean, noisy and enhanced speech (F16)

Figure 10(a) Time domain waveforms of clean, noisy and enhanced speech (factory)

Figure 10(b) Spectrogram of clean, noisy and enhanced speech (factory)

Figure 11(a) Time domain waveforms of clean, noisy and enhanced speech (car)

Figure 11(b) Spectrogram of clean, noisy and enhanced speech (car)

From the above figures, we can see that the second algorithm works better than the first one, especially for the noisy speech recorded in a factory. However, there is still some “musical noise” in the enhanced speech.  

The clean speech, the speech corrupted by the various noise sources, and the speech enhanced by the two methods are listed in the following table.

Background noise | Clean Speech | Noisy Speech | Enhanced Speech (Method 1) | Enhanced Speech (Method 2)
F16 Noise        |              |              |                            |
Factory Noise    |              |              |                            |
Volvo Noise      |              |              |                            |

Table 1  Clean, noisy and enhanced speech

The following tables show the SNR improvement by method two for four different sentences for four different background noises: F16 noise, factory noise, white noise and car noise.

Sentence One: “Good service should be rewarded by big tips.”

Background noise | F16 Noise | Factory Noise | White Noise | Volvo Noise
Input SNR        |           |               |             |
Output SNR       |           |               |             |
SNR Improvement  |           |               |             |

Table 2 SNR improvement for sentence one

Sentence Two: “The fifth jar contains big juicy peaches.”

Background noise | F16 Noise | Factory Noise | White Noise | Volvo Noise
Input SNR        |           |               |             |
Output SNR       |           |               |             |
SNR Improvement  |           |               |             |

Table 3 SNR improvement for sentence two

Sentence Three: “Draw every outline first then fill in the interior.”

Background noise | F16 Noise | Factory Noise | White Noise | Volvo Noise
Input SNR        |           |               |             |
Output SNR       |           |               |             |
SNR Improvement  |           |               |             |

Table 4  SNR improvement for sentence three

Sentence Four: “Scholars argue history.”

Background noise | F16 Noise | Factory Noise | White Noise | Volvo Noise
Input SNR        |           |               |             |
Output SNR       |           |               |             |
SNR Improvement  |           |               |             |

Table 5 SNR improvement for sentence four

From the above tables, we can see that a higher input SNR results in a larger SNR improvement. This is because at a higher input SNR the speech signal is less corrupted, so it is easier to recover the speech from the background noise and hence to achieve a larger SNR improvement.

5. Discussion

For the first method, we found that although the residual noise reduction reduces the musical noise a little, it also causes great distortion of the speech. Therefore, we do not use this step in the project, and as a result the musical noise is very obvious in the first method.

For the second approach, the recursive noise estimation system acts roughly as a VAD. The onset-detection threshold factor is chosen so that most speech onsets can be detected. Empirically, we set it to 2. The factor that multiplies the noise estimate determines the level to which the noise is removed. Values that are too small do not reduce the noise effectively, while values that are too large cause too much speech distortion. During our project, we found that a value of 1.5 is a good choice. The other parameters are also determined empirically.

Although we use time-frequency filtering and additional noise reduction during non-speech activity, there still exists some musical noise. The next step of the project will focus on this issue.



