Speech Recognition Using Recurrent Neural Networks

by Bryan Davis

University of Florida

 

Isolated-word speech recognition can be divided into two main data reduction problems:

Short-term time variance (<10ms)

Long-term time variance (>10ms)

This project focuses on long-term time variance. Hidden Markov models (HMMs) are the most prevalent way to handle it, but they are not the only one…

Plot of 2 (of 13) MFCC Features vs. Time

Why Use a Neural Net?

Potentially less computation during operation mode than HMMs (no Viterbi search)

Distributed computation makes them much faster if implemented using specialized hardware

They’re Cool!!

But... they take a long time to train

Use a Neural Net … But How?

A neural net with an appropriate design should be able to find these features. But what design?

The network must be time-dependent

Time-delay feed-forward neural nets (TDNNs) don’t work on their own because the signal isn’t stationary (why?)

Recurrent Neural Nets are needed to account for contextual information

Inputs and Outputs

One RNN for entire system

10 outputs, 1 for each word

Too hard…many problems

One RNN for each word (as when using HMMs)

Train individually using k-step prediction

Use MSE to classify during operation

Still hard to do, but some positive results
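The one-model-per-word scheme above can be sketched in a few lines. This is an illustration only, not the project's code: a least-squares linear predictor stands in for each word's RNN, the "words" are synthetic two-feature sinusoids rather than MFCC frames, and the names fit_predictor, prediction_mse, classify, make_word, k and order are all assumptions made for the example.

```python
import numpy as np

def fit_predictor(seqs, k=1, order=3):
    """Fit a linear k-step-ahead predictor (a stand-in for one word's RNN)."""
    X, Y = [], []
    for s in seqs:                                  # s has shape (T, n_features)
        for t in range(order, len(s) - k + 1):
            X.append(s[t - order:t].ravel())        # the last `order` observed frames
            Y.append(s[t + k - 1])                  # the frame k steps after them
    W, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)
    return W

def prediction_mse(W, s, k=1, order=3):
    """Mean squared k-step prediction error of one model on one utterance."""
    errs = [s[t + k - 1] - s[t - order:t].ravel() @ W
            for t in range(order, len(s) - k + 1)]
    return float(np.mean(np.square(errs)))

def classify(models, s, k=1, order=3):
    """Pick the word whose predictor explains the utterance with the lowest MSE."""
    scores = {word: prediction_mse(W, s, k, order) for word, W in models.items()}
    return min(scores, key=scores.get)

# Synthetic "words": noisy sinusoids with different rates (made-up data).
rng = np.random.default_rng(0)

def make_word(rate, phase, T=80):
    t = np.arange(T)[:, None]
    clean = np.hstack([np.sin(rate * t + phase), np.cos(rate * t + phase)])
    return clean + 0.01 * rng.standard_normal(clean.shape)

models = {
    "one": fit_predictor([make_word(0.3, p) for p in (0.0, 1.0, 2.0)]),
    "two": fit_predictor([make_word(1.2, p) for p in (0.0, 1.0, 2.0)]),
}
guess_one = classify(models, make_word(0.3, 0.5))
guess_two = classify(models, make_word(1.2, 0.5))
```

Each model is trained only on its own word, so at recognition time the model that predicts the incoming frames best is taken as the answer. Swapping the linear predictor for a trained RNN would leave the classify step unchanged.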

More Complications

An Elman network was used (the feedback connection is an exponentially decaying integrator)

The input layer is a gamma-filtered TDNN (time-delay NN) with 10 delay taps and memory depth of 100
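A rough sketch of the two memory mechanisms named above may help. It assumes the standard gamma delay line, in which each tap is a leaky integrator fed by the previous tap, and it assumes "memory depth" means taps divided by the gamma parameter mu, so that 10 taps with depth 100 gives mu = 0.1. The exact form of the context update, the weight shapes, and the names gamma_memory, elman_step, mu and decay are assumptions for illustration, not details taken from the project.

```python
import numpy as np

def gamma_memory(x, taps=10, mu=0.1):
    """Gamma delay line over x of shape (T, n); returns shape (T, taps + 1, n).

    Tap 0 is the raw input; tap j obeys
    g_j(t) = (1 - mu) * g_j(t - 1) + mu * g_{j-1}(t - 1).
    With mu = 1 this reduces to an ordinary tapped delay line.
    """
    x = np.asarray(x, dtype=float)
    T, n = x.shape
    g = np.zeros((T, taps + 1, n))
    g[:, 0] = x
    for t in range(1, T):
        g[t, 1:] = (1 - mu) * g[t - 1, 1:] + mu * g[t - 1, :-1]
    return g

def elman_step(x_t, context, Wx, Wc, decay=0.9):
    """One step of an Elman-style layer whose context units are
    exponentially decaying integrators of the hidden activity (assumed form)."""
    hidden = np.tanh(Wx @ x_t + Wc @ context)
    context = decay * context + (1 - decay) * hidden
    return hidden, context

# Impulse response of the last tap: its mean delay is the memory depth.
impulse = np.zeros((3000, 1))
impulse[0, 0] = 1.0
taps_out = gamma_memory(impulse, taps=10, mu=0.1)
last_tap = taps_out[:, -1, 0]
mean_delay = float(np.sum(np.arange(3000) * last_tap) / np.sum(last_tap))
```

In a full network, the taps at time t (taps_out[t] flattened into one vector) would be the input x_t passed to elman_step. A smaller mu trades time resolution for a longer reach into the past without adding taps.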

Results

80% correct recognition using HW #4 data without added noise (95% correct using HMMs on equivalent inputs)

Obtained on non-optimized and poorly trained networks, due to time constraints

Better results are expected with more training and optimization

Conclusions

The RNNs used are not well suited to classifying functions of time directly

Training separate RNNs for each word using prediction works quite well

RNNs could be combined with HMMs or other ANNs for continuous word recognition

Further Work

Train with noise added to the speech

Apply global optimization to the fixed parameters in the RNN

Try different NN architectures

Find a better method of comparing the outputs than weighted mean squared error (another NN, fuzzy rules, entropy measures, etc.)