Speech Recognition Using Recurrent Neural Networks
by Bryan Davis
University of Florida
Isolated-word speech recognition can be divided into two main data reduction problems:
Short-term time variance (<10ms)
Long-term time variance (>10ms)
This project focuses on long-term time variance. HMMs are the most prevalent way to handle it, but they aren’t the only one…
Plot of 2 (of 13) MFCC Features vs. Time
Why Use a Neural Net?
Potentially less computation during operation mode than HMMs (no Viterbi search)
Distributed computation makes them much faster if implemented using specialized hardware
They’re Cool!!
But... they take a long time to train
Use a Neural Net … But How?
A neural net with an appropriate design should be able to find these features. But what design?
The network must be time-dependent
Time-delay feed-forward neural nets (TDNNs) don’t work, because the signal isn’t stationary (why?)
Recurrent Neural Nets are needed to account for contextual information
Inputs and Outputs
One RNN for entire system
10 outputs, 1 for each word
Too hard…many problems
One RNN for each word (as when using HMMs)
Train individually using k-step prediction
Use each word network’s prediction MSE to classify during operation
Still hard to do, but some positive results
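The one-network-per-word scheme above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: a least-squares linear predictor stands in for each trained RNN so the example stays short and runnable, synthetic 2-D sequences replace the 13 MFCC features, and the names k_step_pairs, LinearPredictor, classify, and make_word, along with k = 3, are hypothetical. It also assumes that the word whose predictor gives the lowest MSE is the one chosen.

```python
import numpy as np


def k_step_pairs(seq, k):
    """Split a (T, D) feature sequence into inputs and k-step-ahead targets."""
    return seq[:-k], seq[k:]


class LinearPredictor:
    """Stand-in for one per-word RNN: a least-squares k-step linear predictor."""

    def __init__(self, k):
        self.k = k
        self.W = None

    def fit(self, sequences):
        # Pool the (input, target) pairs from every training utterance of the word.
        inputs, targets = zip(*(k_step_pairs(s, self.k) for s in sequences))
        X, Y = np.vstack(inputs), np.vstack(targets)
        self.W, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return self

    def mse(self, seq):
        # Mean squared k-step prediction error of this word model on one utterance.
        X, Y = k_step_pairs(seq, self.k)
        return float(np.mean((X @ self.W - Y) ** 2))


def classify(seq, predictors):
    """Return the word whose predictor has the lowest MSE, plus all errors."""
    errors = {word: p.mse(seq) for word, p in predictors.items()}
    return min(errors, key=errors.get), errors


def make_word(omega, phase, length=50):
    """Synthetic 'word': a 2-D rotation whose speed omega plays the role of the word identity."""
    t = np.arange(length)
    return np.stack([np.cos(omega * t + phase), np.sin(omega * t + phase)], axis=1)


# One predictor per word, each trained only on examples of its own word.
predictors = {
    "zero": LinearPredictor(k=3).fit([make_word(0.2, 0.0), make_word(0.2, 0.7)]),
    "one": LinearPredictor(k=3).fit([make_word(0.5, 0.0), make_word(0.5, 0.7)]),
}

label, errors = classify(make_word(0.5, 1.3), predictors)
print(label, errors)
```

With real data, each LinearPredictor would be replaced by a trained RNN exposing the same mse method; the classification step itself would stay unchanged.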
More Complications
An Elman network was used (the feedback connection is an exponentially decaying integrator)
The input layer is a gamma-filtered TDNN (time-delay NN) with 10 delay taps and a memory depth of 100
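A forward pass through this kind of architecture might look like the sketch below. Several details are assumptions, since the slides do not give them: memory depth is taken to mean taps divided by the gamma parameter (so 10 taps and depth 100 give mu = 0.1), the undelayed input is kept as tap 0, the context unit is modeled as c(t) = tau * c(t-1) + (1 - tau) * h(t-1), and the hidden size of 20, tau = 0.9, and the random untrained weights are purely illustrative. The names gamma_memory and ElmanForward are hypothetical.

```python
import numpy as np


def gamma_memory(u, num_taps=10, depth=100.0):
    """Run a (T, D) input through a gamma memory.

    Each tap follows x_k(t) = (1 - mu) * x_k(t-1) + mu * x_{k-1}(t-1),
    with x_0(t) = u(t) and mu = num_taps / depth (assumed definition of depth).
    Returns an array of shape (T, num_taps + 1, D).
    """
    mu = num_taps / depth
    T, D = u.shape
    taps = np.zeros((num_taps + 1, D))
    out = np.zeros((T, num_taps + 1, D))
    for t in range(T):
        prev = taps.copy()  # tap values at time t-1
        taps[0] = u[t]
        taps[1:] = (1.0 - mu) * prev[1:] + mu * prev[:-1]
        out[t] = taps
    return out


class ElmanForward:
    """Forward pass of an Elman-style net with a leaky-integrator context unit."""

    def __init__(self, n_in, n_hidden, n_out, tau=0.9, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained random weights, for illustration only.
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.tau = tau

    def run(self, x):
        """Map a (T, n_in) input sequence to a (T, n_out) output sequence."""
        n_hidden = self.W_ctx.shape[0]
        h = np.zeros(n_hidden)
        c = np.zeros(n_hidden)
        y = np.zeros((x.shape[0], self.W_out.shape[0]))
        for t in range(x.shape[0]):
            # Context: exponentially decaying average of past hidden states.
            c = self.tau * c + (1.0 - self.tau) * h
            h = np.tanh(self.W_in @ x[t] + self.W_ctx @ c)
            y[t] = self.W_out @ h
        return y


# Stand-in for 200 frames of 13 MFCC features.
mfcc = np.random.default_rng(1).normal(size=(200, 13))
features = gamma_memory(mfcc).reshape(200, -1)  # (200, 11 * 13)
net = ElmanForward(n_in=features.shape[1], n_hidden=20, n_out=13)
prediction = net.run(features)
print(features.shape, prediction.shape)
```

Setting depth equal to num_taps gives mu = 1, in which case the gamma memory reduces to an ordinary tapped delay line; smaller mu trades time resolution for a longer memory with the same number of taps.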
Results
80% correct recognition using HW #4 data without added noise (95% correct using HMMs on equivalent inputs)
Obtained on non-optimized and poorly trained networks, due to time constraints
Better results are expected with more training and optimization
Conclusions
The RNNs used are not well suited to directly classifying functions of time
Training a separate RNN for each word using prediction works quite well (80% correct, even with poorly trained networks)
RNNs could be combined with HMMs or other ANNs for continuous word recognition
Further Work
Train with noise added to the speech
Apply global optimization to the fixed parameters in the RNN
Try different NN architectures
Try to find a better method of comparing the outputs than weighted mean squared error (another NN, fuzzy rules, entropy measures, etc.)
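For reference, the weighted mean squared error baseline mentioned in the last item could be computed along these lines. The slides do not say how the weights were chosen, so the per-coefficient weighting and the name weighted_mse are assumptions made for illustration.

```python
import numpy as np


def weighted_mse(prediction, target, weights):
    """Weighted MSE between two (T, D) sequences.

    The squared error is averaged over time for each of the D coefficients,
    then combined with a weighted average over coefficients.
    """
    per_coefficient = np.mean((prediction - target) ** 2, axis=0)
    return float(np.average(per_coefficient, weights=weights))


# Toy example: two coefficients with different error sizes.
target = np.tile([1.0, 2.0], (4, 1))
prediction = np.zeros_like(target)
print(weighted_mse(prediction, target, weights=[3.0, 1.0]))
```

With uniform weights this reduces to the ordinary MSE, so any alternative comparison method can be evaluated against it on the same prediction errors.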