By Seth McNeill
The goal of this project is to write a program that uses a camera to detect motion. When it detects motion, it will ask who is moving and then ask that person for a password. It will be trainable, so more people can be added to the database. It will use word recognition to determine the name, and speaker identification (ID) to verify who says the password.
The word recognition (for names) will probably use mel-frequency cepstral coefficients and a hidden Markov model (HMM). The speaker ID will probably use something similar, but tailored toward speaker-dependent features, such as the speech excitation.
For everything else, see the final report.
During this week I read chapter 14 of Quatieri's Discrete-Time Speech Signal Processing: Principles and Practice. This chapter describes the different methods and features used in speaker ID and verification. It turns out that one of the best feature sets is the mel-cepstrum coefficients. These are computed from the log energies of the filters in a mel-scale filter bank (via a cosine transform). Usually, the first coefficient is dropped due to its sensitivity to changes in overall signal level.
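The filter-bank-plus-cepstrum idea can be sketched in a few dozen lines. This is a minimal illustration in Python rather than Matlab or C; the mel warping formula (2595·log10(1 + f/700)) is the common convention, and all function names and parameter choices here are my own, not from the book.

```python
import math

def hz_to_mel(f):
    # Common mel-scale warping (an assumption; other variants exist)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.
    Returns n_filters weight vectors, each of length n_fft//2 + 1."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low_mel + i * (high_mel - low_mel) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    banks = []
    for j in range(1, n_filters + 1):
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(bins[j - 1], bins[j]):
            if bins[j] > bins[j - 1]:
                filt[k] = (k - bins[j - 1]) / (bins[j] - bins[j - 1])
        for k in range(bins[j], bins[j + 1]):
            if bins[j + 1] > bins[j]:
                filt[k] = (bins[j + 1] - k) / (bins[j + 1] - bins[j])
        banks.append(filt)
    return banks

def mel_cepstrum(power_spectrum, banks, n_ceps):
    """Log filter-bank energies followed by a DCT-II.
    The zeroth coefficient is skipped, since it mostly tracks
    overall signal level."""
    log_e = []
    for filt in banks:
        e = sum(p * w for p, w in zip(power_spectrum, filt))
        log_e.append(math.log(max(e, 1e-12)))
    n = len(log_e)
    ceps = []
    for i in range(1, n_ceps + 1):  # start at 1: drop c0
        c = sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / n)
                for j in range(n))
        ceps.append(c)
    return ceps
```

The power spectrum fed in would come from an FFT of a windowed speech frame; that part is omitted here.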
Gaussian mixture models (GMMs) are often used (at least in this book) for modeling the data. These ignore the time dependence of the speech that HMMs would capture. I assume time dependence isn't as important in speaker ID and verification, because we are just looking for a model of the speaker's voice, not necessarily of what they are saying.
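Scoring a feature vector under a GMM is just a weighted sum of Gaussians, usually done in the log domain for stability. A minimal sketch, assuming diagonal covariances (a common simplification, though not stated in the text above):

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.
    weights: mixture weights summing to 1;
    means, variances: one list per mixture component."""
    dim = len(x)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_g = math.log(w)
        for d in range(dim):
            log_g += -0.5 * (math.log(2.0 * math.pi * var[d])
                             + (x[d] - mu[d]) ** 2 / var[d])
        log_terms.append(log_g)
    m = max(log_terms)  # log-sum-exp trick avoids underflow
    return m + math.log(sum(math.exp(t - m) for t in log_terms))
```

A full system would sum this over all frames of an utterance and train one GMM per speaker (e.g. by EM), which is not shown here.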
For speaker verification, they used a background (impostor) model to compare against the actual speaker's model. Building such a background set is a recurring problem in pattern recognition. In a face recognition project Dr. Nechyba showed us in pattern recognition class, they fed in pictures without faces, and anything the algorithm recognized as a face was added to the not-face database. I'm not sure how I would create such a database for speaker verification. One possibility: every time the program mistakes an impostor for the real person and I catch it, I add that utterance to the impostor database. For now, my goal will just be some form of speaker ID (differentiating between known speakers) or threshold-based verification: if the test utterance scores above some threshold under a speaker's model, my program will say it is that person.
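The threshold-based decision I have in mind could be sketched like this. The speaker names, scores, and threshold below are purely illustrative, and the scores are assumed to be per-speaker average log-likelihoods of the test utterance:

```python
def identify_speaker(utterance_scores, threshold):
    """Pick the best-scoring known speaker, or return None if even the
    best model scores below the threshold (a possible impostor).
    utterance_scores: dict mapping speaker name -> average log-likelihood
    of the test utterance under that speaker's model."""
    best = max(utterance_scores, key=utterance_scores.get)
    if utterance_scores[best] < threshold:
        return None
    return best
```

For example, `identify_speaker({"alice": -40.0, "bob": -55.0}, -50.0)` would return `"alice"`, while a stricter threshold of `-30.0` would reject both and return `None`. A proper background model would replace the fixed threshold with a likelihood ratio, but this captures the fallback plan described above.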
The chapter also discussed non-spectral features, which are usually used in conjunction with cepstral coefficients. I will probably use cepstral coefficients along with delta coefficients, which approximate the first time derivative of the cepstral coefficients.
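Delta coefficients are commonly computed with a linear-regression formula over a few neighboring frames rather than a raw first difference. A sketch, assuming that convention (the window size of 2 is my choice, not from the chapter):

```python
def delta_coefficients(ceps_frames, window=2):
    """First-order (delta) coefficients from a sequence of cepstral
    frames, via the standard regression over +/- `window` frames.
    Edge frames are handled by clamping the indices."""
    n_frames = len(ceps_frames)
    dim = len(ceps_frames[0])
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n_frames):
        d = []
        for i in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                right = ceps_frames[min(t + k, n_frames - 1)][i]
                left = ceps_frames[max(t - k, 0)][i]
                num += k * (right - left)
            d.append(num / denom)
        deltas.append(d)
    return deltas
```

On a linearly increasing cepstral track the interior deltas come out to exactly the slope, which is the sanity check I would use.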
I have now started looking for C code to extract cepstral coefficients. I would need to know how to program in Linux/UNIX environments: I found several different toolboxes of the sort I want for those environments, but not much for Windows. Finally, I went to Mark Skowronski's webpage, looked through the presentation he gave our class on 31 March 2003, and found the link I was looking for on the HTK (Hidden Markov Model Toolkit) page. It was a list of ASR toolkits and software. I think the MSState ASR Toolkit may be what I need.
This week I worked on creating a speaker-dependent recognition system in Matlab for homework 6. I have also been reading through some C code for building HMMs; Dr. Nechyba put the source code on his website for our projects last semester in EEL6825 (Pattern Recognition).