By Seth McNeill
The goal of this project is to write a program that uses a camera to detect motion. When it detects motion, it will ask who is moving and then ask that person for a password. It will be trainable, so more people can be added to the database. It will use word recognition to determine the name, and speaker identification (ID) to verify who says the password.
The word recognition (for names) will probably use mel-frequency cepstral coefficients and a hidden Markov model (HMM). The speaker ID will probably use something similar, but tailored toward speaker-dependent features, such as the speech excitation.
For everything else, see the final report.
During this week I read chapter 14 of Quatieri's Discrete-Time Speech Signal Processing: Principles and Practice. This chapter describes the different methods and features used in speaker ID and verification. It turns out that one of the best feature sets is the mel-cepstrum coefficients. These are computed from the log energies of the filters in a mel-scale filter bank (via a cosine transform). Usually, the first coefficient is dropped due to its sensitivity to changes in overall signal level.
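The filter-bank-plus-cepstrum idea can be sketched in a few dozen lines. This is a minimal illustration in Python rather than Matlab or C; the mel warping formula (2595·log10(1 + f/700)) is the common convention, and all function names and parameter choices here are my own, not from the book.

```python
import math

def hz_to_mel(f):
    # Common mel-scale warping (an assumption; other variants exist)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.
    Returns n_filters weight vectors, each of length n_fft//2 + 1."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low_mel + i * (high_mel - low_mel) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    banks = []
    for j in range(1, n_filters + 1):
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(bins[j - 1], bins[j]):
            if bins[j] > bins[j - 1]:
                filt[k] = (k - bins[j - 1]) / (bins[j] - bins[j - 1])
        for k in range(bins[j], bins[j + 1]):
            if bins[j + 1] > bins[j]:
                filt[k] = (bins[j + 1] - k) / (bins[j + 1] - bins[j])
        banks.append(filt)
    return banks

def mel_cepstrum(power_spectrum, banks, n_ceps):
    """Log filter-bank energies followed by a DCT-II.
    The zeroth coefficient is skipped, since it mostly tracks
    overall signal level."""
    log_e = []
    for filt in banks:
        e = sum(p * w for p, w in zip(power_spectrum, filt))
        log_e.append(math.log(max(e, 1e-12)))
    n = len(log_e)
    ceps = []
    for i in range(1, n_ceps + 1):  # start at 1: drop c0
        c = sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / n)
                for j in range(n))
        ceps.append(c)
    return ceps
```

The power spectrum fed in would come from an FFT of a windowed speech frame; that part is omitted here.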
Gaussian mixture models (GMMs) are often used (at least in this book) for modeling the data. These ignore the time dependence of the speech that HMMs would capture. I assume time dependence isn't as important in speaker ID and verification, because we are just looking for a model of the speaker's voice, not necessarily of what they are saying.
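Scoring a feature vector under a GMM is just a weighted sum of Gaussians, usually done in the log domain for stability. A minimal sketch, assuming diagonal covariances (a common simplification, though not stated in the text above):

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.
    weights: mixture weights summing to 1;
    means, variances: one list per mixture component."""
    dim = len(x)
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_g = math.log(w)
        for d in range(dim):
            log_g += -0.5 * (math.log(2.0 * math.pi * var[d])
                             + (x[d] - mu[d]) ** 2 / var[d])
        log_terms.append(log_g)
    m = max(log_terms)  # log-sum-exp trick avoids underflow
    return m + math.log(sum(math.exp(t - m) for t in log_terms))
```

A full system would sum this over all frames of an utterance and train one GMM per speaker (e.g. by EM), which is not shown here.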
For speaker verification, they used a background (impostor) model to compare against the actual speaker's model. Building such a background set is a recurring problem in pattern recognition. In a face recognition project Dr. Nechyba showed us in pattern recognition class, they fed in pictures without faces, and anything the algorithm recognized as a face was added to the not-face database. I'm not sure how I would create such a database for speaker verification. One possibility: every time the program mistakes an impostor for the real person and I catch it, I add that utterance to the impostor database. For now, my goal will just be some form of speaker ID (differentiating between known speakers) or threshold-based verification: if the test utterance scores above some threshold under a speaker's model, my program will say it is that person.
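The threshold-based decision I have in mind could be sketched like this. The speaker names, scores, and threshold below are purely illustrative, and the scores are assumed to be per-speaker average log-likelihoods of the test utterance:

```python
def identify_speaker(utterance_scores, threshold):
    """Pick the best-scoring known speaker, or return None if even the
    best model scores below the threshold (a possible impostor).
    utterance_scores: dict mapping speaker name -> average log-likelihood
    of the test utterance under that speaker's model."""
    best = max(utterance_scores, key=utterance_scores.get)
    if utterance_scores[best] < threshold:
        return None
    return best
```

For example, `identify_speaker({"alice": -40.0, "bob": -55.0}, -50.0)` would return `"alice"`, while a stricter threshold of `-30.0` would reject both and return `None`. A proper background model would replace the fixed threshold with a likelihood ratio, but this captures the fallback plan described above.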
The chapter also discussed non-spectral features, which are usually used in conjunction with cepstral coefficients. I will probably use cepstral coefficients along with delta coefficients, which approximate the first time derivative of the cepstral coefficients.
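Delta coefficients are commonly computed with a linear-regression formula over a few neighboring frames rather than a raw first difference. A sketch, assuming that convention (the window size of 2 is my choice, not from the chapter):

```python
def delta_coefficients(ceps_frames, window=2):
    """First-order (delta) coefficients from a sequence of cepstral
    frames, via the standard regression over +/- `window` frames.
    Edge frames are handled by clamping the indices."""
    n_frames = len(ceps_frames)
    dim = len(ceps_frames[0])
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n_frames):
        d = []
        for i in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                right = ceps_frames[min(t + k, n_frames - 1)][i]
                left = ceps_frames[max(t - k, 0)][i]
                num += k * (right - left)
            d.append(num / denom)
        deltas.append(d)
    return deltas
```

On a linearly increasing cepstral track the interior deltas come out to exactly the slope, which is the sanity check I would use.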
I have now started looking for C code to extract cepstral coefficients. I would need to know how to program in Linux/UNIX environments: I found several different toolboxes of the sort I want for those environments, but not much for Windows. Finally, I went to Mark Skowronski's webpage, looked through the presentation he gave our class on 31 March 2003, and found the link I was looking for on the HTK (Hidden Markov Model Toolkit) page. It was a list of ASR toolkits and software. I think the MSState ASR Toolkit may be what I need.
This week I worked on creating a speaker-dependent recognition system in Matlab for homework 6. I have also been reading through some C code for building HMMs; Dr. Nechyba put the source code on his website for our projects last semester in EEL6825 (Pattern Recognition).