Bryan Davis

EEL 6814

11/01/2001

Project 1

Introduction

In this project we try to classify segments of a guitar waveform according to the note that was played. The data was run through a 256-point FFT, and only the first 50 magnitude components were retained. Each note contains 149 50-point segments, and there are 12 notes. We took two approaches to designing this classifier. The first was to leverage information about the input domain, which in theory should help us filter out noise and create discriminant boundaries between the classes. The second was to use powerful generic algorithms to analyze the data and generate the discriminant boundaries. In general, the neural networks discussed so far in this course fall into the latter category: they are powerful generic algorithms that do not depend on the input domain. It turns out that every attempt we made to use our knowledge of the input domain to reduce the complexity of the data only served to reduce the performance of our classifier.

 

The computer used for simulations is an Intel Celeron running at 433 MHz, using Matlab 5.2.

Classifier Design

The classifier is divided into three parts: the first consists of normalization and quantization; the second is the preprocessor, which reduces the dimensionality of the data; the third is the neural-network classifier, which classifies the data using an MLP.

Normalization and Quantization

In order to speed up the training of the MLP, it is often helpful to normalize the inputs. This is especially important when using first-order methods and less important for second-order methods. But because second-order methods often include first-order terms, and these terms can become dominant in non-quadratic regions of the weight space, convergence is still sped up by normalization.

 

Quantization seems to greatly speed up convergence, with little loss of information. If no preprocessing is used other than normalization and quantization to eight levels, the convergence rate tends to speed up by a factor of 5 over using normalization alone.
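As a concrete illustration, a minimal MATLAB sketch of the normalization and quantization step is given below. It assumes that each segment is a 50-point vector x of FFT magnitudes and that the normalization is a simple per-segment scaling to [0, 1]; the report does not fix these details, so they are assumptions.

    % Assumed details: x is one 50-point magnitude segment; normalization
    % is taken to be per-segment min-max scaling to [0, 1].
    x  = x(:);                                 % force a column vector
    xn = (x - min(x)) / (max(x) - min(x));     % normalize to [0, 1]
    L  = 8;                                    % number of quantization levels
    xq = round(xn * (L - 1)) / (L - 1);        % snap to 8 evenly spaced values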

Preprocessor

The input data space is very large (50-dimensional). With 12 output neurons and N hidden neurons, the total number of weights is 51N + 12(N + 1) = 63N + 12, which is 516 when N = 8. Fortunately, the conjugate-gradient algorithm had no trouble with this large number of weights (see the Results section for more details).

The proper preprocessing, however, might give even better results. Many methods were tried, and only a few of them performed well. The idea was to choose a method of preprocessing that leveraged our knowledge of the application domain; namely, to choose features that would seem useful in determining the fundamental frequency of a sampled waveform consisting of a fundamental tone, its harmonics, and noise. The following paragraphs document the six methods tried and summarize the results of each.

Method 1: Look at the positions of the K largest peaks

The idea behind this method was that for each note, the spectrum should peak at a few frequencies and have lulls at the others. The locations of the peaks should be relatively immune to noise, while the values at the peaks would be affected by it. Thus we should choose as our input space the frequencies at which the K largest peaks occur. A problem would arise, however, if we ordered the peaks from largest to smallest, because two peaks of similar size could easily be transposed. This would cause a problem for the neural net, because the two locations would then occur in very different regions of the input space. This problem can be reduced if the K largest peaks are ordered not by their magnitude but by their frequency, so that these transpositions do not occur. The problem that then arises is that peaks on the threshold of being included in the K largest varied considerably between samples, causing the same problem that transposition caused with magnitude-ordered peaks. Another problem with this method is that the peaks were often not at exactly the same frequency in different samples, but at very close frequencies. This method performed very poorly for all values of K tested.
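A minimal MATLAB sketch of this feature extraction is given below. For simplicity it treats the K largest magnitude values as the "peaks" (a true peak picker would also check the neighboring samples); x is assumed to be one 50-point magnitude segment.

    % Sketch of Method 1 (simplified): indices of the K largest magnitudes,
    % ordered by frequency rather than by size.
    K = 5;
    [sortedVals, idx] = sort(x(:));     % ascending sort of the 50 magnitudes
    topIdx = idx(end-K+1:end);          % frequency indices of the K largest values
    features = sort(topIdx);            % order by frequency, not by magnitude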

Method 2: Bin the data into bins of width D and choose the position of the largest value in the bin

This method is basically the same as Method 1, except that we now break the data up into bins and perform the Method 1 analysis on each bin. This mostly solves the peak-ordering problem encountered in Method 1, because we choose only one peak from each bin instead of multiple peaks. It does not solve the problem of neighboring frequencies being chosen because of noise and sampling resolution. In addition, it creates a new problem: the peak positions in each bin are similar between the classes, which makes discrimination difficult. This method also performed poorly for the values of D tested.
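A minimal MATLAB sketch of this binned peak-position feature, again assuming x is one 50-point magnitude segment:

    % Sketch of Method 2: split the 50 points into bins of width D and keep
    % the frequency index of the largest value in each bin.
    D = 5;
    X = reshape(x(1:50), D, 50/D);          % one column per bin
    [peakVal, pos] = max(X);                % position of the peak within each bin
    features = pos + D * (0:(50/D) - 1);    % convert to absolute frequency indices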

Method 3: Bin the data into bins of width D and choose the average frequency in each bin

This method is similar to Method 2, but whereas in Method 2 the peak position in each frequency bin was chosen, here we choose the average frequency in each bin. The advantage of this method over Method 2 is that the average frequency can take a continuous range of values instead of a small set of discrete ones, which provides potentially more information per input dimension. The problem with this method is that many waveforms will have the same average value in a bin, and we then lose the discriminability between those waveforms. This method produces results much better than Methods 1 and 2, but worse than the following methods.
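The report does not state exactly how the "average frequency" is computed; the sketch below assumes a magnitude-weighted average (a per-bin spectral centroid), which matches the description of a continuous-valued feature.

    % Sketch of Method 3 (assumed details): magnitude-weighted average
    % frequency in each bin.
    D = 5;
    X = reshape(x(1:50), D, 50/D);              % one column per bin
    f = reshape((1:50)', D, 50/D);              % frequency index of each sample
    features = sum(X .* f) ./ (sum(X) + eps);   % centroid of each bin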

Method 4: Bin the data into bins of width D and choose the average energy in each bin

This method is similar to Method 3 in that we average over each bin, but it differs from all of the previous methods in that instead of looking at the locations where the peaks or averages occur, we look at the values themselves. At first this method seems to lose the most important characterization of a note, namely the fundamental frequency. But in the given data the notes are separated widely enough that, if we choose the bin width correctly, the energy in each bin (most of which is associated with a harmonic) falls into different bins for at least some of the harmonics of the different notes. Note also that this method is equivalent to taking the average value of D samples and using that average as an input to the classifier. This method performs the best so far.
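A minimal MATLAB sketch of this feature, using the report's observation that it amounts to averaging D consecutive samples:

    % Sketch of Method 4: average the D magnitude values in each bin,
    % reducing the 50 inputs to 50/D features.
    D = 5;
    X = reshape(x(1:50), D, 50/D);   % one column per bin
    features = mean(X);              % one average per bin (10 features for D = 5)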

Method 5: A combination of Methods 3 & 4

The idea behind this method is to combine the two best feature sets so far and see if we can gain any information from the correlation between the two. Since they are very similar in construction but measure different characteristics, it makes sense that together they could provide better classification. Experimentally this turns out to be true, but at the cost of twice as many weights; if we used those extra weights to get more bins out of Method 4, we would be better off. With bin width D = 5, this method performs as well as using no preprocessing. With D = 10, it gives roughly an 80% reduction in the number of weights.
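A minimal MATLAB sketch of the combined feature set, under the same assumptions as the Method 3 and Method 4 sketches:

    % Sketch of Method 5: concatenate the Method 3 and Method 4 features,
    % computed with D = 10 so the total input dimension matches D = 5 alone.
    D = 10;
    X = reshape(x(1:50), D, 50/D);
    f = reshape((1:50)', D, 50/D);
    centroids = sum(X .* f) ./ (sum(X) + eps);   % Method 3 features
    energies  = mean(X);                         % Method 4 features
    features  = [centroids, energies];           % 2*(50/D) = 10 features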

Method 6: Principal Component Analysis

The idea behind this method is to use the second-order statistics of the data to determine the orthogonal linear combinations of the inputs that produce the greatest variance and those that produce the smallest, without regard to the desired response. We keep the linear combinations that produce the largest few variances and discard the other dimensions. This method is generic: it uses no information particular to this application. The parameter that is adjusted can be either the number of dimensions kept in the new input space or the percentage of energy that is discarded. We chose to optimize the latter, since it is concerned with preserving the information contained in the input space rather than with the amount of weight reduction. This method performs better than any other method used, even better than using no preprocessing at all.
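A minimal MATLAB sketch of this step is given below. The variable name Xall is an assumption: it holds the normalized segments, one per row, as an N-by-50 matrix. Keeping roughly 90% of the variance corresponds to discarding about 10% of the energy.

    % Sketch of Method 6 (PCA via the eigenvectors of the covariance matrix).
    N  = size(Xall, 1);
    Xc = Xall - ones(N, 1) * mean(Xall);     % remove the mean of each input
    C  = cov(Xc);                            % 50-by-50 covariance matrix
    [V, E] = eig(C);
    [lam, order] = sort(-diag(E));           % sort eigenvalues, largest first
    lam = -lam;  V = V(:, order);
    m = min(find(cumsum(lam) / sum(lam) >= 0.90));   % retain ~90% of the energy
    Y = Xc * V(:, 1:m);                      % reduced-dimension inputs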

Neural-Network Classifier

To do the job of classification, a one-hidden-layer MLP was used. A few different training algorithms were tried: Levenberg-Marquardt, conjugate-gradient descent, and back-propagation with momentum. Through trial and error, it was found that conjugate-gradient descent provided the best results. The Levenberg-Marquardt algorithm is too slow to use with a large number of weights (since it is O(W^2)), and it converges to local minima in a few iterations, and thus does not give good global results. Back-propagation with momentum is also very slow, and it does not converge to as good a minimum as the conjugate-gradient method. The conjugate-gradient method was much faster than the other methods, and it converged to a good minimum even without preprocessing.

 

The adjustable parameter in the conjugate-gradient algorithm is the first-order factor in the update, λ. This factor is used to remove some of the second-order effects when performing the line search, since those effects can cause the line-search algorithm to diverge in non-quadratic regions of the line. The factor λ is multiplied by a growth factor λup when the updated weights produce a larger error than the previous weights, and by a factor λdown when the error is smaller than with the previous weights. These factors were determined experimentally, and we found that values of λup = 4 and λdown = 0.8 work well.
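A minimal MATLAB sketch of this adaptation rule, written as a small helper function (the function name and arguments are assumptions, not part of the actual training code):

    function lambda = adapt_lambda(lambda, newError, oldError)
    % Sketch of the adaptation rule described above (names assumed).
    lambda_up   = 4;     % growth factor when the step increased the error
    lambda_down = 0.8;   % shrink factor when the step decreased the error
    if newError > oldError
        lambda = lambda * lambda_up;     % back off toward first-order behavior
    else
        lambda = lambda * lambda_down;   % trust the quadratic model more
    end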

 

Because we have a relatively large amount of data available, cross-validation was used to determine when to stop training. Twenty percent of the data was reserved for the test set; of the remainder, 20% (16% of the total) was used for cross-validation and the rest (64% of the total) for training. To determine when to stop, we tested the MLP every nth epoch (n around 20 was optimal) on the cross-validation data to see whether the error had decreased since the last check. Once the cross-validation error began to increase, we stopped training. The reason we did not check the cross-validation error every epoch is that there is inevitably some stochastic noise in the error, due to the small change in weights per epoch and the limited size of the cross-validation set, and this noise could cause training to end prematurely. Checking the cross-validation error only every few epochs allows a larger change in the weights between checks. The drawback is that the estimate of when to stop is less refined, and we could miss the optimal stopping point by as much as n epochs.
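A minimal MATLAB sketch of this stopping rule is given below; train_n_epochs and cv_error are hypothetical placeholders for the actual training and evaluation routines, which the report does not list.

    % Sketch of early stopping: check the cross-validation error every n epochs
    % and stop as soon as it increases.
    n = 20;
    bestErr = Inf;
    stop = 0;
    while ~stop
        net = train_n_epochs(net, trainData, n);   % hypothetical helper
        err = cv_error(net, cvData);               % hypothetical helper
        if err > bestErr
            stop = 1;            % cross-validation error went up: stop training
        else
            bestErr = err;       % still improving: keep going
        end
    end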

 

The number of hidden neurons was also chosen by trial and error. Generally, the best results on the test set were obtained when around 10 to 15 hidden-layer neurons were used. Reasonable results were obtained using as few as four hidden-layer neurons. Using more than around 20 generally resulted in an increase in both test-set error and training time. Since we are trying to classify 12 classes, having around 12 basis functions seems reasonable.

Results

If we use only the normalized data, without quantization or preprocessing, we get reasonable results, but better results are obtained with proper quantization and preprocessing techniques.

Preprocessing

For the preprocessing tests, 8 hidden neurons, and the conjugate-gradient algorithm were used.

No Quantization or Preprocessing

Here we are only normalizing the data. We use 8 hidden nodes because that seems to be about the smallest number we can use without losing too much accuracy. Because there are 516 weights, the system takes a relatively long time for each epoch (about half a second).

 

The average correct-classification rate in the test set is 80% (for this particular set of weights), and it took the slow computer 115 seconds (320 epochs) to complete training. The confusion matrix for one run of the system is given by:

where the columns represent the classifier's response and the rows represent the desired response. Each row is normalized to 1, so cell (i, j) gives the fraction of class i that was classified as class j. The notes are ordered from lowest to highest frequency (E2 – D#3). The data shown are from the test set. All confusion matrices that follow are presented in the same way.
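For reference, a minimal MATLAB sketch of how such a row-normalized confusion matrix can be built, assuming d and y are vectors of desired and predicted class labels (1 to 12) for the test set:

    % Build and row-normalize a 12-by-12 confusion matrix.
    nClasses = 12;
    CM = zeros(nClasses);
    for k = 1:length(d)
        CM(d(k), y(k)) = CM(d(k), y(k)) + 1;       % rows: desired, columns: response
    end
    CM = CM ./ (sum(CM, 2) * ones(1, nClasses));   % each row sums to 1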

The total correct-classification rate for this system was 93%, which is very good. It took some time to train, however (115 seconds).


8 Quantization Levels, No Preprocessing

The improvement in convergence rate over the unquantized data is apparent. The confusion matrix is about as accurate as in the unquantized case; here 92% of the data is classified correctly. Note that the system converges after about 60 epochs in this case, whereas it took 300 epochs to converge in the unquantized case.

 


Preprocessing Method 1

All of the following methods use 8 quantization levels on the inputs.

This method reduces the dimensionality of the input space to K. For this example, we chose K = 5, but the method performed equally poorly for all values of K that we tested.

 

Clearly, this method of dimensionality reduction does not work well with the MLP.

Preprocessing Method 2

This method reduces the dimensionality of the input space by a factor of D. For this example, we chose D = 5, but the method performed equally poorly for all values of D that we tested. This method works slightly better than method 1, but it is still not good.

Preprocessing Method 3

This method reduces the dimensionality of the input space by a factor of D. For this example, we chose D = 5, which is about the optimal value: with D less than 5 there are too few samples in each bin, and with D greater than 5 there are too few bins. For this example we get a correct-classification rate of 88% on the test set, which is pretty good.

Preprocessing Method 4

This method reduces the dimensionality of the input space by a factor of D. For this example, we chose D = 5, which is about the optimal value. For this example we get a correct-classification rate of 91% on the test set, which is very good.

Preprocessing Method 5

This method reduces the dimensionality of the input space by a factor of D/2. It is a combination of Methods 3 and 4. For this example, we chose D = 10, which allows a fair comparison with Methods 3 and 4, since at D = 5 they use a five-fold dimensionality reduction. This method gives results that are comparable to Methods 3 and 4.

Preprocessing Method 6 - PCA

PCA works very well. The optimum value for the amount of energy that we discard seems to be around 10%. Using this value, we obtain the following results: a total correct-classification rate of 91%.

Number of Hidden Neurons

Here we are trying to find the best number of hidden neurons to use. This number depends somewhat on the other parameters, but it is largely independent of the training method and dimensionality reduction. For this test, PCA with 10% energy loss, 8-level quantization, and conjugate-gradient training were used.

4 Hidden Neurons

 

As we can see, 4 hidden neurons did not give the MLP enough degrees of freedom to separate all of the classes. The total correct-classification percentage was 81%.

8 Hidden Neurons

These results were already created, but are displayed again for clarity.

 

16 Hidden Neurons

As shown here, 16 hidden neurons performed the best, with 93% correct classification. This is not much better than 8 hidden neurons, and the variability is higher (the network is more likely to end up in a poor local minimum).

32 Hidden Neurons

With this number of neurons, the test data performance shows a notable decrease.

 

Conclusion

The conjugate-gradient method handles this problem very well. It produces good results even when no dimensionality reduction is performed on the input space. Quantization improves the convergence rate. PCA dimensionality reduction and averaging the energy in bins (Method 4) slightly improve convergence in terms of epochs, but greatly improve convergence in terms of computation time. Note that the preprocessing methods did not really improve the performance of the system; rather, they improved the training time.