Wind Instrument Classification

by Bryan Davis

 

EEL 6825: Pattern Recognition

 Prof. John Harris

Department of Electrical and Computer Engineering

University of Florida

 

December 5, 2001

Abstract

This project classifies sounds produced by different wind instruments according to the instrument that produced them. Eight instruments were used, and each recording was divided into separate notes and preprocessed to obtain the fundamental frequency and the relative amplitudes of its harmonics. The best results were obtained using 1-NN classification on the harmonic amplitudes, considering only reference notes within 1.5 semitones of the test note's pitch as possible neighbors. The error for this best case was 20%.

Table of Contents

Wind Instrument Classification

Abstract

Table of Contents

Introduction

Properties of the Data Set

Instruments

Sampling Rate

Acoustics

Playing Style

Note Variations

Preprocessing

Why Preprocess?

Feature Invariance

Implementation

Classification

Results

Without Pitch Dependence

Dumb Classifier (no features)

1-NN

2-NN

Neural Networks

Bayes Classifier Assuming Normal Distributions

With Pitch Dependence

Dumb Classifier (no features other than pitch)

Bayes Classifier Assuming Normal Distributions

1-NN classifier

k-NN

Dimensionality Reduction

Number of Harmonics

Principal Component Analysis (PCA)

Multiple Discriminant Analysis (MDA)

Conclusions

References

 

Introduction

The goal of this project is to create a program that can identify which instrument, from a set of musical instruments, produced a given sound sample. Ideally, this classifier should work for any instrument of the exact same type, e.g. any bassoon. It should also classify the instrument regardless of the playing style of the musician (e.g. note onset time and duration), the loudness of the note, or the pitch of the note. This report documents the methods used to arrive at the optimal classifier, and attempts to explain why specific choices were made in the quest to find the best classifier.

Properties of the Data Set

Instruments

In order to create a decent classifier, a good set of sound samples was needed. A good, clean set was found at the University of Iowa's Musical Instrument Samples page, which provides CD-quality samples, recorded in an anechoic chamber, of eight different wind instruments: Alto Saxophone, Bassoon, Bb Clarinet, Eb Clarinet, Flute, Horn, Oboe, and Soprano Saxophone. All of these are played by blowing air across a flue or between vibrating reeds.

Sampling Rate

The data was sampled at a rate of 44.1 kHz, which is adequate for the features used. For example, the fundamental frequency of A7 (three octaves above A4) is 3520 Hz, so its 6th harmonic lies at 21.12 kHz, just below the Nyquist frequency for this sample rate. A sample rate of 22.05 kHz, or worse, 11.025 kHz, would have limited the number of harmonics available for the high notes even more.
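
As a quick check of this claim, here is the arithmetic (a small sketch; the rates in the loop are just the ones discussed above):

```matlab
% How many harmonics of a 3520 Hz fundamental fit below Nyquist (fs/2)?
f0 = 3520;                           % fundamental of A7, in Hz
for fs = [44100 22050 11025]         % candidate sampling rates, in Hz
    fprintf('fs = %5d Hz: %d harmonics\n', fs, floor((fs / 2) / f0));
end
```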

Acoustics

The instruments were also played in an anechoic chamber. This effectively eliminates any corrupting outside noise, but more importantly, it removes any acoustic properties of the room, which might resonate with some frequencies. On the other hand, this classifier might not work well for instruments played outside the anechoic chamber, because the room that they are in will influence the harmonic signature of the sound being heard.

Playing Style

Listening to the notes suggests that the instruments were played by a few different players, or at least by one player with varying styles. The clue is that the notes were played with varying durations and inter-note silences, although the web site implicitly claims that the note durations and spacings are the same. This may make the classifier more robust to playing style, although no particular style could be perceived.

Note Variations

The data set was recorded as a series of scales played over the full dynamic range of each instrument, so to use the classifier on individual notes, the notes had to be extracted from the scales. Each scale for each instrument was played at three different volume levels: fortissimo (very loud), mezzo forte (medium loud), and pianissimo (very soft). Some instruments (Alto Saxophone, Flute, and Soprano Saxophone) were also played using vibrato, i.e., rapidly varying the volume and pitch of the note.

Preprocessing

Why Preprocess?

The WAV files contain 16 * 44100 = 705.6 kbits of information per second of sound. This feature space is clearly too large to use for classification directly. Preprocessing must be employed to reduce the dimensionality of the data set by discarding features whose within-class variance is similar to their between-class variance.

Feature Invariance

Because wind instruments are used, the player of the instrument has total control over the macroscopic temporal features of a note, so those features would not accurately characterize the instrument alone. The features used in this project were therefore microscopic temporal features, or equivalently, macroscopic frequency features.

 

In order for the frequency features to be invariant to the pitch of the note, the frequency spectra must be shifted to normalize away the effects of the different pitches. The frequency spectra of notes produced by wind-blown instruments have definite peaks at multiples of the fundamental frequency. These frequencies are known as harmonics, and the phenomenon is due to the properties of the vibrating sound source. Because these features are fundamental to the sound source (in this case, the vibrating instrument body), it was believed that they would make good features for a classifier, assuming that different wind instruments have different harmonic signatures. The fundamental frequency was also used as a feature; it was converted to an octave (log-frequency) scale so that it increases linearly with pitch.
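
The conversion is a simple logarithm; in the sketch below, the A4 = 440 Hz reference is my assumption, since any fixed reference works when only relative pitch matters:

```matlab
% Pitch feature: fundamental frequency in octaves, linear in musical pitch.
fRef  = 440;                   % assumed reference pitch: A4 = 440 Hz
f0    = 261.63;                % example fundamental: middle C, in Hz
pitch = log2(f0 / fRef)        % about -0.75 (middle C is 9 semitones below A4)
```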

 

The amplitudes of the notes were normalized so that the peak value was 1. This negates any effect of amplitude variance between loud and soft notes, or notes played closer to or farther from the microphone.

Implementation

To obtain the fundamental frequency, a power spectral density (PSD) estimate of the note was searched for the first value above a certain threshold; the maximum value of the PSD within a small region around this first value was then taken as the fundamental frequency. The amplitudes of the harmonics were found in a similar way: find the maximum value in a region around the predicted frequency of each harmonic.

 

To use the harmonics in a classifier, a constant number of harmonics must be used, but the maximum available harmonic is determined by the fundamental frequency and the Nyquist rate, and so varies from note to note. We set the number of harmonics to a constant value (20): for low-pitched notes, the harmonics beyond 20 were truncated, and for high-pitched notes, the unavailable harmonics were assumed to be zero.
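
The Matlab sketch below puts these preprocessing steps together, including the peak normalization mentioned earlier. The threshold and search-window widths are illustrative assumptions rather than the project's exact values, and noteFeatures is a hypothetical name:

```matlab
function [f0, amps] = noteFeatures(x, fs, nHarm)
% Extract the fundamental frequency and normalized harmonic amplitudes of
% one note.  x: the note as a column vector; fs: sampling rate in Hz;
% nHarm: fixed number of harmonics to return (20 in this project).
N = length(x);
w = 0.5 - 0.5 * cos(2 * pi * (0:N-1)' / N);    % Hann window
P = abs(fft(x .* w)).^2;                       % crude PSD estimate
P = P(1:floor(N/2));                           % positive frequencies only

% Fundamental: first PSD bin above a threshold, refined to the local peak.
first = find(P > 0.01 * max(P), 1);            % assumed threshold: 1% of peak
lo    = max(1, first - round(0.3 * first));    % assumed +/-30% search region
hi    = min(length(P), first + round(0.3 * first));
[~, k] = max(P(lo:hi));
f0     = (lo + k - 2) * fs / N;                % bin index -> frequency in Hz

% Harmonics: local PSD maximum near each predicted multiple of f0.
amps = zeros(nHarm, 1);                        % harmonics above Nyquist stay 0
for n = 1:nHarm
    fn = n * f0;
    if fn >= fs / 2, break; end
    lo = max(1, floor(0.95 * fn * N / fs));    % assumed +/-5% window
    hi = min(length(P), ceil(1.05 * fn * N / fs));
    amps(n) = sqrt(max(P(lo:hi)));             % power -> amplitude
end
amps = amps / max(amps);                       % peak amplitude normalized to 1
end
```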

Classification

Classification was done on the amplitudes of twenty harmonics. The methods used were nearest-neighbor (NN), Bayes assuming a normal distribution, and neural networks. Principal component analysis (PCA) and multiple discriminant analysis (Duda & Hart 2001) were used to reduce the dimensionality.

 

As these methods did not produce very good results, the data was analyzed to determine why the results were so poor. It turns out that the harmonic signature of an instrument varies as the pitch of the instrument changes.

 

This makes sense, because the vibration modes of the instrument change as its pressure nodes are modified. When an air hole is opened or closed, it changes the location of the displacement nodes on the instrument, and the displacement nodes are what determine the pitch. Changing the displacement nodes should then also change the relative intensities of the harmonics, because it is like creating another instrument for each note. It turns out, though, that the harmonic signatures of adjacent notes (on the same instrument) are similar, while the harmonic signatures of notes far apart in pitch are dissimilar. This is effectively like having many different instruments in one.

 

To account for this fact, when classifying a note we only compare it to other notes of (nearly) the same pitch. This technique was also used by I. Kaminskyj to improve classifier performance (see Multi-feature Musical Instrument Sound Classifier). These properties are illustrated in figures 1-4.

 

 

Figure 1 shows the harmonic values of the adjacent notes Bb, B, middle C, C#, D and Eb played on the Alto Saxophone at fortissimo without vibrato. Notice that the harmonic signature does not vary much between notes.

Figure 2 shows the harmonic values of the widely separated notes Bb3, E4, Bb4, E5, Bb5 and E6 played on the Alto Saxophone at fortissimo without vibrato. Notice that the harmonic signature varies considerably between notes.

 

 

Figure 3 shows the harmonic values of middle C played on the Alto Saxophone in all six different styles. The harmonic signature varies between notes more than in figure 1. This figure illustrates the within-class scattering of the notes used in the pitch-dependent classifier.

Figure 4 shows the harmonic values of middle C played on each instrument at fortissimo without vibrato. The harmonic signature varies greatly between instruments, indicating that the between-class scattering of the harmonic signature is large. This figure is most directly comparable to figure 1; figure 3 includes six different playing styles, while this figure uses only one.

 

To do this, however, we must remember the data from all of the notes in our classifier, as is the case for a nonparametric classifier. Parametric classifiers such as Bayes and neural networks are thus not very suitable, because they must be retrained for each note they are to classify. Alternatively, we could make a separately trained classifier for each note in the total dynamic range of all the instruments, but this is rather impractical, especially when nearest-neighbor suits this problem with little modification. Bayes is not that difficult if we recalculate the parameters for each sample, but it loses many of the advantages of being a parametric classifier.

 

Results

Without Pitch Dependence

Dumb Classifier (no features)

If no features can be used, the best classifier would always choose the Flute, since it has the highest a priori probability (it has the most samples, and the samples are randomly chosen). This would result in a re-substitution success rate of 227 / 1143 = 20%, or an error rate of 80%.

1-NN

Here is the confusion matrix for 1-NN classification, with error estimated by leave-one-out. In this and the following tables, columns give the actual instrument, rows give the classifier's guess, and the final Guess/Actual column is the ratio of the number of times an instrument was guessed to the number of times it actually occurred. The classifier did well on the two Saxes and the Horn, but rather poorly on the other instruments.

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         53       3         2          8     2        16     4     7        0.84
Alto Sax        2        140       21         1     11       13     6     6        1.08
Sopr. Sax       6        11        131        0     12       15     6     11       1.01
Oboe            3        0         1          16    1        4      3     0        0.47
Bb Clar         5        4         4          5     41       11     0     26       0.81
Flute           21       19        15         16    21       142    5     16       1.12
Horn            16       5         10         10    3        20     106   2        1.31
Eb Clar         7        3         6          4     27       6      1     51       0.88
Error rate      53%      24%       31%        73%   65%      37%    19%   57%
Total Error     41%

2-NN

Here is the confusion matrix for "volumetric" 2-NN. NN voting was not used because voting does not resolve ties when there are more than two classes. 2-NN in this context means that the neighbors are searched from closest to furthest until two of the same class are found. A sketch of this rule appears below, followed by the confusion matrix.
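
A minimal sketch of this rule, under assumed variable names (X: the reference feature matrix, one note per row; y: its class labels; z: the query note as a row vector; implicit expansion requires Matlab R2016b or later):

```matlab
% "Volumetric" k-NN: march outward from the query until some class has
% accumulated k votes; that class wins.
k = 2;
guess = 0;                                 % 0 = undecided (no class reached k)
[~, order] = sort(sum((X - z).^2, 2));     % squared distances, ascending
votes = zeros(1, max(y));
for i = order'                             % nearest to furthest
    votes(y(i)) = votes(y(i)) + 1;
    if votes(y(i)) == k
        guess = y(i);                      % first class to reach k votes
        break
    end
end
```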

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         29       4         1          5     1        6      1     4        0.45
Alto Sax        7        87        22         0     11       10     5     6        0.80
Sopr. Sax       8        31        100        1     4        19     7     10       0.95
Oboe            4        2         0          14    3        5      0     0        0.47
Bb Clar         6        9         9          4     30       10     0     24       0.78
Flute           28       23        28         15    20       132    4     28       1.22
Horn            17       23        14         17    8        25     112   6        1.69
Eb Clar         14       6         16         4     41       20     2     41       1.21
Error rate      74%      53%       47%        77%   75%      42%    15%   66%
Total Error     52%

Note that 2-NN is worse than 1-NN, and this trend continues as k increases. This indicates that not enough data points are available to get an accurate estimate of the distributions, due to the highly non-linear relationship between the harmonic signature and the pitch.

Neural Networks

The neural network we used was unable to find a good solution. We used a single-hidden-layer neural net and varied the number of hidden nodes. The conjugate gradient algorithm was used for training (because of its speed), cross-validation was used as the stopping criterion, and the train-CV-test ratios were 56-14-30. The best network we could find produced the results below, which are pretty bad. Basically, this net did well by choosing the Flute very often (2.11 times as often as Flute samples actually occur; see the Guess/Actual column), because the Flute is the instrument with the most samples and hence the highest a priori probability. A sketch of the setup precedes the confusion matrix.
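
The sketch below uses Matlab's current Neural Network Toolbox API; the original 2001 code necessarily used an older interface, so treat this as a modern equivalent rather than the author's code (nHidden is just an example value):

```matlab
% Single hidden layer, conjugate-gradient backpropagation, early stopping
% on a validation set, with a 56-14-30 train/validation/test split.
nHidden = 10;                            % swept in the experiments
net = patternnet(nHidden, 'traincgf');   % conjugate-gradient training
net.divideParam.trainRatio = 0.56;
net.divideParam.valRatio   = 0.14;       % validation set stops training
net.divideParam.testRatio  = 0.30;
net = train(net, X', T');                % X: notes-by-features, T: one-hot labels
guesses = vec2ind(net(X'));              % winning class for each note
```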

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         11       2         5          4     1        9      1     0        1.06
Alto Sax        2        31        14         5     4        2      3     7        1.21
Sopr. Sax       3        2         19         3     2        0      2     0        0.51
Oboe            0        0         0          0     0        0      5     0        0.23
Bb Clar         1        2         0          0     3        1      0     5        0.39
Flute           8        17        21         4     14       51     7     15       2.11
Horn            6        1         2          5     4        2      26    2        1.07
Eb Clar         0        1         0          1     3        0      1     3        0.28
Error rate      65%      45%       69%        100%  90%      22%    42%   91%
Total Error     58%

Bayes Classifier Assuming Normal Distributions

The Bayes classifier did not perform very well when using the harmonic amplitudes as the features. This could have been predicted by looking at figure 2: the within-class variance is too high. In implementing the Bayes classifier, we needed to adjust for the a priori probabilities, which was not required with any other classifier (nearest-neighbor naturally treats each data point as equally likely).
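
A minimal sketch of this classifier, under assumed variable names (X: harmonic features, one note per row; y: labels 1..C; z: one test note as a row vector):

```matlab
% Bayes with one maximum-likelihood normal per instrument and priors taken
% from the sample counts.  The constant -D/2*log(2*pi) is the same for
% every class and is dropped.
C = max(y);
logp = zeros(1, C);
for c = 1:C
    Xc    = X(y == c, :);
    mu    = mean(Xc, 1);
    Sigma = cov(Xc);
    prior = size(Xc, 1) / size(X, 1);
    d     = z - mu;
    logp(c) = -0.5 * (d / Sigma) * d' ...      % Mahalanobis distance term
              - 0.5 * log(det(Sigma)) + log(prior);
end
[~, guess] = max(logp);                        % most probable instrument
```

Here is the confusion matrix: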

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         28       0         6          10    3        9      1     0        0.50
Alto Sax        6        87        13         6     12       2      3     7        0.74
Sopr. Sax       7        25        70         3     1        0      2     0        0.57
Oboe            11       2         5          15    1        0      5     0        0.65
Bb Clar         0        2         1          0     10       1      0     5        0.16
Flute           56       60        87         21    81       51     7     15       5.82
Horn            3        3         5          5     0        2      26    2        1.02
Eb Clar         2        6         3          0     10       0      1     3        0.78
Error rate      75%      53%       63%        75%   92%      22%    42%   91%
Total Error     64%

Note that Bayes takes an even greater advantage of the Flute's higher a priori probability. There is a large overlap between the estimated normal distributions, and the Flute's distribution dominates due to its higher a priori probability, so the classifier chooses the Flute almost six times more often than its natural frequency of occurrence (the 5.82 in the Guess/Actual column).

With Pitch Dependence

Dumb Classifier (no features other than pitch)

If only pitch is used as a feature, then the best classifier looks at the number of notes each instrument has played at that pitch and chooses the instrument with the most. If two or more instruments are tied, it does not matter which one we choose (all are equally likely), so we just choose the first one, which is why the Bassoon and Alto Sax are chosen so often.
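
A sketch of this baseline, assuming it uses the same ±1.5-semitone pitch window as the later pitch-dependent classifiers (variable names are hypothetical; pitches are in octaves, y is a column of labels):

```matlab
% Pitch-only baseline: among reference notes near the test pitch, guess the
% instrument with the most notes.  max returns the first maximizer, which
% is why ties favor the instruments listed first (Bassoon, then Alto Sax).
inRange = abs(pitches - pTest) <= 1.5 / 12;       % +/-1.5 semitones, in octaves
counts  = accumarray(y(inRange), 1, [max(y) 1]);  % notes per instrument
[~, guess] = max(counts);
```

Here is the resulting confusion matrix: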

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         45       36        36         8     14       37     23    17       1.91
Alto Sax        49       111       75         15    50       74     59    38       2.55
Sopr. Sax       10       28        68         20    32       44     23    35       1.37
Oboe            0        0         0          5     2        5      0     1        0.22
Bb Clar         0        0         0          0     0        0      0     0        0.00
Flute           1        8         11         12    20       67     3     28       0.66
Horn            8        2         0          0     0        0      23    0        0.25
Eb Clar         0        0         0          0     0        0      0     0        0.00
Error rate      60%      40%       64%        92%   100%     70%    82%   100%
Total Error     72%

Bayes Classifier Assuming Normal Distributions

The Bayes classifier does not work very well using this method. Bayes needs at least D+1 samples per class to get a non-singular covariance matrix for that class. To make the method work, we needed to reduce the number of features and increase the pitch range, so as to avoid a large number of unclassifiable points. We reduced D to 7 (i.e., we used seven harmonics) and increased the pitch range from ±1.5 to ±4.5 semitones. With these features, we get the following results, again using leave-one-out error estimation.

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         78       17        9          0     3        16     1     5        1.19
Alto Sax        3        116       1          2     4        14     7     2        0.81
Sopr. Sax       5        9         130        10    10       16     2     9        1.01
Oboe            0        1         3          32    1        0      0     1        0.63
Bb Clar         1        10        11         3     52       7      2     24       0.93
Flute           7        14        27         11    15       161    5     15       1.12
Horn            14       10        1          0     5        4      111   5        1.16
Eb Clar         0        8         8          2     28       9      1     58       0.96
Undecided       5        0         0          0     0        0      2     0
Error rate      31%      37%       32%        47%   56%      29%    15%   51%
Total Error     35%

1-NN classifier

The 1-NN classifier using pitch dependence is the best classifier tried. The pitch range used was ±1.5 semitones (a semitone is 1/12 of an octave) from the tested note's pitch. This typically admits nine notes from each instrument (18 from instruments with vibrato notes) as candidate neighbors; deviations from this are caused by errors in the preprocessor. We must now allow for the possibility that no neighbors fall within the required pitch range, due to an error in pitch calculation; in that case no classification can be made, and the classifier returns an undecided result. (We could have made it return the Flute in this case, but not much would be gained, since this situation only arises once here.) A sketch of the rule appears below, followed by the confusion matrix.
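
A minimal sketch of the rule (variable names are hypothetical; pitches are stored in octaves, H holds one row of harmonic amplitudes per reference note, and implicit expansion is assumed):

```matlab
% Pitch-dependent 1-NN: only reference notes within +/-1.5 semitones of the
% test note's pitch are candidate neighbors.  guess = 0 encodes "undecided".
inRange = abs(pitches - pTest) <= 1.5 / 12;    % 1.5 semitones = 1/8 octave
if ~any(inRange)
    guess = 0;                                 % no candidates in range
else
    cand      = find(inRange);
    [~, best] = min(sum((H(cand, :) - hTest).^2, 2));  % nearest candidate
    guess     = y(cand(best));
end
```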

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         92       1         0          0     1        2      4     3        0.91
Alto Sax        2        178       1          0     8        4      4     4        1.09
Sopr. Sax       1        1         173        12    6        7      2     11       1.12
Oboe            0        0         1          33    2        2      0     1        0.65
Bb Clar         1        2         3          5     59       7      0     22       0.84
Flute           2        1         4          5     14       195    1     14       1.04
Horn            13       2         2          0     3        5      119   3        1.13
Eb Clar         2        0         6          5     25       5      0     61       0.87
Undecided       0        0         0          0     0        0      1     0
Error rate      19%      4%        9%         45%   50%      14%    9%    49%
Total Error     20%

Note that the classifier performs well, except that it often confuses the Eb Clarinet with the Bb Clarinet (see the off-diagonal Bb Clar / Eb Clar entries). This is expected, since they are both Clarinets. If we group the two Clarinets together as one instrument we get:

 

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Clarinets  Flute  Horn  Guess/Actual
Bassoon         92       1         0          0     4          2      4     0.91
Alto Sax        2        178       1          0     12         4      4     1.09
Sopr. Sax       1        1         173        12    17         7      2     1.12
Oboe            0        0         1          33    3          2      0     0.65
Clarinets       3        2         9          10    167        12     0     0.86
Flute           2        1         4          5     28         195    1     1.04
Horn            13       2         2          0     6          5      119   1.13
Undecided       0        0         0          0     0          0      1
Error rate      19%      4%        9%         45%   30%        14%    9%
Total Error     16%

k-NN

For 2-NN (the volumetric rule described earlier) we get the following result. Note that the number of undecided points has increased: we now need two neighbors of the same class to determine a winner, and if these are not available within the pitch range, an undecided mark is recorded.

 

Guess \ Actual  Bassoon  Alto Sax  Sopr. Sax  Oboe  Bb Clar  Flute  Horn  Eb Clar  Guess/Actual
Bassoon         70       1         1          0     3        2      3     3        0.74
Alto Sax        6        153       3          0     10       6      7     5        1.03
Sopr. Sax       4        6         150        13    6        8      2     14       1.07
Oboe            0        0         3          28    3        0      0     1        0.58
Bb Clar         2        4         9          5     38       15     2     17       0.78
Flute           8        13        15         11    20       163    7     21       1.14
Horn            20       7         3          0     6        8      107   5        1.21
Eb Clar         2        1         6          3     32       25     1     53       1.03
Undecided       1        0         0          0     0        0      2     0
Error rate      38%      17%       21%        53%   68%      28%    18%   55%
Total Error     33%

Dimensionality Reduction

Number of Harmonics

All of the above results (except where noted) used 20 harmonics, along with the pitch, as features for classification. If we vary the number of harmonics from 1 to 50, we get the blue curve in figure 5: the error rate decreases as the number of harmonics increases, up to about 20 harmonics, after which adding more harmonics has little effect.

Principal Component Analysis (PCA)

Using all 50 harmonics, we can also employ principal component analysis (PCA) to reduce the feature space. In figure 5, the green curve shows what happens as we reduce the dimensionality of the feature space with PCA. The error rate dropped to a low level using only about 15 dimensions and then leveled off. The problem with PCA is that better performance is obtained by simply ignoring the higher harmonics than by trying to incorporate them through PCA.
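
A sketch of this reduction, assuming the Statistics Toolbox pca function (H is the notes-by-harmonics feature matrix):

```matlab
% Project the 50 harmonic amplitudes onto the d leading principal components.
d = 15;                      % roughly where the error curve levels off
[coeff, score] = pca(H);     % coeff: principal directions; score: projections
Hreduced = score(:, 1:d);    % keep the first d components as features
```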

Multiple Discriminant Analysis (MDA)

Since we have more than one class, we can apply multiple discriminant analysis (MDA) to reduce the number of features to at most one fewer than the number of classes (in this case, seven). In figure 5, the red curve shows that MDA performed better than PCA for the first few dimensions, and then did worse. MDA can produce at most seven useful directions, so requesting any larger dimensionality simply yields a seven-dimensional hyperplane embedded in that space.
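
A sketch of MDA via the generalized eigenproblem it reduces to (variable names are assumptions; H: feature matrix, y: labels 1..C):

```matlab
% Find directions w maximizing between-class scatter Sb relative to
% within-class scatter Sw.  Sb has rank at most C-1, hence the limit of
% C-1 useful dimensions.
[~, D] = size(H);
C = max(y);
muAll = mean(H, 1);
Sw = zeros(D);  Sb = zeros(D);
for c = 1:C
    Hc  = H(y == c, :);
    muC = mean(Hc, 1);
    Sw  = Sw + (Hc - muC)' * (Hc - muC);                      % within-class
    Sb  = Sb + size(Hc, 1) * (muC - muAll)' * (muC - muAll);  % between-class
end
[V, L] = eig(Sb, Sw);                        % generalized eigenproblem
[~, order] = sort(real(diag(L)), 'descend');
W = V(:, order(1:min(C - 1, D)));            % keep at most C-1 directions
Hreduced = H * W;
```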

 

Conclusions

Using pitch-dependent 1-NN classification results in a leave-one-out error of 20% on the given instruments, or 16% if the two Clarinets are treated as one instrument. This is a very good result. Pitch-dependent classification is necessary for reasonable results. The optimal number of harmonics was around 20, with more harmonics only increasing the processing time. PCA and MDA did not perform well for dimensionality reduction.

 

The accuracy of using this classifier with data other than the training data remains very questionable. My hunch is that this classifier will perform much worse, especially if the recordings are made in an acoustically active room.

Matlab® code

To request the Matlab® code used for this project, send email to Bryan Davis.

References

1. R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Second edition, John Wiley & Sons, Inc., 2001.

2. David M. Howard and James Angus, Acoustics and Psychoacoustics, Second edition, Focal Press, 2001.

3. I. Kaminskyj, Multi-feature Musical Instrument Sound Classifier.