A Simple Speech Coder Working in Real Time Between Two PCs
Introduction
Voice over IP has become very popular because of the Internet, where bandwidth limitations make it necessary to compress the speech signal. A few years ago, the "Internet telephone" was a hot topic: a piece of software that could transmit speech between two people over the Internet in real time.
In this project, a kind of Internet phone software was developed using some speech coding techniques. The software uses the TCP/IP protocol to transmit speech and audio over the Internet in real time between two PCs. It was developed in Delphi 6.0. Delphi is a sophisticated Windows programming environment, suitable for beginners and professional programmers alike. Using Delphi you can easily create self-contained, user-friendly, highly efficient Windows applications in a very short time. That is why we chose Delphi as our development tool for this project.
The first problem we met was how to transmit data over the Internet using TCP/IP. Since this project focuses mostly on speech coding techniques, we solved it with Delphi's Winsock support, which provides two components: ClientSocket and ServerSocket, working on the client and server side respectively. By building on the communication between these two components and adding some application code, we could implement a communication program easily.
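The ClientSocket/ServerSocket exchange described above can be sketched with standard TCP sockets. This is a minimal illustration in Python rather than Delphi; the host, port, and packet contents are arbitrary choices for the sketch, not values from the project.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 50007   # arbitrary values for this sketch

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind((HOST, PORT))
srv.listen(1)                     # ServerSocket role: wait for a client

def serve_one():
    # Accept a single connection and echo its data back.
    conn, _addr = srv.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)        # echo the "speech packet" back

t = threading.Thread(target=serve_one)
t.start()

# ClientSocket role: connect, send one packet, read the reply.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"frame-0001")
    reply = cli.recv(1024)
t.join()
srv.close()
```

In the real application the payload would be a block of coded speech samples rather than a text tag, but the connect/send/receive pattern is the same.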
The second problem was which coding technique to use. The LPC vocoder was our first choice, and we used Matlab to simulate it, but the result was not very good: it took a long time to compute and the quality was poor. We therefore chose Linear Pulse Code Modulation (PCM) instead, which gave better quality. Some discussion of LPC is presented in this report.
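Linear PCM simply quantizes each sample uniformly to a fixed number of bits. A minimal sketch of the encode/decode round trip, assuming samples normalized to [-1.0, 1.0) and an illustrative 8-bit depth (the report does not state the bit depth actually used):

```python
def pcm_encode(x: float, bits: int = 8) -> int:
    """Uniformly quantize a sample in [-1.0, 1.0) to a signed integer code."""
    levels = 1 << (bits - 1)              # e.g. 128 for 8-bit
    code = int(round(x * (levels - 1)))   # scale to the integer range
    return max(-levels, min(levels - 1, code))

def pcm_decode(code: int, bits: int = 8) -> float:
    """Map the integer code back to a sample value."""
    levels = 1 << (bits - 1)
    return code / (levels - 1)

samples = [0.0, 0.5, -0.25, 0.999]
codes = [pcm_encode(s) for s in samples]
decoded = [pcm_decode(c) for c in codes]
```

The reconstruction error is bounded by half a quantization step, which is why linear PCM sounds clean at the cost of a high bit rate compared with LPC.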
Linear Predictive Coding
Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques, and one of the most useful methods for encoding good-quality speech at a low bit rate. It provides extremely accurate estimates of speech parameters and is relatively efficient to compute. Here we describe the basic ideas behind linear prediction and discuss some of the issues involved in its use. The following figure shows a simplified block diagram of this model.
Basic Principles
LPC starts with the assumption that the speech signal is produced by an excitation at the end of a tube. The glottis (the space between the vocal cords) produces the excitation, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances, which are called formants.
LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining excitation. The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.
The numbers which describe the formants and the residue can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: use the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, resulting in speech.
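The analysis/synthesis loop just described follows directly from the predictor difference equation: inverse filtering subtracts the prediction from each sample to leave the residue, and synthesis adds the prediction back. A sketch with made-up coefficients (a real coder would estimate them per frame, as described below):

```python
def inverse_filter(samples, a):
    """Analysis: residue e[n] = x[n] - sum_k a[k] * x[n-1-k]."""
    p = len(a)
    residue = []
    for n in range(len(samples)):
        pred = sum(a[k] * samples[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        residue.append(samples[n] - pred)
    return residue

def synthesize(residue, a):
    """Synthesis: rebuild x[n] = e[n] + sum_k a[k] * x[n-1-k]."""
    p = len(a)
    out = []
    for n in range(len(residue)):
        pred = sum(a[k] * out[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        out.append(residue[n] + pred)
    return out

a = [0.9, -0.2]                   # illustrative predictor coefficients
x = [1.0, 0.5, 0.25, 0.1, 0.05]
e = inverse_filter(x, a)
y = synthesize(e, a)              # reconstructs x exactly
```

With the full residue available, synthesis is an exact inverse of analysis; compression comes from replacing the residue with a compact parametric description (pitch and intensity).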
Because speech signals vary with time, this process is done on short chunks of the speech signal, which are called frames. Usually 30 to 50 frames per second give intelligible speech with good compression.
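Splitting the signal into frames can be sketched as follows. The 20 ms frame length and 8 kHz sampling rate are illustrative assumptions, not values stated in the report:

```python
def frame_signal(samples, frame_len, hop=None):
    """Split a sample sequence into (possibly overlapping) frames.

    frame_len: samples per frame (e.g. 160 for 20 ms at 8 kHz)
    hop: step between frame starts; defaults to frame_len (no overlap)
    """
    if hop is None:
        hop = frame_len
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

signal = list(range(1000))          # stand-in for 125 ms of 8 kHz audio
frames = frame_signal(signal, 160)  # 20 ms frames -> 50 frames per second
```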
Estimating the Formants
The basic problem of the LPC system is to determine the formants from the speech signal. The basic solution is a difference equation, which expresses each sample of the signal as a linear combination of previous samples. Such an equation is called a linear predictor, which is why this is called Linear Predictive Coding.
The coefficients of the difference equation (the prediction coefficients) characterize the formants, so the LPC system needs to estimate these coefficients. The estimate is done by minimizing the mean-square error between the predicted signal and the actual signal.
This is a straightforward problem, in principle. In practice, it involves (1) the computation of a matrix of coefficient values, and (2) the solution of a set of linear equations. Several methods (autocorrelation, covariance, recursive lattice formulation) may be used to assure convergence to a unique solution with efficient computation.
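The autocorrelation method mentioned above can be sketched in a few lines: compute the autocorrelation sequence of a frame, then solve the resulting normal equations with the Levinson-Durbin recursion. The test signal below is a synthetic first-order process chosen for illustration:

```python
def autocorr(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (coefficients a[1..p], error)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # remaining prediction error
    return a[1:], err

# Synthetic test: x[n] = 0.5 * x[n-1], so the true predictor is a1 = 0.5.
x = [1.0]
for _ in range(200):
    x.append(0.5 * x[-1])
coefs, err = levinson_durbin(autocorr(x, 2), 2)
```

The recursion guarantees a stable synthesis filter and costs only O(p^2) operations for a p-th order predictor, which is why it is the standard choice here.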
Encoding the Source
If the predictor coefficients are accurate, and everything else works right, the speech signal can be inverse filtered by the predictor, and the result will be the pure source (excitation). For such a signal, it's fairly easy to extract the frequency and amplitude and encode them.
However, some consonants are produced with turbulent airflow, resulting in a hissy sound (fricatives and stop consonants). Fortunately, the predictor equation doesn't care if the sound source is periodic (excitation) or chaotic (hiss).
This means that for each frame, the LPC encoder must decide if the sound source is excitation or hiss; if excitation, estimate the frequency; in either case, estimate the intensity; and encode the information so that the decoder can undo all these steps.
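The per-frame decisions above can be sketched with two classic heuristics: a zero-crossing count to separate periodic excitation from hiss, and the autocorrelation peak to estimate pitch. The sampling rate, pitch range, and voicing threshold below are illustrative assumptions, not the report's values:

```python
import math

def zero_crossings(frame):
    """Count sign changes; hissy (unvoiced) frames cross zero far more often."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)

def estimate_pitch(frame, fs, f_lo=50.0, f_hi=400.0):
    """Pick the autocorrelation peak within a plausible pitch-lag range."""
    lo, hi = int(fs / f_hi), int(fs / f_lo)
    best_lag = max(range(lo, min(hi, len(frame) - 1)),
                   key=lambda k: sum(frame[n] * frame[n - k]
                                     for n in range(k, len(frame))))
    return fs / best_lag

fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]  # 100 Hz tone
voiced = zero_crossings(frame) < len(frame) // 4   # illustrative threshold
pitch = estimate_pitch(frame, fs)
```

Intensity can then be taken as the frame's RMS energy; together with the voiced/unvoiced flag and pitch, this is all the decoder needs to regenerate a source signal for the filter.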
The Problems:
1. The tube isn't just a tube
It may seem surprising that the signal can be characterized by such a simple linear predictor. It turns out that, in order for this to work, the tube must not have any side branches. (In mathematical terms, side branches introduce zeros, which require much more complex equations.)
For ordinary vowels, the vocal tract is well represented by a single tube. However, for nasal sounds, the nose cavity forms a side branch. Theoretically, therefore, nasal sounds require a different and more complicated algorithm. In practice, this difference is partly ignored and partly dealt with during the encoding of the residue (see below).
2. The excitation isn't just white noise
Unfortunately, things are not so simple. One reason is that there are speech sounds which are made with a combination of excitation and hiss sources (for example, the initial consonants in "this zoo" and the middle consonant in "azure"). Speech sounds like this will not be reproduced accurately by a simple LPC encoder.
Another problem is that, inevitably, any inaccuracy in the estimation of the formants means that more speech information gets left in the residue. The aspects of nasal sounds that don't match the LPC model (as discussed above), for example, will end up in the residue. There are other aspects of the speech sound that don't match the LPC model; side branches introduced by the tongue positions of some consonants, and tracheal (lung) resonances are some examples.
Therefore, the residue contains important information about how the speech should sound, and LPC synthesis without this information will result in poor-quality speech. For the best quality, we could simply transmit the residue signal itself, and the LPC synthesis would sound great; that is the idea behind Code-Excited Linear Prediction (CELP). Since it is somewhat more complex, we did not implement it.
TCP/IP (Transmission Control Protocol/Internet Protocol) is the basic communication language, or protocol, of the Internet. TCP/IP is a two-layer program. The higher layer, the Transmission Control Protocol, manages the assembling of a message or file into smaller packets that are transmitted over the Internet and received by a TCP layer that reassembles the packets into the original message. The lower layer, the Internet Protocol, handles the address part of each packet so that it gets to the right destination.
TCP/IP uses the client/server model of communication in which a computer user
(a client) requests and is provided a service (such as sending a Web page)
by another computer (a server) in the network. TCP/IP communication is primarily
point-to-point, meaning each communication is from one point (or host computer)
in the network to another point or host computer. Our software works under the client/server model.
Client-Server Model
The fundamental pattern of activity on the Internet consists of one program requesting another program to provide a service. The two programs may be on the same network or on different networks. The requesting program is called the client and the program providing the service is called the server.
The client's process is typically simple compared to the server's, as illustrated in the following figure. The client simply interfaces with the user (by way of a device driver) and makes a request to the server. The request is made on a "well known" port that only the desired server monitors.
The general flow of activities for the server is divided between a master function and a slave function. The master function receives the request and places a work request in a queue for the slave function; the slave function then does all the work, including the response to the client. The server performs the following master functions:
1. Opens a fixed port for the service;
2. Waits for a new client request;
3. Places any received client request in the queue for the slave and returns to wait for a new client request.
The server performs the following slave functions:
1. Performs the service requested;
2. Builds and sends a message to the client.
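The master/slave split above maps naturally onto a listener thread feeding a queue of worker threads. A minimal Python sketch (the project itself uses Delphi; the port number, message format, and single-worker setup here are arbitrary simplifications):

```python
import queue
import socket
import threading

HOST, PORT = "127.0.0.1", 50321   # "well known" port for this sketch
work_q = queue.Queue()

def master(srv, n_requests):
    """Master: accept connections and queue them as work for the slave."""
    for _ in range(n_requests):
        conn, _addr = srv.accept()    # 2. wait for a new client request
        work_q.put(conn)              # 3. place the request in the queue
    work_q.put(None)                  # sentinel: no more clients

def slave():
    """Slave: perform the service and answer the client."""
    while True:
        conn = work_q.get()
        if conn is None:
            break
        with conn:
            data = conn.recv(1024)            # 1. perform the service
            conn.sendall(b"echo:" + data)     # 2. build and send the reply

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind((HOST, PORT))            # 1. open a fixed port for the service
srv.listen(5)
threading.Thread(target=master, args=(srv, 1), daemon=True).start()
worker = threading.Thread(target=slave)
worker.start()

# One client request against the running server.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"hello")
    reply = cli.recv(1024)
worker.join()
srv.close()
```

Separating accept from service this way keeps the master free to take new connections while the slave is still answering an earlier one.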
This software was tested on different Windows systems, such as Windows 2000 and Windows NT 4.0. When we tested it on a LAN, the sound quality was clear because a LAN has high bandwidth. When we tested it over a modem connection to the Internet, the sound had some delay since the bandwidth is lower. If you use this software behind a firewall, it may not work well. In the future, more sophisticated coding techniques could be used to improve the sound quality.