Chapter 1 Voice Over IP (VoIP)


The terms Voice-over-IP (VoIP) and IP telephony have a general meaning, covering all the technologies that allow bidirectional real-time transmission of voice over an IP-based network.

1.1 Advantages of VoIP

The Public Switched Telephone Network (PSTN) is a dedicated infrastructure that creates an end-to-end bidirectional 64 kbps connection upon request, mainly for voice transmission. An IP network, on the other hand, is a packet-switched network: capacity is shared among flows instead of being reserved per call, so resources are used more efficiently. Moreover, two separate infrastructures (one for data and one for voice) are no longer needed, which lowers overall costs.

Furthermore, using an IP network allows greater mobility of the telephone device: a VoIP telephone number is not geographically bound, so a user can be reached under the same number even from different countries. In addition, the connection to the Internet can be obtained through a wireless network, making a VoIP terminal “mobile” in a way that is transparent to the other party.

From an enterprise point of view, VoIP allows easy integration with the information system, unifying the management of documents, messages and phone calls inside a single terminal, which can be a desktop PC, a laptop or a PDA.

VoIP also allows integration of telephony with the Web. For example, companies could offer a link on their website allowing visitors to call their customer service directly from the web browser.

1.2 Issues of VoIP

On the other hand, one obstacle to deploying IP telephony derives from its real-time nature: voice data must be delivered in a timely manner, whereas an IP-based network is a best-effort network, meaning that there is no guarantee on whether a packet will be delivered or with how much delay.

In fact, packets from different sources and different kinds of traffic are queued and routed together; consequently, the transport of voice data is influenced by other flows on the same network segment. This is not a problem for bulk data transfers, which do not need real-time guarantees and for which retransmission techniques are employed in case of packet loss. By contrast, high delay, high packet loss or high variance in packet arrival times (jitter) can degrade the quality of VoIP communication to the point of making a phone call unfeasible.

Support for Quality of Service (QoS) policies has been added to the IP protocol and to the underlying network layers. Unfortunately, those policies are mostly unimplemented today. Nevertheless, the ever-growing availability of bandwidth and the increasing speed of network devices have made VoIP communication possible even without QoS guarantees.

1.3 Network performance requirements

Network performance is not the main subject of this work. However, the background of VoIP requirements is important to understand the guidelines of multimedia protocol design, in particular the importance of minimizing network hops, in order to reduce connection latency and jitter.


1.3.1 Delay

In real-time bidirectional communications, keeping the end-to-end delay small is very important.

Excessive delay in human voice communications causes two undesirable side-effects:

• Echo: it is caused by the reflections of the speaker’s voice from the far-end telephone equipment back into the speaker’s ear. Echo becomes a significant problem when the round-trip delay becomes greater than 50 milliseconds. As echo is perceived as a significant quality problem, voice-over-packet systems must address the need for echo control and implement some means of echo cancellation.

• Talker overlap (or hello-effect): it is the problem of one talker stepping on the other talker’s speech; it becomes significant if the one-way delay becomes greater than 250 milliseconds.

The end-to-end delay is the sum of delays derived from multiple sources:

1. Accumulation Delay (or Algorithmic Delay): it is caused by the need to collect a frame of voice samples to be processed by the voice coder. It is related to the type of voice coder used and varies from a single sample time (0.125 milliseconds at the 8 kHz sampling rate of telephone speech) to many milliseconds;

2. Processing Delay: it is caused by the actual process of encoding and collecting the encoded samples into a packet for transmission over the packet network. The encoding delay is a function of both the processor execution time and the type of algorithm used. Often, multiple voice-coder frames will be collected in a single packet to reduce the packet network overhead;

3. Network Delay: it is caused by the physical medium and protocols used to transmit the voice data and by the buffers used to remove packet jitter on the receive side. Network delay is a function of the capacity of the links in the network and the processing that occurs as the packets transit the network. This delay can be a significant part of the overall delay, as packet-delay variations can be as high as 70 to 100 milliseconds in some frame-relay and IP networks;

4. Jitter Reduction Delay: it is caused by the procedures used to reduce the effects of jitter, described in the next paragraph.
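The delay budget listed above can be sketched as a simple sum checked against the two thresholds quoted earlier in this section (50 ms round trip for echo, 250 ms one way for talker overlap). This is only an illustration: the function name, the component values in the usage example and the assumption that the round-trip delay is twice the one-way delay are hypothetical and do not come from a measured system.

```python
# Thresholds quoted in this section, in milliseconds.
ECHO_ROUND_TRIP_MS = 50.0
TALKER_OVERLAP_ONE_WAY_MS = 250.0


def delay_report(accumulation_ms, processing_ms, network_ms, jitter_buffer_ms):
    """Sum the four one-way delay components and flag the two side effects.

    Assumption: the path is symmetric, so round trip = 2 * one-way delay.
    """
    one_way = accumulation_ms + processing_ms + network_ms + jitter_buffer_ms
    round_trip = 2 * one_way
    return {
        "one_way_ms": one_way,
        "round_trip_ms": round_trip,
        "needs_echo_control": round_trip > ECHO_ROUND_TRIP_MS,
        "talker_overlap": one_way > TALKER_OVERLAP_ONE_WAY_MS,
    }


# Hypothetical component values, for illustration only.
report = delay_report(accumulation_ms=30, processing_ms=10,
                      network_ms=60, jitter_buffer_ms=40)
print(report)
```

With these sample values the one-way delay stays under the talker-overlap threshold, while the round-trip delay is already well beyond the point where echo control is needed.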

1.3.2 Jitter

The variable inter-packet timing caused by the network a packet traverses is called jitter. Bursty traffic patterns can cause high jitter.

Removing jitter requires collecting packets and holding them long enough to allow the slowest packets to arrive in time to be played at the correct time. This causes additional delay. For this reason, jitter removal algorithms must choose the size of the so-called playback buffer carefully. A large playback buffer allows late audio packets to be played, but adds too much delay; a small one keeps the delay low, but many audio packets arrive too late to be played and are treated as lost.
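The trade-off can be made concrete with a small simulation of a fixed playback buffer. This is a simplified sketch, not an algorithm taken from a real implementation: it assumes the playout deadline is measured from the delay of the fastest packet, and both the function name and the sample delays are hypothetical.

```python
def playout_stats(network_delays_ms, buffer_ms):
    """Return (fraction of packets discarded as late, playout deadline in ms).

    Assumption: every packet is played `buffer_ms` after the delay of the
    fastest packet; a packet whose delay exceeds that deadline is discarded.
    """
    deadline = min(network_delays_ms) + buffer_ms
    late = sum(1 for d in network_delays_ms if d > deadline)
    return late / len(network_delays_ms), deadline


# Hypothetical one-way network delays (ms) with two late outliers.
delays = [20, 22, 25, 21, 60, 23, 24, 95, 22, 26]

for buffer_ms in (10, 50, 100):
    loss, deadline = playout_stats(delays, buffer_ms)
    print(f"buffer={buffer_ms:3d} ms  playout delay={deadline:3d} ms  "
          f"late packets={loss:.0%}")
```

A larger buffer lowers the share of discarded packets, at the price of a playout delay that grows by the same amount.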

1.3.3 Bandwidth

Bandwidth requirements depend mainly on the chosen codec. A codec (COder/DECoder) is an algorithm used to encode digital audio/video data before sending it through the network. Codecs are useful to achieve good audio quality while keeping the bitrate low. This can be achieved by dynamic range compression (as in G.711), but also with more complex compression techniques based on knowledge and modeling of the audio source, i.e. the human vocal apparatus (as in the GSM codec).

The effective bandwidth requirement for a particular codec is higher than the bitrate of the codec. In fact, the overhead of all the involved protocols (RTP, UDP, IP and the layer 2 protocol) must be considered, in addition to the fact that the required bandwidth doubles in the case of a bidirectional conversation.

In Table 1.1 the most common audio codecs are listed, along with bitrate and Ethernet bandwidth requirement in the case of bidirectional communication and a packetization time of 30 ms.

Codec      Bitrate   Ethernet bandwidth   Notes
           (kbps)    (kbps)
G.711      64.0      158.93               µ-law or A-law PCM, used by PSTN
iLBC       13.3      57.53                Internet Low Bit Rate Codec, a royalty-free narrowband speech codec developed by Global IP Sound
GSM        12.2      55.33                Used by the GSM network
G.729      8.0       46.94                High quality low bitrate codec
G.723.1a   5.3       41.53                Very low bitrate codec

Table 1.1: Speech codecs
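The Ethernet figures in Table 1.1 can be approximated with a short calculation. The text does not state the per-packet overhead, so the breakdown used here is an assumption: 40 bytes of IPv4, UDP and RTP headers plus 18 bytes of Ethernet header and frame check sequence, with preamble and inter-frame gap not counted. It was chosen because it reproduces the table values to within rounding. The function name is hypothetical.

```python
# Assumed per-packet overhead (not stated in the text).
IP_UDP_RTP_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP fixed headers
ETHERNET_BYTES = 14 + 4          # Ethernet header + frame check sequence


def ethernet_bandwidth_kbps(codec_kbps, packet_ms=30.0, bidirectional=True):
    """Codec bitrate plus per-packet header overhead, in kbps."""
    overhead_bits = (IP_UDP_RTP_BYTES + ETHERNET_BYTES) * 8
    # One packet is sent every packet_ms; bits per millisecond equal kbps.
    overhead_kbps = overhead_bits / packet_ms
    one_way = codec_kbps + overhead_kbps
    return 2 * one_way if bidirectional else one_way


for name, kbps in [("G.711", 64.0), ("iLBC", 13.3), ("GSM", 12.2),
                   ("G.729", 8.0), ("G.723.1a", 5.3)]:
    print(f"{name:9s} {kbps:5.1f} kbps -> {ethernet_bandwidth_kbps(kbps):7.2f} kbps")
```

Under these assumptions the header overhead is about 15.5 kbps per direction regardless of the codec, which is why low-bitrate codecs save less bandwidth on the wire than their nominal bitrate suggests.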

1.3.4 Packet Loss

Another network performance parameter that strongly influences the quality of the communication is the percentage of packets that get lost. Packet loss is caused by transmission errors and congested routers; in addition, packets that arrive at the receiver too late for playback have the same effect as lost ones.

Generally speaking, packet loss of up to 2% has little effect on the quality of the conversation, whereas with a packet loss of 5% the degradation of perceived quality is significant.

Perceived quality in case of packet loss, however, is dependent on the employed codec and on the algorithm used to reduce the effect of packet loss (called Packet Loss Concealment algorithm [47]).

1.4 Signaling protocols

The first VoIP applications were proprietary solutions. In order to allow interoperability between products of different vendors, both the ITU-T (International Telecommunication Union – Telecommunication Standardization Sector) and the IETF (Internet Engineering Task Force) have been working on the standardization of the protocols to be used in IP telephony.


The first widely adopted standard was H.323 (by ITU-T), an umbrella recommendation that defines the protocols to provide audio-visual communication sessions on any packet network. It was first published in 1996; features and improvements were standardized in later revisions, up to the latest specification [40] in 2003.

H.323 is a very complex specification, which takes into consideration signaling between endpoints and network devices, media transport protocols (RTP and RTCP, see 1.5.1), audio and video codecs, and data transfer protocols.

The IETF, on the other hand, worked on the specification of the Session Initiation Protocol (SIP [72]), an HTTP-like signaling protocol designed with simplicity and flexibility in mind. Although SIP was published later (1999) than H.323, it is now more widely deployed, mainly because it is simpler than H.323.

Only SIP is considered in this work; Chapter 2 is dedicated to it.

1.5 Media transport protocols

From the considerations about network performance parameters, it is clear that media traffic is not too sensitive to packet loss, but is typically very sensitive to delay. For this reason, UDP is a better choice than TCP for conveying media packets; in fact, TCP's automatic retransmission of lost packets and in-order delivery would add delay while waiting for retransmitted data, degrading perceived quality.

Both SIP and H.323 rely on the Real-time Transport Protocol (RTP) for media stream transport.

1.5.1 RTP

The Real-time Transport Protocol (RTP) defines a standardized packet format for delivering audio and video over the Internet. It was developed by the Audio-Video Transport Working Group of the IETF and first published in 1996 as RFC 1889, which was made obsolete in 2003 by RFC 3550 [34].


RTP flows over UDP and provides the following services:

• payload-type identification: indication of what kind of content is being carried; actually less useful when used with SIP or H.323, because the employed codec is specified by other means;

• sequence numbering: PDU (Protocol Data Unit) sequence number, used for re-ordering of packets in the receiving application;

• time stamping: presentation time of the content being carried in the PDU, necessary to synchronize the coder and the decoder;

• multicast support.
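The fields behind the first three services can be seen by decoding the 12-byte fixed header that RFC 3550 places at the start of every RTP packet. The sketch below is a minimal illustration: the function name and the sample packet are hypothetical, and CSRC entries, header extensions and padding are not handled.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte RTP fixed header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("an RTP packet carries at least a 12-byte header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit synchronization source identifier.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,      # payload-type identification
        "sequence_number": seq,         # sequence numbering
        "timestamp": timestamp,         # time stamping
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, payload type 0 (G.711 µ-law),
# sequence number 1000, timestamp 160, followed by 160 payload bytes.
example = struct.pack("!BBHII", 0x80, 0, 1000, 160, 0x12345678) + b"\x00" * 160
header = parse_rtp_header(example)
print(header)
```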

This protocol guarantees neither timely nor in-order delivery. Moreover, flow control, congestion control and QoS guarantees are not supported; they must be provided by other means.

RFC 3711 [5] defines the Secure Real-time Transport Protocol (SRTP) profile (actually an extension to the RTP Profile for Audio and Video Conferences), which can be used to provide confidentiality, message authentication, and replay protection for audio and video streams being delivered.

1.5.2 RTCP

Besides RTP, another protocol is used to convey streaming information: the RTP Control Protocol (RTCP), defined in RFC 3550 [34].

RTCP provides periodic out-of-band control information for an RTP flow. The primary function of RTCP is to provide feedback on the quality of service being provided by RTP.

It gathers statistics on a media connection, such as bytes sent, packets sent, lost packets, jitter, feedback and round-trip delay. An application may use this information to improve the quality of service, for example by limiting the flow or by using a low-compression codec instead of a high-compression one.
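As an example of one of these statistics, RFC 3550 defines the interarrival jitter reported by RTCP as a running estimate that is updated for every received packet with a gain of 1/16. The sketch below follows that formula; the function name and the sample times are hypothetical, and times are given in milliseconds for readability, whereas RFC 3550 expresses them in RTP timestamp units.

```python
def interarrival_jitter(send_times, recv_times):
    """Running interarrival jitter estimate as defined in RFC 3550.

    For consecutive packets i-1 and i:
        D = (R_i - R_{i-1}) - (S_i - S_{i-1})
        J = J + (|D| - J) / 16
    Both lists must use the same time unit.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = ((recv_times[i] - recv_times[i - 1])
             - (send_times[i] - send_times[i - 1]))
        jitter += (abs(d) - jitter) / 16.0
    return jitter


# Hypothetical samples: packets sent every 20 ms.
steady = interarrival_jitter([0, 20, 40, 60], [50, 70, 90, 110])
delayed = interarrival_jitter([0, 20], [50, 86])
print(steady, delayed)
```

A constant network delay yields zero jitter, while a packet that arrives 16 ms later than its spacing implies moves the estimate by only 1 ms, which shows how the 1/16 gain smooths out isolated spikes.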