Microsoft Real-Time Communications: Protocols and Technologies

Article
09/11/2009

By Ross Carter

Abstract

This paper is written for IT professionals and developers interested in understanding the concepts, protocols, and technologies of real-time communications. It describes protocols such as the Internet Engineering Task Force (IETF) Session Initiation Protocol (SIP), SIP Instant Messaging and Presence Language Extensions (SIMPLE), and Real-time Transport Protocol (RTP). Microsoft uses these protocols and related technologies to provide a real-time communications (RTC) platform for corporate multi-modal communication, which includes voice and video communication, instant messaging, application sharing, and collaboration. Throughout this paper, voice communication and the way the Microsoft® Windows® XP operating system supports it are used to illustrate how the underlying technologies work.

Introduction

The Microsoft® real-time communications platform is based on Microsoft’s commitment to supporting industry communication standards. Windows XP supports Internet Engineering Task Force (IETF) Session Initiation Protocol (SIP), SIP Instant Messaging and Presence Language Extensions (SIMPLE), and Real-time Transport Protocol (RTP). These protocols and associated technologies are designed to address the specific needs of real-time communication over a packet-switched network, whether the communication takes the form of voice, video, or instant messaging.

This paper briefly describes the voice communication process used on a circuit-switched voice network and then focuses on RTC voice communications to illustrate how the underlying technologies are used to enable real-time communications over a packet-switched network.

RTC Call Processing

In the process of transmitting real-time communications from one point to another, multiple steps are involved and various protocols are used. First, some type of signaling and call control is needed to establish, modify, and terminate a call. Within the public switched telephone network (PSTN), a circuit-switched network, Signaling System 7 (SS7) is used for call setup and termination. For packet-based networks, both the SIP and H.323 protocols provide call control. For information about SIP, see “Session Initiation Protocol (SIP)” later in this paper. For information about H.323, see “Telephony Integration and Conferencing” in the Windows 2000 Internetworking Guide of the Microsoft® Windows® 2000 Server Resource Kit.

After the calling session is established, the audio or video input needs to be sampled and converted to a digital format. Next, the sampled data is encapsulated into Real-time Transport Protocol (RTP) packets. RTP is specifically designed for the needs of real-time communication over a packet-based network.

Then, the RTP packet is encapsulated into a network transport protocol, which is most often the User Datagram Protocol (UDP). Alternatively, the Transmission Control Protocol (TCP) can be used for encapsulation; however, because TCP guarantees delivery by retransmitting lost packets, the additional time needed for retransmission can add enough latency (the time between sending and receiving packets) to make the received audio unintelligible. Throughout the transmission of the RTP packets, the RTP Control Protocol (RTCP) is used to monitor the quality of an RTP session. For information about RTP and RTCP, see “RTP and RTCP” later in this paper.

Next, the network transport protocol, UDP or TCP, is encapsulated into an IP packet, which is then encapsulated into the link layer protocol — Ethernet, for example. The link layer packet is then transmitted to the destination computer(s). Figure 1 shows the encapsulation process, from the encapsulation of the RTP packet to encapsulation of the link layer packet.

Figure 1: Real-Time Communication Protocol Encapsulation
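The nesting described above can be illustrated with a small size calculation. This is only a sketch: the header sizes are the usual minimum sizes for each protocol (they are not given in this paper), and the function name encapsulate and the 160-byte payload are assumptions chosen for illustration.

```python
# Minimum header sizes in bytes. These are assumptions for illustration:
# no IP options, no RTP CSRC entries, and no Ethernet frame check sequence.
HEADER_BYTES = {
    "RTP": 12,
    "UDP": 8,
    "IPv4": 20,
    "Ethernet": 14,
}

def encapsulate(payload_bytes):
    """Return (layer, cumulative size) pairs, innermost layer first."""
    sizes = []
    total = payload_bytes
    for layer in ("RTP", "UDP", "IPv4", "Ethernet"):
        total += HEADER_BYTES[layer]
        sizes.append((layer, total))
    return sizes

# 20 ms of 64 Kbps audio is 160 bytes of payload.
for layer, size in encapsulate(160):
    print(f"{layer:<8} {size} bytes")
```

Under these assumptions, each 160-byte audio payload grows by 54 bytes of headers by the time it reaches the link layer, which is why header overhead matters when small voice packets are sent frequently.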

Session Initiation Protocol

Session Initiation Protocol (SIP), which is similar to the HyperText Transfer Protocol (HTTP), is a text-based application-layer signaling and call control protocol. SIP is used to create, modify, and terminate SIP sessions. It supports both unicast and multicast communication. Because SIP is text-based, implementation, development, and debugging are easier than with H.323.

Note: Windows Messenger is a SIP-based application. Windows XP does not support SIP through Telephony Application Programming Interface (TAPI). For more information about Windows Messenger, see “Using Windows Messenger” in Help and Support Center for Windows XP Professional and “Configuring Telephony and Conferencing” in the Microsoft® Windows® XP Professional Resource Kit Documentation.

SIP Components

The main components of a SIP environment fall into two primary categories: SIP servers and SIP user agents.

SIP Servers

There are three types of SIP servers: proxy, registrar, and redirect. Each type of server performs a different function, as noted in Table 1. The specific function the server performs determines which SIP requests it processes.

Table 1 SIP Servers

SIP Server	Function
Proxy server	Acts as an intermediary between a SIP user agent client and a SIP user agent server. The proxy server performs the functions of either a SIP user agent client or a SIP user agent server, depending upon the direction of the communication between client and server. A proxy server can simply forward the SIP request or modify it before sending it on.
Registrar server	Receives REGISTER requests, which contain both the IP address and the SIP address — the Uniform Resource Locator (URL) — of the user agent. This allows the registrar server to keep track of the location of user agents from which the registrar server has received REGISTER requests.
Redirect server	Accepts initiation, in the form of a SIP INVITE request, of a SIP session from the calling user agent, obtains the correct SIP address of the called user agent, and replies to the calling user agent with the correct SIP address. The calling user agent then uses the correct SIP address to directly initiate a SIP session with the called user agent.

SIP servers — proxy, registrar, and redirect — can be developed as separate applications or as a single application combining the functionality of all the servers. The combination of a registrar and proxy server is sometimes referred to as a rendezvous server.

SIP User Agents

Table 2 lists the two types of SIP user agents and what they do.

Table 2 SIP User Agents

SIP User Agent	Function
User agent client	Initiates SIP requests
User agent server	Receives SIP requests

Each user agent is associated with a SIP address.

SIP Call Flow

The call flow for SIP sessions depends upon whether the SIP session is established directly between SIP user agents or whether a SIP server (proxy, registrar, or redirect) is located between SIP user agents.

Figure 2 shows the typical call flow between two user agents, with each step noted in parentheses. First, user agent A sends out an INVITE request to initiate a call. User agent B then replies with the Trying response code (100), indicating that the call request is being processed. User agent B then replies with the OK response code (200), indicating that user agent B has accepted the call. User agent A then replies to user agent B with an acknowledgement (ACK) request, indicating that user agent A received the final response code from user agent B. The real-time data is then encapsulated in RTP packets (as described in “RTP and RTCP” later in this paper) and sent between user agent A and user agent B. Either user agent A or user agent B can then send a BYE request, indicating that the user agent wants to terminate the session. The user agent that receives the BYE request then sends an OK response code (200) to indicate that the request has succeeded.

Figure 2: User Agent SIP Call Flow

Figure 3 shows the typical call flow when a proxy server is in the path between two user agents. The proxy server essentially acts as a communication midpoint, functioning as both a user agent server and a user agent client. When acting as a user agent server, the proxy receives SIP requests and forwards them to the destination user agent. When acting as a user agent client, the proxy receives SIP responses and forwards them to the destination user agent.

Figure 3: Proxy Server SIP Call Flow

Figure 4 illustrates the typical call flow between a user agent and a registrar server. The registrar server accepts REGISTER requests from the user agent, indicating the addresses at which the user agent can be reached. A registrar server is typically co-located with a proxy or redirect server.

Figure 4: Registrar Server SIP Call Flow

Figure 5 shows the typical call flow when a redirect server is between two user agents. User agent A sends out an INVITE request to initiate a call. The redirect server then replies with the Moved response code (302), indicating that user agent B has temporarily moved. User agent A replies with an ACK request, indicating that user agent A received the response code from the redirect server. User agent A then sends another INVITE request directly to the newly acquired address for user agent B.

Figure 5: Redirect Server SIP Call Flow
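The registrar and redirect flows can be modeled with a toy in-memory sketch. The class names Registrar and RedirectServer, their methods, and the userb addresses are hypothetical and exist only for this illustration; a real server exchanges SIP messages over the network rather than calling methods.

```python
class Registrar:
    """Toy location store that maps a SIP address to a current contact."""

    def __init__(self):
        self.bindings = {}

    def register(self, sip_address, contact):
        # A REGISTER request tells the registrar where the user agent
        # can currently be reached.
        self.bindings[sip_address] = contact
        return 200, "OK"


class RedirectServer:
    """Toy redirect server that answers INVITE requests from the registrar's data."""

    def __init__(self, registrar):
        self.registrar = registrar

    def invite(self, sip_address):
        contact = self.registrar.bindings.get(sip_address)
        if contact is None:
            return 404, "Not found", None
        # 302 tells the calling user agent to send a new INVITE
        # directly to the returned address.
        return 302, "Moved temporarily", contact


registrar = Registrar()
registrar.register("sip:userb@reskit.com", "sip:userb@172.16.20.54")
redirect = RedirectServer(registrar)
print(redirect.invite("sip:userb@reskit.com"))
```

After receiving the 302 response, user agent A would acknowledge it and repeat the INVITE against the returned contact address, as described for Figure 5.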

Sample SIP Architecture

To illustrate how communication is handled among SIP components and how SIP components can fit into a network environment, Figure 6 shows a sample SIP architecture.

Figure 6: Sample SIP Architecture

A. Datum Corporation has two SIP proxy servers that direct SIP requests between domains within the company. The SIP proxy server connected to the firewall handles all SIP messages sent to recipients outside the company and all messages sent to recipients within the company from outside. For example, a SIP INVITE message sent from a SIP client in A. Datum Corporation to a SIP client in Fabrikam, Inc. would be sent to the SIP proxy server in Fabrikam, Inc.

The SIP proxy server then forwards the SIP INVITE request to the destination SIP client computer, or the SIP IP phone, in the domain of the SIP proxy server in Fabrikam, Inc. For example, the SIP server in Fabrikam, Inc. might receive a SIP INVITE request sent with a SIP URL in the format of a global phone number. If the global phone number belongs to a SIP IP phone in Fabrikam, Inc., then the SIP INVITE request is forwarded directly to the SIP IP phone. On the other hand, if the global phone number belongs to a non-SIP phone, such as an analog phone, then the SIP INVITE request is forwarded to the SIP/PSTN gateway, which formats the SIP INVITE request for the PSTN. Using the global phone number, the organization’s private branch exchange (PBX) determines whether to route the call to an analog phone within the company, or to route it to the PSTN for an analog phone outside the company.

SIP Protocol

SIP messages are based on the standard Internet message format, as described in RFC 822, “Standard for the Format of ARPA Internet Text Messages,” which you can find on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/. SIP messages are either requests from a client to a server or responses from a server to a client.

Each SIP message has three parts, as shown in Table 3.

Table 3 SIP Message Parts

SIP Message Part	Definition
Start line	Contents depend on whether the message is a request or a response. Both request and response start lines include the SIP version. Request start lines also include the method type and the SIP address or general URL of the destination receiving the request. Response start lines also include the numeric status-code and reason-phrase that define the response to the request.
Headers	Contains the header type and associated variable(s).
Message body	Contains information provided by the Session Description Protocol (SDP), such as the description of the media capabilities for the SIP session.

SIP defines the values for the start line and headers. The Session Description Protocol (SDP) defines the values for the message body.
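The three-part layout in Table 3 can be shown with a short parsing sketch. The function name split_sip_message and the sample message are hypothetical (the sample is not a capture from Windows Messenger and omits headers a real request would carry), and the sketch ignores details such as header lines folded across several lines.

```python
def split_sip_message(raw):
    """Split a SIP message into (start_line, headers, body)."""
    # An empty line separates the headers from the message body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body


# Hypothetical, abbreviated request used only to exercise the function.
sample = (
    "INVITE sip:userb@reskit.com SIP/2.0\r\n"
    "From: <sip:usera@reskit.com>\r\n"
    "To: <sip:userb@reskit.com>\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Type: application/sdp\r\n"
    "\r\n"
    "v=0\r\n"
)

start_line, headers, body = split_sip_message(sample)
print(start_line)
print(headers)
print(repr(body))
```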

SIP Message Start Line

The syntax for the start line, as shown in Table 4, depends on whether the message is a request or a response.

Table 4 Start Line Syntax

Start Line	Syntax
Request	Method Request-URI SIP-Version
Response	SIP-Version Status-Code Response-Phrase
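As a minimal sketch of the two syntaxes in Table 4, the following function distinguishes a request start line from a response start line by checking whether the first item is the SIP version. The function name parse_start_line and the dictionary keys are assumptions made for this example.

```python
def parse_start_line(line):
    """Parse a SIP start line into a dictionary describing its items."""
    first, rest = line.split(" ", 1)
    if first.startswith("SIP/"):
        # Response: SIP-Version Status-Code Response-Phrase
        code, _, phrase = rest.partition(" ")
        return {
            "kind": "response",
            "version": first,
            "status_code": int(code),
            "phrase": phrase,
        }
    # Request: Method Request-URI SIP-Version
    uri, version = rest.rsplit(" ", 1)
    return {
        "kind": "request",
        "method": first,
        "request_uri": uri,
        "version": version,
    }


print(parse_start_line("INVITE sip:user@reskit.com SIP/2.0"))
print(parse_start_line("SIP/2.0 302 Moved temporarily"))
```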

Request Method

The first item in a request start line is the SIP method, a signaling command. The SIP methods, listed in Table 5, are defined in RFC 3261, the Internet draft “SIP Extensions for Presence,” and the Internet draft “SIP Extensions for Instant Messaging.”

Table 5 SIP Methods and Their Functions

SIP Method	Function
INVITE	Request to initiate a SIP session. INVITE is sent from the calling party to the called party.
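ACK	Confirms that the calling party has received the final response to its INVITE request, for example, the response indicating that the called party has accepted the call. ACK is sent from the calling party to the called party.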
OPTIONS	The calling party is requesting the called party to respond with its capabilities.
BYE	Request to terminate the session. BYE can be sent by the calling or the called party. It is not necessary for the party receiving the BYE to respond with a BYE.
CANCEL	Cancels pending requests.
REGISTER	The calling party wants to register its current location with a registrar server.
SUBSCRIBE	The calling party requests an update regarding the presence information of the called party.
NOTIFY	Conveys the updated status of self to subscribed parties of self.
MESSAGE	Used to send an instant message.

Request URI

The second item in a request start line is a Request-Uniform Resource Identifier (URI), which contains the URL of the called party. Generally, the URL is a SIP URL. A SIP URL can have one of several formats. Some of the supported formats are listed in Table 6. For a complete list of available SIP URL formats and syntax, see RFC 3261, “SIP: Session Initiation Protocol,” on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/.

Table 6 Partial List of SIP Request URL Formats

SIP URL Format	Explanation
sip:user@reskit.com	Basic SIP URL.
sip:user@reskit.com;transport=TCP	Basic SIP URL with the transport protocol designation of TCP. If the transport protocol is not designated it defaults to UDP.
sip:user@172.16.20.54	SIP URL with an IP address.
sip:+1-425-707-9796@reskit.com;user=phone	SIP URL with a global phone number.
sip:marketing@reskit.com;maddr=225.0.2.1;ttl=64	SIP URL with a multicast address, which overrides the previously specified host name. The time-to-live (TTL) value is set to 64 (0-255). The TTL must be set when using a multicast address and UDP as the transport protocol.
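The formats in Table 6 share one shape: a user and host, followed by optional semicolon-separated parameters. The sketch below splits a SIP URL along those lines. The function name parse_sip_url is hypothetical, and the sketch handles only the forms shown in Table 6 (it does not handle ports, passwords, or embedded headers); it applies the UDP default that the table describes.

```python
def parse_sip_url(url):
    """Split a SIP URL into user, host, and a dictionary of parameters."""
    scheme, _, rest = url.partition(":")
    if scheme != "sip":
        raise ValueError("not a SIP URL: " + url)
    address, *param_parts = rest.split(";")
    user, _, host = address.rpartition("@")
    params = {}
    for part in param_parts:
        key, _, value = part.partition("=")
        params[key] = value
    # Per Table 6, the transport defaults to UDP when it is not designated.
    params.setdefault("transport", "UDP")
    return {"user": user, "host": host, "params": params}


for url in (
    "sip:user@reskit.com;transport=TCP",
    "sip:+1-425-707-9796@reskit.com;user=phone",
    "sip:marketing@reskit.com;maddr=225.0.2.1;ttl=64",
):
    print(parse_sip_url(url))
```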

Request or Response Version

The final item in the request start line and the first item in the response start line is the SIP version, which is currently version 2.0.

The following sample SIP request message, taken from a Windows Messenger session, shows a typical SIP request line.

Response Status Code

There are six categories of status code: informational, success, redirection, client error, server error, and global failures. The left-most digit of the status code, as shown in Table 7, indicates the code’s category.

Table 7 SIP Response Status-Codes

Status Code	Response Category	Description
1xx	Informational	The request was received and is being processed.
2xx	Success	The requested action was successfully understood and accepted.
3xx	Redirection	Further action is needed in order to complete the request.
4xx	Client Error	The request contains incorrect syntax or cannot be fulfilled by the server.
5xx	Server Error	The server received the request but is unable to process it, though another server might be able to do so.
6xx	Global Failures	The server receiving the request is unable to process it and the request would fail on other servers also. Therefore, the request should not be forwarded.
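Because the category depends only on the left-most digit, a lookup can be sketched in a few lines. The names STATUS_CATEGORIES and status_category are assumptions made for this example.

```python
# Categories from Table 7, keyed by the left-most digit of the status code.
STATUS_CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
    6: "Global Failures",
}

def status_category(code):
    """Return the response category for a three-digit SIP status code."""
    if not 100 <= code <= 699:
        raise ValueError(f"not a SIP status code: {code}")
    return STATUS_CATEGORIES[code // 100]


print(status_category(180))
print(status_category(486))
```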

Response Phrase

All SIP response codes defined in SIP version 2.0, and their corresponding categories and response phrases, are listed in Table 8.

Table 8 SIP Response Status Codes and Phrases

Status Code	Response Category	Response Phrase
100	Informational	Trying
180	Informational	Ringing
181	Informational	Call is being forwarded
182	Informational	Queued
200	Success	OK
300	Redirection	Multiple choices
301	Redirection	Moved permanently
302	Redirection	Moved temporarily
303	Redirection	See other
305	Redirection	Use proxy
380	Redirection	Alternative service
400	Client Error	Bad request
401	Client Error	Unauthorized
402	Client Error	Payment required
403	Client Error	Forbidden
404	Client Error	Not found
405	Client Error	Method not allowed
406	Client Error	Not acceptable
407	Client Error	Proxy authentication required
408	Client Error	Request timeout
409	Client Error	Conflict
410	Client Error	Gone
411	Client Error	Length required
413	Client Error	Request entity too large
414	Client Error	Request-URI too large
415	Client Error	Unsupported media type
420	Client Error	Bad extension
480	Client Error	Temporarily not available
481	Client Error	Call leg/transaction does not exist
482	Client Error	Loop detected
483	Client Error	Too many hops
484	Client Error	Address incomplete
485	Client Error	Ambiguous
486	Client Error	Busy here
500	Server Error	Internal server error
501	Server Error	Not implemented
502	Server Error	Bad gateway
503	Server Error	Service unavailable
504	Server Error	Gateway time-out
505	Server Error	SIP version not supported
600	Global Failures	Busy everywhere
603	Global Failures	Decline
604	Global Failures	Does not exist anywhere
606	Global Failures	Not acceptable

SIP Message Headers

The start line of a SIP message is followed by one or more headers. The included headers depend upon whether the message is a response or a request. Headers are defined in RFC 3261, “SIP: Session Initiation Protocol,” which you can find on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/.

As shown in Table 9, headers fall into four categories: general, request, response, and entity. Headers in the general category can be used for both request and response messages.

Table 9 SIP Headers

General	Request	Response	Entity
Accept	Authorization	Allow	Content-encoding
Accept-encoding	Contact	Proxy-authenticate	Content-length
Accept-language	Hide	Retry-after	Content-type
Call-ID	Max-forwards	Server
Contact	Organization	Unsupported
Cseq	Priority	Warning
Date	Proxy-authorization	WWW-authenticate
Encryption	Proxy-require
Expires	Route
From	Require
Record-route	Response-key
Timestamp	Subject
To	User-agent
Via

The following sample SIP request, taken from a Windows Messenger session, highlights the SIP headers:

The body of a SIP message is defined by the Session Description Protocol (SDP).

Session Description Protocol

The Session Description Protocol (SDP) is an IETF standard for announcing and describing multimedia conferences. The SIP message body contains a session description, as defined by the SDP. A session description consists of three parts: a single session description, zero or more time descriptions, and zero or more media descriptions. The session description contains global attributes that apply to the whole conference or all media streams. Time descriptions contain conference start, stop, and repeat time information. Media descriptions contain details about a particular media stream. Table 10 lists the SDP types and associated description values that can be used in each of the three parts of an SDP message.

Table 10 SDP Descriptions

Session	Time	Media
Type	Value	Type	Value	Type	Value
v	Protocol version	t	Time the session is active	m	Media name and transport address
o	Owner/creator and session identifier	r	Zero or more repeat times	i	Media title
s	Session name			c	Connection information
i	Session information			b	Bandwidth information
u	Uniform Resource Identifier (URI) of description			k	Encryption key
e	E-mail address			a	Zero or more media attribute lines
p	Phone number
c	Connection information
b	Bandwidth information
z	Time zone adjustments
a	Zero or more session attribute lines
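Each SDP line has the form type=value, so a session description can be grouped into the three parts of Table 10 with a short loop. The sketch below is a simplification: the function name parse_sdp and the sample body are hypothetical (the sample is not taken from a Windows Messenger session), and every line that follows an m line is treated as belonging to that media description.

```python
def parse_sdp(body):
    """Group SDP lines into session, time, and media descriptions."""
    session, times, media = [], [], []
    for line in body.splitlines():
        if not line:
            continue
        kind, _, value = line.partition("=")
        if kind == "m":
            # Each m line starts a new media description.
            media.append([(kind, value)])
        elif media:
            media[-1].append((kind, value))
        elif kind in ("t", "r"):
            times.append((kind, value))
        else:
            session.append((kind, value))
    return session, times, media


# Hypothetical session description for a single audio stream.
sample_sdp = (
    "v=0\r\n"
    "o=usera 0 0 IN IP4 172.16.20.54\r\n"
    "s=session\r\n"
    "c=IN IP4 172.16.20.54\r\n"
    "t=0 0\r\n"
    "m=audio 49170 RTP/AVP 0 3\r\n"
    "a=rtpmap:0 PCMU/8000\r\n"
    "a=rtpmap:3 GSM/8000\r\n"
)

session, times, media = parse_sdp(sample_sdp)
print(session)
print(times)
print(media)
```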

The following sample SIP request message, taken from a Windows Messenger session, highlights the SIP message body:

Audio and Video Digitization and Compression

After the call has been set up with SIP, the data must be digitized and compressed. Audio and video data are inherently analog, so before they can be transmitted across a packet-based network, the analog waveforms must be converted into digital values. Once the data is in digital format, a software-based codec (coder-decoder) is used to compress the data, which allows for better network utilization and improved voice quality.

Audio Digitization

Converting audio signals to digital format involves several steps. First, the waveform, which represents the audio input, is sampled at regular intervals, as shown in Figure 7.

Figure 7: Periodic Waveform Sampling

The sampling rate (the frequency with which the samples are taken) depends upon the type of audio media being sampled and on the codec and associated coding algorithm used. For example, the PSTN, which uses the companded pulse code modulation (PCM) coding algorithm, has a voice sampling rate of 8 kHz, where one Hz equals one cycle per second.

Sampling rate is derived from the Nyquist criterion:

Fs > 2×BW
Fs = sampling frequency
BW = bandwidth of input analog voice signal

The Nyquist criterion states that the sampling frequency must be at least twice the highest frequency present in the signal being sampled. Because most analog voice signals fit approximately within a bandwidth of 4 kHz, a sampling rate of 8 kHz is deemed sufficient for most voice communications.

After sampling the data, the next step is to identify the interval into which each sampling of the waveform falls. This process, shown in Figure 8, is called quantization.

Figure 8: Quantization

After the data has been sampled and quantized, an 8-bit code word is assigned to each sample for transmission. Each 8-bit code word is subsequently transmitted through the network. Figure 9 depicts the transmission of the first three samples of the quantization shown in Figure 8.

Figure 9: Digital Signal Transmission

This yields the 64 Kbps of bandwidth (8 kHz × 8 bits per sample) required for each analog transmission (voice or data) over the circuit-switched PSTN.
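The sampling, quantization, and bit-rate steps can be sketched numerically. Note the simplification: this example uses uniform quantization intervals, whereas the companded PCM used on the PSTN spaces its intervals non-uniformly. The function names and the 1 kHz test tone are assumptions made for illustration.

```python
import math

SAMPLE_RATE_HZ = 8000  # PSTN voice sampling rate
BITS_PER_SAMPLE = 8    # one 8-bit code word per sample

def sample(signal, duration_s, rate_hz=SAMPLE_RATE_HZ):
    """Sample a function of time at regular intervals."""
    count = round(duration_s * rate_hz)
    return [signal(n / rate_hz) for n in range(count)]

def quantize(value, bits=BITS_PER_SAMPLE):
    """Map a value in [-1.0, 1.0] to the index of a uniform interval."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, value))
    index = int((clipped + 1.0) / 2.0 * levels)
    return min(index, levels - 1)

def tone(t):
    """Hypothetical input: a 1 kHz sine wave with full-scale amplitude."""
    return math.sin(2 * math.pi * 1000 * t)

samples = sample(tone, 0.001)               # one millisecond of audio
code_words = [quantize(s) for s in samples]
bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

print(code_words)
print(bit_rate, "bits per second")
```

One millisecond of audio produces eight code words at this sampling rate, and the product of the sampling rate and the code word size gives 64,000 bits per second.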

Audio and Video Compression

Audio and video codecs use algorithms to compress the digitized audio and video signals before the sender transmits them, and then to decompress them on the receiving computer before they are played for the user. Using a codec for compression and decompression reduces network bandwidth utilization and minimizes network traffic load.

The conversion from analog to digital form and from digital back to analog form is performed by hardware. For example, the data is already digitized, but in a less compressed format, by the time it is received by the source filter. Figure 10 shows how codecs are used for video compression and decompression.

Figure 10: Video Compression and Decompression

Windows XP supports audio codecs for both SIP and H.323 IP telephony applications, as shown in Table 11.

Table 11 Audio Codecs Supported by Windows XP

Audio Codec	Sampling Rate	Bit Rate	Frame Size	Encoding Algorithm
DVI4	8 kHz	32 Kbps	20 ms	ADPCM
G.711	8 kHz	64 Kbps	20 ms	PCM (mu-law or A-law)
G.722.1 (SIP only)	16 kHz	24 Kbps	20 ms or 40 ms	MLT
G.723.1	8 kHz	5.3 Kbps / 6.3 Kbps	30 ms, 60 ms, or 90 ms	CELP
GSM6.10 (SIP only)	8 kHz	13 Kbps	20 ms	RPE-LTP
SIREN (SIP only)	16 kHz	16 Kbps	20 ms or 40 ms	MLT

Windows XP supports video codecs for both SIP and H.323 IP telephony applications, as shown in Table 12.

Table 12 Video Codecs Supported by Windows XP

Codec	Bit Rate	Encoding Algorithm
H.261	64 Kbps to 256 Kbps	DCT
H.263	16 Kbps to 256 Kbps	DCT

Audio Bandwidth Capacity

The codec used and its supporting quantization and compression algorithms determine the bandwidth needed to transmit voice and video data. For example, each analog voice call using PSTN requires 64 Kbps of bandwidth. This is derived from the encoding and compression algorithms used with companded PCM, which provides high quality for both voice and data.

One of the advantages of using IP telephony is the ability to utilize the latest improvements in codec technology. As noted above, one voice call over the PSTN uses a bit rate of 64 Kbps. Approximately 10 voice calls can be placed at the same bit rate on a packet-switched network when using the G.723.1 codec, which employs the Code Excited Linear Predictive (CELP) encoding algorithm. IP telephony also offers codecs, such as 16-kHz codecs, that provide better quality than the 8-kHz PSTN codec and that require less bandwidth than a PSTN call.

Note: Using a 16-kHz sampling rate increases the network requirement to 128 Kbps. With recent advances in audio codec and network technology, however, a 16-kHz sampling rate is not that expensive on an IP network.
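The figures in this section follow from simple arithmetic, sketched below. The function names are hypothetical, the G.723.1 bit rates come from Table 11, and the calculation deliberately ignores the RTP, UDP, and IP header overhead that a real packet stream adds.

```python
PSTN_CALL_KBPS = 64.0
G7231_KBPS = (5.3, 6.3)  # the two G.723.1 bit rates listed in Table 11

def calls_per_pstn_channel(codec_kbps):
    """How many codec streams fit in the bit rate of one PSTN call."""
    return PSTN_CALL_KBPS / codec_kbps

def uncompressed_kbps(sample_rate_khz, bits_per_sample=8):
    """Bit rate of uncompressed audio at the given sampling rate."""
    return sample_rate_khz * bits_per_sample


for rate in G7231_KBPS:
    print(f"G.723.1 at {rate} Kbps: {calls_per_pstn_channel(rate):.1f} calls")
print(f"16 kHz uncompressed: {uncompressed_kbps(16)} Kbps")
```

At 6.3 Kbps the ratio is slightly more than 10, which matches the estimate of approximately 10 calls; doubling the sampling rate to 16 kHz with 8-bit samples gives the 128 Kbps mentioned in the note.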

RTP and RTCP

After the data has been optimized for transmission over the packet-based network through digitization and compression, it is encapsulated within RTP. RTP is a real-time transport protocol, and RTCP is a control protocol used for monitoring RTP sessions. RTP and RTCP, defined in RFC 1889, “RTP: A Transport Protocol for Real-Time Applications,” were designed by IETF specifically to address the needs of real-time communication over a packet-based network. For more information about RTP and RTCP, see RFC 1889 on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.

Both SIP and H.323 make use of RTP for transferring digitized audio and video data between the various parties participating in a call. Each RTP packet contains one or more media payloads and other relevant information, such as time stamps and sequence numbers.

Typically, RTP and RTCP are used with UDP as the underlying transport layer and with IP as the underlying network layer. RTP uses dynamic UDP ports negotiated between the sender and receiver of specific media streams. However, RTP and RTCP are independent of the underlying transport and network layers and need not be used with UDP and IP as the transport and network protocols.

RTP

RTP provides end-to-end network transport for real-time applications, such as Windows Messenger and Phone Dialer. RTP contains information about the real-time session so applications can easily adjust for jitter, improper packet sequencing, and dropped packets. Much of this information is included in the RTP header.

Figure 11 shows the structure of an RTP packet.

Figure 11: RTP Packet Structure

Version	Identifies the version of RTP. Windows XP supports version 2.
Padding	If set to 1, then one or more additional padding octets have been appended to the end of the payload. The last padding octet indicates the number of padding octets that are included, counting itself.
Extension	If the extension bit is set, then there is an extension header appended to the fixed RTP header.
CSRC count	Contains the number of Contributing Source (CSRC) identifiers that follow the fixed RTP header.
Marker	The RTP profile determines the definition and use of the Marker bit.
Payload type	Defines the RTP payload type.
Sequence number	The initial sequence number starts with a random value and increases by increments of one for each RTP packet sent. This value can be used by real-time applications to determine packet loss and to restore proper packet sequencing.
Timestamp	The timestamp value represents the sampling instant of the first octet of the RTP packet. The sampling frequency used depends upon the data type. For example, when Windows XP uses the G.711 voice codec, the sampling frequency is set at 8 kHz.
Synchronization source (SSRC)	The SSRC value, which is chosen randomly, identifies the source of the RTP stream for each RTP session.
Contributing source (CSRC)	The CSRC list identifies the sources that contributed to the payload of a packet that combines multiple contributors to an RTP session. An RTP mixer adds the SSRC value of each contributing source to the CSRC list.
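The fields above occupy a 12-byte fixed header followed by the optional CSRC list. The sketch below decodes them with Python's struct module. The bit layout follows RFC 1889 rather than anything stated in this paper, and the function name parse_rtp_header and the sample packet are hypothetical.

```python
import struct

def parse_rtp_header(packet):
    """Decode the fixed RTP header and CSRC list from the start of a packet."""
    if len(packet) < 12:
        raise ValueError("an RTP packet has at least a 12-byte fixed header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = first & 0x0F
    csrc_end = 12 + 4 * csrc_count
    if len(packet) < csrc_end:
        raise ValueError("packet is shorter than its CSRC list")
    csrc_list = list(struct.unpack(f"!{csrc_count}I", packet[12:csrc_end]))
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "csrc_count": csrc_count,
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence_number": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc_list": csrc_list,
        # First byte after the CSRC list; an extension header, if present,
        # starts here and is not decoded by this sketch.
        "payload_offset": csrc_end,
    }


# Hypothetical packet: version 2, payload type 0, followed by 160 payload bytes.
sample_packet = struct.pack("!BBHII", 0x80, 0, 1000, 160, 0x12345678) + bytes(160)
print(parse_rtp_header(sample_packet))
```

A receiver could compare successive sequence_number values to detect loss or reordering, and use the timestamp values to schedule playback.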

RTCP

RTCP packets contain information regarding the quality of the RTP session and the individuals participating in the session. Both sender(s) and receiver(s) periodically transmit RTCP packets to each participant in an RTP session. A real-time application can use this information to monitor the quality of the RTP session; for example, to monitor jitter and packet loss.

There are five RTCP packet types, as shown in Table 13:

Table 13 RTCP Packet Types

SR (Sender Report)	Contains information regarding the quality of the RTP session.
RR (Receiver Report)	Contains information regarding the quality of the RTP session.
SDES (Source Description)	Contains information regarding the identity of each participant in the RTP session.
BYE (Goodbye)	Indicates that one or more sources are no longer active in the RTP session.
APP (Application-defined)	For experimental use by new applications.

Participants in an RTP session send RR packet types, and, if they are active senders, send SR packet types. The RR packet has two sections, the header and report blocks, as shown in Figure 12. There is one report block for each source.

RTCP RR Packet Sections
Header
Report Block 1
Report Block…n

Figure 12: RR Packet Structure

The SR packet structure, shown in Figure 13, differs in format from the RR packet only in that it includes a 20-byte section of sender information.

RTCP SR Packet Sections
Header
Sender Information
Report Block 1
Report Block…n

Figure 13: SR Packet Structure

Receiver Report and Sender Report header structure

The RR and SR header structure is shown in Figure 14. The only difference between the two headers is the value for the packet type.

Figure 14: RTCP RR and SR Header Structure

Version	Identifies the version of RTP. Windows XP supports version 2.
Padding	If set to 1, then one or more additional padding octets have been appended to the end of the packet. The last padding octet indicates how many padding octets are included, counting itself.
Reception Report Count (RC)	Indicates the number of reception blocks contained in the RTCP packet.
Packet Type	RTCP packet type. The value for an RR is 201 and for an SR is 200.
Length	Contains the length of the RTCP packet in 32-bit words minus 1.
SSRC	Contains the synchronization source identifier for the RTCP packet.
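The header fields above fit in eight bytes and can be decoded as sketched below. The packet type values for RR and SR are the ones given in the table; the values for the other three packet types, the bit layout, the function name parse_rtcp_header, and the sample header are assumptions taken from RFC 1889 or invented for illustration.

```python
import struct

# 200 and 201 come from the table above; 202 to 204 are assumed from RFC 1889.
RTCP_TYPES = {200: "SR", 201: "RR", 202: "SDES", 203: "BYE", 204: "APP"}

def parse_rtcp_header(packet):
    """Decode the 8-byte header shared by RTCP SR and RR packets."""
    if len(packet) < 8:
        raise ValueError("an SR or RR header is 8 bytes long")
    first, packet_type, length, ssrc = struct.unpack("!BBHI", packet[:8])
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "report_count": first & 0x1F,
        "packet_type": RTCP_TYPES.get(packet_type, "unknown"),
        # The length field counts 32-bit words minus one.
        "size_bytes": (length + 1) * 4,
        "ssrc": ssrc,
    }


# Hypothetical RR header: one report block, so 8 + 24 = 32 bytes = 8 words.
sample_header = struct.pack("!BBHI", 0x81, 201, 7, 0x12345678)
print(parse_rtcp_header(sample_header))
```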

The additional 20-byte sender information included in an SR packet is shown in Figure 15.

Figure 15: RTCP SR Information

NTP Timestamp	Contains the Network Time Protocol (NTP) time stamp or absolute wall clock time. If wall clock time is not available, then the sender can use the elapsed time since joining the RTP session for the NTP Timestamp value. If the elapsed time is used, then the high-order bit is set to zero. If neither wall clock time nor elapsed time is available, then the complete NTP Timestamp value is set to zero.
RTP Timestamp	Contains the same time as the NTP Timestamp, except that the RTP Timestamp is given in the same units and with the same random offset as the time stamp included in the header of the RTP packets.
Sender’s Packet Count	Contains the total number of RTP packets sent by the sender from the beginning of the RTP session up to the sending of this SR packet. This value is reset if, for some reason, the SSRC value of the sender has changed.
Sender’s Octet Count	Contains the total number of octets sent by the sender from the beginning of the RTP session up to the sending of this SR packet. This value is reset if, for some reason, the SSRC value of the sender changes.

Report block structure

SR and RR packets can contain zero or more report blocks, which are appended directly after the RTCP header (and, in an SR packet, after the sender information). One report block is included for each SSRC from which the receiver has received RTP data packets since it sent its last report.

The structure of report blocks is the same for both SR and RR packets, as shown in Figure 16.

Figure 16: RTCP Report Block Structure

SSRC_n	Contains the synchronization source identifier for each report block included in the RTCP packet.
Fraction Lost	Contains the fraction of RTP packets lost from the source (SSRC_n) since the last SR or RR packet was sent.
Cumulative Number of Packets Lost	Contains the total number of packets lost from the source (SSRC_n) since the initiation of the session. This value is derived from the sequence numbers found in RTP packets, where the dropped RTP packets are indicated by a gap in sequence numbering.
Extended Highest Sequence Number Received	This field is divided into two parts. The least significant 16 bits contain the highest sequence number received in an RTP packet from the source (SSRC_n). The most significant 16 bits contain the number of sequence number cycles.
Interarrival Jitter	Contains an estimate of the variance in the interarrival time of RTP packets. This value is measured in RTP time stamp units and is derived from the difference between packet spacing, as measured from both the receiver and sender for two packets.
Last SR Timestamp (LSR)	Contains the middle 32 bits of the 64-bit NTP time stamp taken from the most recent RTCP SR from source SSRC_n.
Delay Since Last SR (DLSR)	Contains the time difference between the receipt of the last SR packet from the source SSRC_n and sending this reception report block, where each tick of this counter represents 1/65536 seconds.
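
The report block fields can be decoded as shown in the following Python sketch. It is a minimal example under stated assumptions: the block is 24 bytes long, Fraction Lost is an 8-bit fixed-point value (so dividing by 256 gives the fraction), and Cumulative Number of Packets Lost is treated as a signed 24-bit value, which follows the RTP specification rather than anything stated in this paper. The names ReportBlock and parse_report_block are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportBlock:
    ssrc: int
    fraction_lost: float    # 0.0 to just under 1.0
    cumulative_lost: int
    sequence_cycles: int    # most significant 16 bits of the extended field
    highest_sequence: int   # least significant 16 bits of the extended field
    jitter: int             # in RTP time stamp units
    lsr: int                # middle 32 bits of the last SR's NTP time stamp
    dlsr_seconds: float     # DLSR converted from 1/65536-second ticks


def parse_report_block(block: bytes) -> ReportBlock:
    """Unpack one report block (six 32-bit words, network byte order)."""
    if len(block) < 24:
        raise ValueError("a report block is 24 bytes long")
    ssrc, loss_word, ext_seq, jitter, lsr, dlsr = struct.unpack("!IIIIII", block[:24])
    cumulative = loss_word & 0xFFFFFF
    if cumulative & 0x800000:       # assumption: signed 24-bit value
        cumulative -= 1 << 24
    return ReportBlock(
        ssrc=ssrc,
        fraction_lost=(loss_word >> 24) / 256,
        cumulative_lost=cumulative,
        sequence_cycles=ext_seq >> 16,
        highest_sequence=ext_seq & 0xFFFF,
        jitter=jitter,
        lsr=lsr,
        dlsr_seconds=dlsr / 65536,
    )


# Hand-built example: 25% recent loss, 10 packets lost overall,
# sequence number 500 in the third cycle, DLSR of half a second.
example = struct.pack("!IIIIII", 0x1234, (64 << 24) | 10, (2 << 16) | 500, 37, 0, 32768)
report = parse_report_block(example)
```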

Although RTP and RTCP are specifically designed for the needs of real-time communication over a packet-based network, they do not provide quality of service mechanisms. Instead, they leave quality of service issues to the underlying network and data-link layers.

Voice Quality Technologies

A circuit-switched network, such as the PSTN, provides a dedicated communication path between two end stations. Datagram-based packet-switched networks segment the original data into multiple packets, which are then separately routed through the network. By default, there is no dedicated path or bandwidth for datagram-based packet-switched networks. Because of these differences and the low tolerance for latency in real-time communications, toll-quality voice transmission can be obtained on a packet-switched network only after the following problems have been resolved:

Jitter	The variance in delay between packets. Voice transmission, unlike data transmission, is susceptible to the effects of jitter. Excessive variation in the delay between the sending of packets and their reception can cause uneven, difficult-to-hear voice communication.
Packet Loss	Voice communication over a packet-based network is less tolerant of dropped packets than the same type of communication over a circuit-switched network. Excessive packet loss can significantly degrade voice quality.
Packet Sequence	Because of the nature of voice communication, received packets need to be processed in the same order in which they were sent from the original source.
Acoustic Echo	Acoustic echo is the reflection of a sound signal. The power, or amplitude, of the acoustic echo and the amount of delay between the originating signal and the reflecting signal (the acoustic echo) determine whether the echo is detectable or bothersome to the person talking.

Windows XP provides several quality of service mechanisms, including jitter control, acoustic echo cancellation, and Quality of Service (QoS) protocols.

Jitter Control

RTP and RTCP provide information, such as time stamps and interarrival jitter values, that real-time communications applications can use to compensate for jitter during a session. An application’s jitter buffers use the time stamps and interarrival jitter values to make adjustments so that a smooth, even flow of packets is received.

Applications use the information received from the RTP and RTCP packets to calculate the difference in transit time for two packets. The calculation they use is:

D(n,n-1)=[R(n)-S(n)]-[R(n-1)-S(n-1)]

Here, D(n,n-1) is the difference in transit time between packets n and n-1, S(i) is the time at which packet i was sent, and R(i) is the time at which packet i was received.

The difference in transit time, D(n,n-1), is then used in the following formula, as described in RFC 1889, “RTP: A Transport Protocol for Real-Time Applications,” to determine the interarrival packet jitter, J(n), as a smoothed running value over an RTP session:

J(n)=J(n-1)+(|D(n,n-1)|- J(n-1))/16
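
The two formulas above can be combined into a short running calculation. The following Python sketch is illustrative only. It assumes that send and receive times are expressed in the same units (for example, RTP time stamp units) and that J starts at zero. The function name interarrival_jitter is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def interarrival_jitter(packets: Iterable[Tuple[float, float]]) -> List[float]:
    """Return J(n) after each packet, starting with the second one.

    Each item in `packets` is a (send_time, receive_time) pair, in arrival order.
    """
    jitter = 0.0
    previous_transit: Optional[float] = None
    history: List[float] = []
    for sent, received in packets:
        transit = received - sent                  # R(n) - S(n)
        if previous_transit is not None:
            d = transit - previous_transit         # D(n,n-1)
            jitter += (abs(d) - jitter) / 16       # J(n) = J(n-1) + (|D| - J(n-1))/16
            history.append(jitter)
        previous_transit = transit
    return history


# Three packets sent 160 units apart; the third one arrives 16 units late.
samples = [(0, 100), (160, 260), (320, 436)]
jitter_values = interarrival_jitter(samples)  # [0.0, 1.0]
```

Dividing by 16 means that a single late packet moves the estimate only slightly. In the example, a 16-unit change in transit time raises J by 1 unit.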

For more information, see RFC 1889 on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.

Note: Both Windows Messenger and Phone Dialer have built-in jitter buffers.

Acoustic Echo Canceller

When a computer is used for real-time communications, such as voice calls, call participants can experience acoustic echo. Using a headset, which has an integrated microphone and speakers, as opposed to using a separate microphone and speakers, can eliminate some acoustic echo. To better control acoustic echo, an acoustic echo canceller (AEC) is needed.

Note Windows XP includes AEC support in both the Windows Messenger client and in Windows TAPI version 3.1.

Quality of Service

The RTP and RTCP protocols, jitter control mechanisms, and acoustic echo canceller provide applications with information and tools to monitor and improve the quality of real-time communications; however, none of these protocols or technologies has control over the underlying networking environment. QoS, a combination of standards-based mechanisms such as the IETF-defined Differentiated Services (Diff-Serv) and IEEE 802.1p, is used to provide different levels of control over the underlying networking environment and to provide varying degrees of quality of service.

Note Windows XP supports QoS for applications that are written specifically to make calls to the Windows XP QoS APIs.
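
To give a sense of what Diff-Serv marking involves, the following Python sketch computes the IP header byte for a Diff-Serv code point (DSCP) and applies it to a socket through the generic IP_TOS socket option. This is an illustration only and does not use the Windows XP QoS APIs. The function names are hypothetical, the choice of code point 46 (Expedited Forwarding, commonly used for voice) is an assumption, and whether the operating system and the network honor the marking depends on the platform and on network policy.

```python
import socket

# Assumption: Expedited Forwarding (DSCP 46) as the class for voice traffic.
EXPEDITED_FORWARDING_DSCP = 46


def dscp_to_tos_byte(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper six bits of the IP TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0 to 63)")
    return dscp << 2


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Ask the IP stack to mark outgoing packets on this socket (may be ignored)."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos_byte(dscp))


tos_byte = dscp_to_tos_byte(EXPEDITED_FORWARDING_DSCP)  # 184, or 0xB8
```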

Measuring Voice Quality

The Mean Opinion Score (MOS) scale provides a tool for subjectively measuring and rating voice quality. The MOS scale ranges from 1 to 5, where 1 indicates poor quality and 5 indicates excellent quality. Voice quality on the PSTN, also referred to as toll quality, generally ranges between 4 and 5 on the MOS scale.

The MOS scores for audio codecs with a 16-kHz sampling rate, such as SIREN and G.722.1, are approximately 4. However, because these codecs use a different sampling rate than the 8-kHz codecs, the user experience is different, and their MOS scores are not directly comparable. Because 16-kHz codecs capture a wider range of frequencies, they render more natural sound and offer a more enjoyable user experience. As a result, voice transmission over a packet-based IP network can now provide better sound quality than voice transmission over the PSTN.

Note In voice transmissions over a packet-switched network, toll quality can be obtained only when latency is less than 200 milliseconds (ms). Nevertheless, even with a delay between 200 and 400 ms, a transmission is acceptable. But when the delay is greater than 400 ms, the audio connection is no longer acceptable.

The MOS scores for the audio codecs supported by Windows XP that use an 8-kHz sampling rate are shown in Figure 17.

Figure 17: Windows XP 8-kHz Sampling Rate Audio Codec MOS Scores

SIP Instant Messaging and Presence Language Extensions

SIP Instant Messaging and Presence Language Extensions (SIMPLE) allow users to send and receive instant real-time messages (generally text messages) and to know the current availability or status of other users. A general model for SIMPLE is described in RFC 2778, “A Model for Presence and Instant Messaging,” which is available on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.

The Instant Messaging model described in RFC 2778 defines communication between a server, defined as the Instant Message Service, and the clients, defined as either Senders or Instant Inboxes. When a message is sent from the Sender client to the Instant Message Service, the Instant Message Service forwards the message to the Instant Inbox client, as illustrated in Figure 18.

Figure 18: Instant Message Communication Flow
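
The message flow can be sketched as a small in-memory model. The following Python fragment is only an illustration of the roles named in RFC 2778. The class and method names are hypothetical, an Instant Inbox is represented by a callback, and no network protocol is involved, because RFC 2778 does not define one.

```python
from typing import Callable, Dict, List, Tuple

# An Instant Inbox is modeled as a callback that receives (sender, text).
InstantInbox = Callable[[str, str], None]


class InstantMessageService:
    """Accepts messages from Senders and forwards them to Instant Inboxes."""

    def __init__(self) -> None:
        self._inboxes: Dict[str, InstantInbox] = {}

    def register_inbox(self, address: str, inbox: InstantInbox) -> None:
        self._inboxes[address] = inbox

    def send(self, sender: str, recipient: str, text: str) -> bool:
        """Forward a message; return False if no inbox exists for the recipient."""
        inbox = self._inboxes.get(recipient)
        if inbox is None:
            return False
        inbox(sender, text)
        return True


service = InstantMessageService()
received: List[Tuple[str, str]] = []
service.register_inbox("bob", lambda sender, text: received.append((sender, text)))
delivered = service.send("alice", "bob", "Are you free for a call?")
undelivered = service.send("alice", "carol", "Hello")
```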

RFC 2778 defines the objects involved in the exchange and the communication among them; however, it does not specify the protocol to use for communicating presence and instant messaging information.

The Presence model described in RFC 2778 defines communication between a server, defined as the Presence Service, and the clients, defined as either Presentities or Watchers. The Presentity provides presence information to the Presence Service, and the Watcher receives presence information from the Presence Service.

There are two types of Watcher clients: Fetchers and Subscribers. A Fetcher requests only the current value of the presence information for a Presentity from the Presence Service. A Subscriber requests updates whenever the presence information for a Presentity changes. Figure 19 illustrates the relationship between presence clients.

Figure 19: Presence Clients
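
The difference between the two Watcher types can be shown with a short in-memory model. The following Python sketch is illustrative only. The class and method names are hypothetical, presence information is reduced to a simple status string, and a Subscriber is represented by a callback that the Presence Service invokes when a status changes.

```python
from typing import Callable, Dict, List, Optional

# A Subscriber is modeled as a callback that receives (presentity, status).
Subscriber = Callable[[str, str], None]


class PresenceService:
    """Stores presence information from Presentities and serves Watchers."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._subscribers: Dict[str, List[Subscriber]] = {}

    def publish(self, presentity: str, status: str) -> None:
        """Called by a Presentity; notifies Subscribers only when the value changes."""
        changed = self._status.get(presentity) != status
        self._status[presentity] = status
        if changed:
            for notify in self._subscribers.get(presentity, []):
                notify(presentity, status)

    def fetch(self, presentity: str) -> Optional[str]:
        """Called by a Fetcher: returns only the current value."""
        return self._status.get(presentity)

    def subscribe(self, presentity: str, notify: Subscriber) -> None:
        """Called by a Subscriber: requests updates on future changes."""
        self._subscribers.setdefault(presentity, []).append(notify)


presence = PresenceService()
updates: List[str] = []
presence.publish("alice", "online")
current = presence.fetch("alice")                     # Fetcher: one-time read
presence.subscribe("alice", lambda who, status: updates.append(status))
presence.publish("alice", "busy")                     # Subscriber is notified
presence.publish("alice", "busy")                     # no change, no notification
```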

SIP provides some presence information. For example, when a SIP user agent registers with a SIP registrar server, the presence or location of the SIP user agent is available from the SIP registrar server. This level of presence awareness allows the establishment of SIP-based calls; however, it does not allow SIP user agents to subscribe to other SIP user agents to obtain their presence information.

To provide SIP with the capabilities of Presence and Instant Messaging, two additional Internet drafts have been written: “SIP Extensions for Presence” and “SIP Extensions for Instant Messaging.” Two new SIP methods, SUBSCRIBE and NOTIFY, which provide presence capabilities in the SIP protocol, are defined in the “SIP Extensions for Presence” draft. One new SIP method, MESSAGE, which allows instant messaging capabilities in the SIP protocol, is defined in the “SIP Extensions for Instant Messaging” draft. For more information about SIP methods, see “SIP Protocol” earlier in this paper.
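
As a rough illustration of the MESSAGE method, the following Python sketch assembles a request that carries a short text body. It is not taken from the drafts named above. The header set follows the general SIP request format (a request line, header lines, a blank line, and a body), and every address, host name, tag, branch value, and Call-ID shown is a hypothetical placeholder. The function name build_message_request is also hypothetical.

```python
def build_message_request(
    sender_uri: str,
    recipient_uri: str,
    text: str,
    call_id: str,
    cseq: int = 1,
    via_host: str = "client.example.com",   # placeholder host
    branch: str = "z9hG4bKexample1",        # placeholder transaction identifier
    from_tag: str = "1928301774",           # placeholder tag
) -> bytes:
    """Build a SIP MESSAGE request with a text/plain body (illustrative only)."""
    body = text.encode("utf-8")
    lines = [
        f"MESSAGE {recipient_uri} SIP/2.0",
        f"Via: SIP/2.0/TCP {via_host};branch={branch}",
        "Max-Forwards: 70",
        f"From: <{sender_uri}>;tag={from_tag}",
        f"To: <{recipient_uri}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} MESSAGE",
        "Content-Type: text/plain",
        f"Content-Length: {len(body)}",  # length of the body in bytes
        "",                              # blank line that ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("utf-8") + body


request = build_message_request(
    "sip:alice@example.com", "sip:bob@example.com", "Hello", "a84b4c76e66710"
)
```

The method name appears both in the request line and in the CSeq header, and the body follows the blank line that ends the headers.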

Summary

The Microsoft real-time communications platform is based on industry standards and is designed for corporate multi-modal communication, such as voice and video communication, instant messaging, application sharing, and collaboration. Windows XP supports SIP, which is used for creating and terminating call sessions; various codecs, which convert voice and video signals to digital format and compress and decompress those signals for efficient transport; SDP, which describes multimedia sessions; and RTP and RTCP, which transport real-time media and monitor communications sessions. Additionally, Windows XP includes a number of voice quality technologies, which improve the quality of voice communication over packet-switched networks. By supporting SIMPLE, Windows XP also provides presence and instant messaging capabilities.

See the following resources for more information:

The Microsoft Windows Real-Time Communications Web site on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.
RFC 822: Standard for the Format of ARPA Internet Text Messages on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.
RFC 1889: RTP: A Transport Protocol for Real-Time Applications on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.
RFC 3261: SIP: Session Initiation Protocol on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.
RFC 2778: A Model for Presence and Instant Messaging on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp.
The Windows XP Web site on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/default.asp, for the latest information about Windows XP.
