Optimizing Lync 2010 Enterprise Voice Performance

Enhancing Lync 2010 Enterprise Voice experience and call quality

Technical White Paper

Published: January 2013

Download

Download Technical White Paper, 1.32 MB, Microsoft Word file

Situation

As part of the Microsoft Operations Framework, deploying and managing Enterprise Voice entails making service improvements to decrease costs, increase efficiencies, and provide better user experience. After consolidating PBX sites, improving data-center infrastructure, and deploying Microsoft Lync Server 2010 Enterprise Voice, users experienced intermittent performance and call quality issues across the organization.

Solution

To deliver the best service quality, Microsoft IT initiated a quality improvement project for Enterprise Voice. By investigating upstream and downstream service dependencies, Microsoft IT isolated root causes in the underlying network, user practices, and configuration. Through a systematic remediation of root causes, practicing proactive monitoring, and reporting on system metrics, Microsoft IT continues to improve Enterprise Voice performance for more than 100,000 users.

Benefits

  • Increase user satisfaction and adoption
  • Maximize existing investments in Enterprise Voice
  • Replace end-of-life hardware
  • Standardize and simplify network infrastructure and hardware
  • Develop and share best practices for Enterprise Voice architecture and operations
  • Facilitate reliable work-anywhere conferencing and voice services

Products & Technologies

  • Microsoft Lync Server 2010
  • Active Directory Domain Services
  • Windows Server 2008 R2
  • Enterprise Voice

Executive Summary

Providing highly effective communications tools to Microsoft employees and contingent staff (users) is critical to the success and future of Microsoft. Microsoft has deployed Microsoft Lync 2010 at the core of its Microsoft Unified Communications architecture to provide voice, video, and a wide range of collaboration tools that enable the workforce to maximize productivity and accelerate decision making and the sharing of ideas. Microsoft is a global company with a user base of more than 175,000 employees and vendors who occupy 200+ offices throughout the world. Additionally, tens of thousands of users work remotely or only occasionally use a Microsoft facility. Enabling this diverse global workforce to effectively communicate and collaborate is a top priority within Microsoft Information Technology (Microsoft IT).

The culture at Microsoft encourages team members to make connections and communicate in a unified way that frees individuals from having to work at a single location or be present physically to meet and collaborate as a group. Even when workers reside in close proximity to the Redmond (Washington) headquarters, it is common for members to work from home and while traveling.

To support this collaborative culture, Microsoft IT implemented Microsoft Lync Server 2010, which supports Enterprise Voice (EV) as the latest service offering that enables workers to make and receive voice calls to their client devices free from traditional location-based telephony. Workers can conference and communicate by using both Lync IP phones and the Lync softphone client from anywhere in the world that has an Internet connection. This simplicity of service untethers workers to enable greater mobility according to what teams need to succeed in their projects.

To help ensure the highest voice quality, Microsoft IT systematically investigated its entire EV infrastructure and developed best practices for client use. Microsoft IT operates an environment with thousands of network devices and hundreds of public switched telephone network (PSTN) gateways. Workers have access to headsets that undergo testing and validation as part of a rigorous approved-devices program to be compatible with Lync Server 2010. Microsoft IT supports EV in the majority of sites except smaller Internet-connected offices. To support high EV service reliability and quality in such a diverse environment, Microsoft IT engaged in a process of auditing its entire infrastructure to isolate and remedy root causes of performance issues.

During the processes of auditing and investigating root causes, Microsoft IT discovered improvement opportunities for replacing hardware, updating firmware on network devices, making configuration changes, informing users about best practices, and changing the monitoring and operational processes to help ensure high voice quality. These improvements ultimately entailed optimizing the infrastructure that Lync uses for EV, including the following:

  • Review configuration and traffic at each EV gateway and standardize gateway configurations.
  • Create best practices to educate workers on ways to optimize client experience.
  • Analyze traffic and resolve common network issues, such as Quality of Service (QoS) tagging persistence, high latencies, packet loss, and jitter.
  • Ensure that Internet Protocol security (IPsec) exceptions exist for voice traffic.

This paper contains information for technical decision makers and IT professionals who want to optimize the EV experience.

Note: For security reasons, the names of internal resources and organizations used in this paper do not represent real names used within Microsoft and are for illustration purposes only.

Enterprise Voice Infrastructure

Microsoft IT first deployed EV with Microsoft Office Communications Server 2007. By the time the product group finalized the release, Microsoft IT had already rolled out EV to more than 4,000 endpoints and users. By the end of the 2008 fiscal year, Microsoft IT had enabled more than 25,000 end users with EV. As of this writing, Microsoft IT has 102,000 users enabled across 136 offices, which represents 85 percent of all Microsoft employees. Remaining employees reside in countries where voice over IP (VoIP) regulations make EV difficult to establish or are in smaller Internet-connected offices.

The business case for the initial deployment focused on realizing the following EV benefits:

  • Increased worker mobility: EV provides features that enable users to seamlessly work outside the office. With traditional private branch exchanges (PBXs), the only way to work remotely was with mobile phones, and that system incurred extra costs for trunking and toll costs associated with call forwarding. EV enabled users to work wherever they needed to be without tying them to an office or incurring additional costs. Users can now place or receive phone calls from hardware phones or softphones when they are away from the office.
  • Building of infrastructure with future simplification and technologies in mind: The overall EV implementation strategy encompassed multiple phases, starting with deploying Office Communications Server 2007, then replacing existing PBX devices with PSTN gateways or Session Initiation Protocol (SIP) trunking depending on location, and then implementing Lync Server 2010. From the beginning, Microsoft IT anticipated and planned for a multiyear approach that incorporated making incremental improvements to gateway technologies, making product improvements, and adopting better client devices. These early planning and scoping activities helped Microsoft IT to align its resources and implementation schedule for EV to roll out gradually according to the phased-in deployment plan.
  • Lync as primary phone: Microsoft created a Unified Communications platform that included voice, whereas other platform providers started PBX replacement and bolted on Unified Communications applications. The user experience is at the center of the architecture, not an application that forces users to change their workflow in order to implement it.

Although implementing EV has occurred over several years, Microsoft IT continues to make this feature a priority in its production environment. As workers became aware of the new EV service offering, the demand from early adopters to be onboarded exceeded Microsoft IT's ability to meet the demand.

"In recent years, we at Microsoft have looked at work as something we do, not a place where we go. Lync Server 2010 Enterprise Voice helps us to do just that by making it possible to communicate from anywhere that has an Internet connection."

Jonathan Lewis
Sr. Service Manager
Microsoft Corporation

Topology

Each time Microsoft IT architects the next version of Lync, EV and conferencing considerations are the highest priority. The topology gives special attention to business continuity, high-quality media flows, load balancing, and Lync Edge Server role scenarios. The design that Microsoft IT used for Lync 2010 is shown in Figure 1 and uses the following approach:

  • Geographically distribute according to user location: The North America data center handles the majority of all users, approximately 60 percent. Dublin and Singapore data centers handle the other users in Asia, Europe, Australia, and the rest of the world.
  • Support business continuity: The configuration for disaster recovery in the Americas region consists of four identical pools (two in each data center) running in an active/active configuration where each data center can handle 100 percent of the expected traffic if an event requires one data center to handle the entire load. Microsoft IT plans to implement this same disaster-recovery architecture in the Asia Pacific and Japan (APJ) and Europe, Middle East, and Africa (EMEA) regions in the future.
  • Create identical PSTN gateway and mediation configuration for each site: Microsoft IT uses a standardized configuration for its 100+ sites. Every site uses two PSTN gateways for resiliency. All gateways use copied standard dial plans with the necessary customization made for the local site.

Figure 1. Server and data-center topology

Architecture

At its core, EV follows a relatively straightforward architecture that combines analog PSTN voice data with digital VoIP data. Two components are responsible for signaling, encoding, and routing: the PSTN gateway that helps to route traffic, and mediation servers that facilitate encoding and signaling.

Mediation servers act as intermediaries between the internal Lync Server network that uses RTAudio and RTVideo codecs and media gateways that use G.711 and G.723 codecs. In Lync Server 2010, an important feature for Microsoft IT that reduces server load and latency is media bypass, which enables endpoints to route audio data directly to VoIP gateways without first routing through a mediation server. Figure 2 shows the traffic flow and architecture.

Figure 2. Example EV edge configuration

To ensure adherence to security policy and facilitate remote access scenarios, Microsoft IT uses the Lync Edge Server role, configured with three IP addresses for access, audio/video (A/V), and web conferencing. You can find more details about the firewall and perimeter network in the "Deploying Lync Server 2010" white paper at http://technet.microsoft.com/library/hh745324.aspx or the Lync Server 2010 Resource Kit at http://www.microsoft.com/en-us/download/details.aspx?id=22644.

Workload

To optimize EV performance, Microsoft IT monitors bandwidth utilization to understand typical usage scenarios and ensure appropriate availability of wide area network (WAN) bandwidth. Table 1 lists the theoretical audio payloads, in kilobits per second (Kbps), for EV codecs.

Table 1. EV Codec Theoretical Payloads

| Codec | Scenario | Payload | Payload and IP header | Payload and IP header, plus User Datagram Protocol (UDP), Real-Time Transport Protocol (RTP), and Secure RTP (SRTP) | All that plus forward error correction (FEC) |
|---|---|---|---|---|---|
| RTA-wide | Peer-to-peer | 29.0 | 45.0 | 57.0 | 86.0 |
| RTA-narrow | Peer-to-peer, PSTN | 11.8 | 27.8 | 39.8 | 51.6 |
| G.711 | PSTN | 64.0 | 80.0 | 92.0 | 156.0 |
| G.722 | Conferencing | 64.0 | 80.0 | 95.6 | 159.6 |
| Siren | Conferencing | 16.0 | 32.0 | 47.6 | 63.6 |
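The columns of Table 1 can be read as a stack of per-layer overheads. The following is a minimal Python sketch that uses only the values in Table 1 to derive what each layer adds; the names `TABLE_1` and `layer_overheads` are illustrative and do not come from Lync or from this paper.

```python
# Values copied from Table 1: payload, + IP header, + UDP/RTP/SRTP, + FEC.
TABLE_1 = {
    "RTA-wide":   (29.0, 45.0, 57.0, 86.0),
    "RTA-narrow": (11.8, 27.8, 39.8, 51.6),
    "G.711":      (64.0, 80.0, 92.0, 156.0),
    "G.722":      (64.0, 80.0, 95.6, 159.6),
    "Siren":      (16.0, 32.0, 47.6, 63.6),
}


def layer_overheads(codec):
    """Return what each layer adds on top of the previous column of Table 1."""
    payload, with_ip, with_transport, with_fec = TABLE_1[codec]
    return {
        "ip_header": round(with_ip - payload, 1),
        "udp_rtp_srtp": round(with_transport - with_ip, 1),
        "fec": round(with_fec - with_transport, 1),
    }


for codec in TABLE_1:
    print(codec, layer_overheads(codec))
```

Run against the table, this shows that the IP header column adds the same 16.0 to every codec, and that the FEC column adds an amount equal to the codec's payload.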

Microsoft IT opted to increase its bandwidth modeling based on practical experiences of its user base. Table 2 shows the bandwidth utilization model that Microsoft IT uses. Values are in kilobits per second (Kbps).

Table 2. EV Microsoft Bandwidth Modeling

| Codec | Typical bandwidth | Maximum without FEC | Maximum with FEC |
|---|---|---|---|
| RTA-wide | 39.8 | 62 | 91 |
| RTA-narrow (peer-to-peer) | 29.3 | 44.8 | 56.6 |
| RTA-narrow (PSTN) | 30.9 | 44.8 | 56.6 |
| G.711 | 64.8 | 97 | 161 |
| G.722 | 46.1 | 100.6 | 164.6 |
| Siren | 25.5 | 52.6 | 68.6 |
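As an illustration of how a model like Table 2 might feed WAN sizing, the sketch below sums the modeled bandwidth for a mix of concurrent audio streams. It assumes each table value applies per audio stream; the site mix and the names `TABLE_2` and `estimate_bandwidth` are hypothetical and are not the tooling that Microsoft IT uses.

```python
# Values copied from Table 2: typical, maximum without FEC, maximum with FEC.
TABLE_2 = {
    "RTA-wide": (39.8, 62.0, 91.0),
    "RTA-narrow (peer-to-peer)": (29.3, 44.8, 56.6),
    "RTA-narrow (PSTN)": (30.9, 44.8, 56.6),
    "G.711": (64.8, 97.0, 161.0),
    "G.722": (46.1, 100.6, 164.6),
    "Siren": (25.5, 52.6, 68.6),
}

COLUMNS = {"typical": 0, "max": 1, "max_fec": 2}


def estimate_bandwidth(stream_counts, column="typical"):
    """Sum the modeled bandwidth for a mix of concurrent audio streams."""
    index = COLUMNS[column]
    total = sum(TABLE_2[codec][index] * count
                for codec, count in stream_counts.items())
    return round(total, 1)


# Hypothetical site: 40 wideband peer-to-peer streams and 25 G.711 PSTN streams.
mix = {"RTA-wide": 40, "G.711": 25}
print(estimate_bandwidth(mix, "typical"))  # 3212.0
print(estimate_bandwidth(mix, "max_fec"))  # 7665.0
```

Comparing the typical and maximum-with-FEC totals gives a rough range between expected load and worst-case load for the same call mix.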

For traffic routing to occur, the EV components use the routing and traffic flow path shown in Figure 3. The traffic relies on the following protocols:

  • Session Initiation Protocol (SIP): EV uses SIP for all aspects related to call signaling, including establishing a session, termination, and media negotiation between two parties. Signaling protocols such as SIP carry the IP addresses and ports of the call participants that receive RTP streams.
  • Transport Layer Security (TLS): Server-to-server, server-to-client, and front-end-to-mediation connections all rely on TLS or Mutual Transport Layer Security (MTLS).
  • SRTP: Gateways in the network are configured to use TLS and because of that, real-time data such as audio and video, including EV, use SRTP. SRTP uses 128-bit Advanced Encryption Standard (AES) stream encryption. Lync Server establishes a media path that can traverse firewalls and network address translations (NATs) before allowing A/V traffic to flow between two endpoints.
  • Interactive Connectivity Establishment (ICE), Simple Traversal of UDP through NAT (STUN), and Traversal Using Relay NAT (TURN): ICE specifies a protocol for setting up RTP streams in a way that enables the streams to traverse NAT and firewalls. Because NAT alters IP addresses and ports, connections may fail or experience quality issues. ICE uses protocols such as STUN and TURN to establish and verify connectivity. STUN reflects the NAT IP addresses of the external user's endpoint visible to the internal user's Lync client, which helps the external user's Lync client to determine which IP addresses other clients see across firewalls. TURN allocates media ports on the A/V edge server to allow the internal user's Lync endpoint to connect to the external user's Lync endpoint. The internal endpoint cannot connect directly to the external endpoint because of the corporate firewalls. Therefore, by dynamically allocating a media port on the A/V edge server, the internal endpoint can send media to the external endpoint over this port.
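To make the relationship between direct, STUN-discovered, and TURN-relayed paths more concrete, the sketch below computes candidate priorities with the formula and recommended type preferences from the ICE specification (RFC 5245). This is a generic illustration of ICE. The paper does not state which preference values Lync uses, so the constants here are an assumption taken from the RFC's recommendations.

```python
# Recommended type preferences from RFC 5245; actual Lync values are not
# stated in this paper and may differ.
TYPE_PREFERENCE = {
    "host": 126,    # address bound directly on the endpoint
    "srflx": 100,   # server-reflexive address learned through STUN
    "relay": 0,     # relayed address allocated through TURN
}


def candidate_priority(candidate_type, local_preference=65535, component_id=1):
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    type_preference = TYPE_PREFERENCE[candidate_type]
    return (type_preference << 24) + (local_preference << 8) + (256 - component_id)


# Higher priority is tried first, so a direct path is preferred over a
# STUN-discovered one, and a TURN relay is the last resort.
ordered = sorted(TYPE_PREFERENCE, key=candidate_priority, reverse=True)
print(ordered)  # ['host', 'srflx', 'relay']
```

The ordering reflects why relaying through the A/V edge server is used only when a more direct media path cannot be established.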

Figure 3. EV workload

For a comprehensive list of ports, protocols, and their uses, see "Ports and Protocols for Internal Servers" at http://technet.microsoft.com/library/gg398833.aspx.

Performance Investigations

After operating EV since 2008 and deploying Lync Server 2010 Enterprise Voice in 2011, Microsoft IT experienced performance and quality issues periodically in specific sites and sporadically across the entire organization. Improving service quality is a key aspect of following the Microsoft Operations Framework (MOF). After deploying and stabilizing a service, Microsoft IT engages in optimization and service improvement activities.

To keep track of quality-related issues and overall status, Microsoft IT implemented an internally developed reporting solution and assembled an audio-quality triage team dedicated to troubleshooting and resolving anything related to EV. Among other things, the team examines PSTN gateway performance reports and any user escalations related to voice quality to understand root causes. Table 3 shows a sample report.

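Table 3. Voice Quality Daily Report

| Name of gateway or segment | Location | Streams: All | Streams: Poor | Streams: Poor % | Duration < 30 seconds: Streams | Duration < 30 seconds: Poor % | Duration 30–60 seconds: Streams | Duration 30–60 seconds: Poor % |
|---|---|---|---|---|---|---|---|---|
| Redmond-33 | North America | 2,632 | 45 | 1.7% | 25 | 1% | 3 | 0.1% |
| Dublin-112 | EMEA | 1,015 | 10 | 1.0% | 15 | 1.5% | 2 | 0% |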

The reports extend in 20-second increments to 120 seconds of reported low voice quality, and they include network segments and gateways. As a finer layer of reporting detail, the reports also cover wired and wireless networks, media bypass calls, and conferencing performance for the PSTN gateway and mediation server.
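The overall poor-stream percentage in such a report is a simple ratio of poor streams to all streams. A minimal sketch, using the two sample rows from Table 3, shows the arithmetic and how a report could flag gateways for follow-up. The function names and the 1.5 percent threshold are hypothetical and do not describe the internal Microsoft IT reporting solution.

```python
def poor_percent(poor, total):
    """Percentage of poor streams, rounded to one decimal place."""
    return round(100.0 * poor / total, 1) if total else 0.0


# Rows copied from Table 3: (gateway or segment, all streams, poor streams).
ROWS = [
    ("Redmond-33", 2632, 45),
    ("Dublin-112", 1015, 10),
]


def flag_gateways(rows, threshold_percent):
    """Return the gateways whose poor-stream percentage exceeds the threshold."""
    return [name for name, total, poor in rows
            if poor_percent(poor, total) > threshold_percent]


for name, total, poor in ROWS:
    print(name, poor_percent(poor, total))

# The threshold is a hypothetical value chosen only for this example.
print(flag_gateways(ROWS, threshold_percent=1.5))
```

With the sample rows, the computed percentages match the "Poor %" values shown under Streams in Table 3.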

Teams

The dedicated team for handling EV quality issues and owning overall EV service consists of members from the following groups:

  • Online Services: Within Microsoft IT, the Online Services group includes infrastructure engineers, network architects, and service specialists who have a close familiarity with the Microsoft network and physical infrastructure. This group handles architecting of the service, deployment and support for Lync Server and other hosted services, and supporting internal Microsoft IT needs as well as hosted clients.
  • Microsoft IT: As a keystone part of the team, Microsoft IT as a whole owns the voice-quality improvement effort. This group is ultimately responsible for the end-user experience. It handles all network infrastructure and configuration not specifically assigned to Online Services, manages media gateways, communicates with users to determine root causes of performance issues, and generally manages the service and functionality of EV and Lync Server 2010.
  • Product group: The Lync Server product group connects with the other team members to understand real-world performance needs and results within Microsoft, in order to incorporate feature and functionality changes to Lync Server or create updates as necessary.

Processes

After rolling out Lync Server 2010 and accelerating EV onboarding, Microsoft IT focused on improving performance as part of its culture of improvement and following MOF standards. Users began escalating more service tickets to higher service tiers for resolution, including to IT executive leadership. Microsoft IT leadership formed the performance investigation team to conduct daily triage meetings until all issues are resolved. The team works together by using the following meeting cadence and work process:

  • Daily summary: Each morning, the team generates five reports on wired, wireless, media bypass, Audio/Video MultiPoint Control Unit (AVMCU), and media-gateway-to-mediation-server performance. For any discovered issues, a member of the networking team automatically follows up to obtain more information and try to determine the root cause. The reports, combined with weekly prioritization meetings, strike a balance between proactive improvements and investigation and reactive troubleshooting and issue resolution.
  • Weekly end-of-week retrospective: On Friday afternoon, the team creates an executive summary of all work done for the week, closes completed tasks and work items, and communicates progress to the executive leadership team about the status of remaining tasks. To report on progress, the team presents summary findings, status, risks, and other project status to executive leadership.
  • Executive triage: A twice-weekly audio-quality triage meeting occurs on Tuesday and Thursday. During the triage, the team also examines other data, such as monitoring server reports, to identify any new issues and to trend the performance of known sites that have issues.

Tools

The EV quality team uses many typical and some custom tools to gather information about performance issues. The tools fit into the following categories:

  • Issue qualification: The team investigates three types of issues: client experiences of reported sustained low voice quality, issues relevant to the entire network and data centers, and site-specific issues. As a first step, the investigating team member qualifies the issue and tries to obtain all relevant details about it. For an individual user, that may mean gathering statistics or network traces as close as possible to the time the issue occurred. For systemic issues, it may be a collaborative, multiple-week effort to determine possible root causes through close examination. The most helpful tool here is a custom script that the team created. It captures various details and a network trace from the user's computer, and then uploads that data to a shared directory for the team to investigate.
  • Monitoring and reporting: The Quality of Experience (QoE) and pre-call diagnostic data that Lync includes often provide valuable insights into possible root causes, especially when combined with overall monitoring. For example, in one case, Microsoft IT discovered that some sites had high CPU utilization that coincided with periods of low voice quality during a peak usage time of the day.
  • Logs and performance counters: Lync server roles have distinct performance characteristics for the type of tasks that each one performs. For example, A/V servers need CPU cycles, whereas monitoring servers need disk throughput. When site-specific or general issues occur, the team investigates logs and performance counters to try to correlate possible causes. This is also useful for examining traffic patterns among sites to identify possible root causes based on differences in the patterns.

Investigation Findings

EV traffic involves many Microsoft internal network devices and the Internet, in addition to servers running Lync server roles. In the flow of traffic, the complexity adds many possibilities for the root causes of performance issues. For example, the cause might be related to client configuration or due to use of a nonstandard computer, network issues, or configuration drift. Because EV depends on underlying services, such as Active Directory Domain Services (AD DS), the availability and performance of these dependencies also affects EV quality. Moreover, when workers, guests, or partners make audio and video calls over the Internet, there is no assurance of enforcing service quality because Microsoft does not own or manage the underlying infrastructure and has no ability to manage QoS on the Internet.

Because of the multiple variables that affect EV quality, Microsoft adopted the following approach for attempting to determine root causes of performance issues:

  1. Establish a baseline through a minimalist configuration: As a standard best practice for any type of performance analysis, it is critical to identify the pattern, configuration parameters, and settings that constitute acceptable voice quality. In doing this, Microsoft IT can spot traffic and performance anomalies in logs, key indicators, gateway and server configurations, and client devices.
  2. Investigate upstream and downstream causes: After Microsoft IT gathers data related to a specific performance issue, it examines the end-to-end traffic flow to identify the possible affected configuration items and devices that might be causing the issue. For example, if a user reports low voice quality and the indications show high levels of background noise with no network issues, the initial emphasis is on ruling out or verifying client performance as the root cause before moving to another possibility in the traffic flow, such as PSTN gateway or underlying network.
  3. Verify QoS for each segment: In many instances of voice quality issues, the root cause was that QoS settings were not applied end to end. Microsoft uses a Differentiated Services Code Point (DSCP) setting of 46, expedited forwarding, for audio. Whenever QoS is suspected as a root cause, Microsoft IT staff captures packets that are routed or switched in each network hop, such as from user computer to gateway and from gateway to mediation server, to verify that the DSCP setting is not removed or modified.
  4. Understand categories of possible causes: EV performance issues for Microsoft most often are due to QoS settings, configuration drift, and PSTN gateway implementation. To determine the possible causes, the investigator must know the universe of possibilities. During the process of examining possible root causes, Microsoft IT adds to the knowledge base to aid in troubleshooting.
  5. Drill down to root cause: From the available possibilities and investigation data, Microsoft IT validates or rules out possibilities in a diagnosis process until it establishes one or more root causes.
  6. Systemically audit and remediate: After determining root cause, Microsoft IT remediates it across the entire organization by auditing all devices across all sites. Although this was a manual process in the past, Microsoft IT has recently deployed monitoring technologies that automatically identify any devices in which QoS settings have been modified or fall outside the known good configuration.
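The per-hop check in step 3 can be sketched in a few lines. The DSCP value occupies the upper six bits of the Differentiated Services byte in the IPv4 header (the former Type of Service byte), so a DSCP of 46 appears in a packet capture as the byte 0xB8. The capture points and byte values below are hypothetical, and the function names are illustrative only.

```python
EXPECTED_DSCP = 46  # expedited forwarding, the audio marking named in step 3


def dscp_from_ds_byte(ds_byte):
    """Extract the DSCP, which is the upper six bits of the IPv4 DS byte."""
    return ds_byte >> 2


def first_remarked_hop(captures, expected=EXPECTED_DSCP):
    """Return the first capture point where the DSCP differs, or None."""
    for hop_name, ds_byte in captures:
        if dscp_from_ds_byte(ds_byte) != expected:
            return hop_name
    return None


# Hypothetical capture points, each with the DS byte seen on an audio packet.
captures = [
    ("user computer", 0xB8),
    ("access switch", 0xB8),
    ("WAN router", 0x00),  # marking removed at this hop
    ("gateway", 0x00),
]
print(first_remarked_hop(captures))  # WAN router
```

Walking the captures in path order points at the first device that removed or modified the marking, which is the device to inspect.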

"The importance of QoS cannot be overstated in our 100,000+ user Enterprise Voice production environment. We experienced a significant increase in quality after uniformly enforcing and auditing QoS settings."

Wayne Lewis
Team Lead
Microsoft Corporation

Quality of Service

Of all the root causes of performance issues, more than 50 percent are attributable to QoS in some way. By default, Lync Server 2010 does not have QoS enabled because it is designed to operate with acceptable performance levels over typical network configurations. Yet, QoS provides an optimization opportunity for Microsoft IT to treat Lync traffic as high priority.

The core idea behind QoS is to classify network traffic according to Differentiated Services (DiffServ). With DiffServ, each packet carries a DSCP marking, a number that identifies its traffic class. These markings tell each network device that handles the packets how to prioritize them, according to a policy defined in that device's configuration.

Lync Server relies on QoS through Windows Group Policy Objects (GPOs) to specify port ranges for each real-time communication type. Because the A/V edge service is not a domain member and thus does not pick up Windows GPOs, local Lync Server–specific settings are used for edge services. Microsoft IT marks traffic from all media services with a DSCP value of 46 so that all EV traffic at any point on the network is identified as high priority.

In optimizing EV performance, Microsoft IT quickly discovered that verifying QoS settings at every network device is crucial to high voice quality. Therefore, the team conducted a worldwide audit of every network device associated with EV-enabled sites to ensure that QoS is enabled and configured correctly. An additional technique available in Lync 2010, which Microsoft IT did not implement, is policing across the network by using Call Admission Control (CAC). CAC enables administrators to place restrictions on audio and video transmissions based on available bandwidth. Microsoft IT sizes its WAN links to accommodate peak loads at all sites.

In many sites, QoS worked as designed. Yet in other sites, some routers and networks experienced issues. Even when network devices run the latest and best configuration version and firmware, Microsoft IT discovered network configuration drift in its investigations. Since discovering configuration drift, Microsoft IT implemented a monitoring process to proactively audit all devices to ensure that these devices are running a known, good configuration. The monitoring process also makes sure that the devices have correct QoS settings across the organization and that users receive notifications when changes occur. Microsoft IT validated the end-to-end environment to provide the best user experience.
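A drift audit of this kind amounts to comparing each device's settings against a known, good baseline. The sketch below illustrates the comparison only. The setting names, device names, and values are hypothetical, and the monitoring technology that Microsoft IT deployed is not described in this paper.

```python
def find_drift(baseline, actual):
    """Return {setting: (expected, found)} for each setting that differs."""
    drift = {}
    for setting, expected in baseline.items():
        found = actual.get(setting)  # None when the setting is missing
        if found != expected:
            drift[setting] = (expected, found)
    return drift


def audit(baseline, devices):
    """Return only the devices that deviate from the baseline."""
    report = {}
    for name, config in devices.items():
        drift = find_drift(baseline, config)
        if drift:
            report[name] = drift
    return report


# Hypothetical known-good settings and device inventory.
KNOWN_GOOD = {"qos_enabled": True, "audio_dscp": 46}
DEVICES = {
    "router-a": {"qos_enabled": True, "audio_dscp": 46},
    "router-b": {"qos_enabled": True, "audio_dscp": 0},
    "switch-c": {"audio_dscp": 46},
}
print(audit(KNOWN_GOOD, DEVICES))
```

In this example, router-b is reported for a changed DSCP value and switch-c for a missing QoS setting, while router-a is omitted because it matches the baseline.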

When workers travel or work from locations other than Microsoft sites—such as coffee shops, airports, hotels, or home—they rely on a variety of Internet connections across wireless and wired networks. These connections typically have no QoS capabilities, with no assurance of having a high-quality link. In cases of supporting users who connected over public access points that multiple people share, Microsoft IT experienced performance issues that were due to limited bandwidth or limited capacity for the user volume.

For more information about QoS planning and concepts, see "Managing Quality of Service" at http://technet.microsoft.com/library/gg405409. For more information about deploying QoS for Lync Server 2010, see the deployment guide at http://www.microsoft.com/en-us/download/details.aspx?id=12633.

Media Gateway and Internet Ingress/Egress

As EV traffic traverses the perimeter network through a Lync edge server and the media gateway, or bypasses a mediation server to be encoded or decoded, the potential exists for quality degradation. Performance issues in this aspect of Lync traffic may stem from many causes, ranging from inadequately sizing Lync servers to under-provisioning the WAN bandwidth. The most commonly discovered root causes include the following:

  • Packet loss, jitter, and latency: The quality of the underlying connection and its bandwidth are critically important to maintaining high EV quality. Real-time communication requires connectivity that has minimal jitter and packet loss, and low latency. Microsoft IT sizes its WAN links to accommodate peak loads at all sites.
  • Bit rate limiting and WAN bandwidth contention: By default, Lync Server uses the RTAudio codec for encoding voice traffic for peer-to-peer calls and the G.711 codec for calls being routed to the PSTN via media bypass. The RTAudio codec is designed to operate on unmanaged networks and use available bandwidth while preserving quality. In peer-to-peer calls, this traffic traverses the WAN by using the available WAN bandwidth. During conference calls with many participants or in heavy load times, the WAN link may become saturated, leading to low call quality. Microsoft IT limits the rate on non-Lync traffic, in addition to QoS, to help ensure bandwidth availability. Lync Server 2010 also supports CAC to route voice traffic over the Internet in case of WAN saturation. For more information, see "Planning for Call Admission Control" at http://technet.microsoft.com/library/gg398842.
  • VPN overhead: VPN connections introduce additional packet and encryption overhead that results in low voice quality for some users. This happens even when other network indicators, such as latency, perform within the acceptable threshold. As a best practice, Microsoft IT recommends that workers who are using Lync outside the corporate network rely on normal Internet traffic that is not within a VPN connection. For more information about bypassing VPN tunnels, see "Enabling Lync Media to Bypass a VPN Tunnel" at http://blogs.technet.com/b/nexthop/archive/2011/11/15/enabling-lync-media-to-bypass-a-vpn-tunnel.aspx.
  • TCP/IP tuning: When Microsoft IT examines traffic for QoS tagging and overall performance settings, it also verifies that typical TCP/IP settings such as Maximum Transmission Unit (MTU) are optimized.
  • STUN-related DSCP tagging: Microsoft IT encountered a STUN issue related to a specific media gateway model where the media gateway did not tag RTP traffic with the correct DSCP setting after receiving it from Lync. After diagnosing the issue through network packet analysis, Microsoft IT implemented a fix specific to the gateway model.
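Jitter, one of the measures named in the list above, is commonly estimated with the interarrival jitter formula from the RTP specification (RFC 3550). The sketch below applies that formula to a list of send and arrival times. It is a generic illustration, not a description of how Lync QoE reports compute jitter, and the sample timings are hypothetical.

```python
def interarrival_jitter(packets):
    """Running jitter estimate from (send_time, arrival_time) pairs.

    For each packet, the change in transit time relative to the previous
    packet is smoothed with a gain of 1/16, as in RFC 3550. Both times
    must use the same unit, such as milliseconds.
    """
    jitter = 0.0
    previous_transit = None
    for sent, arrived in packets:
        transit = arrived - sent
        if previous_transit is not None:
            difference = abs(transit - previous_transit)
            jitter += (difference - jitter) / 16.0
        previous_transit = transit
    return jitter


# Hypothetical timings in milliseconds: a steady 50 ms transit, then one
# packet that is delayed by an extra 16 ms.
steady = [(0, 50), (20, 70), (40, 90)]
delayed = [(0, 50), (20, 86)]
print(interarrival_jitter(steady))   # 0.0
print(interarrival_jitter(delayed))  # 1.0
```

Because of the smoothing, a single late packet moves the estimate only slightly, while sustained variation in transit time raises it steadily.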

Service Dependencies

In the Microsoft corporate production environment, Microsoft IT relies on other services that have the potential to degrade EV quality if they are unavailable or have poor performance. The following dependencies have generated EV quality issues for Microsoft IT:

  • IPsec exceptions   All internal traffic must comply with Microsoft security policy, including IPsec encryption. However, voice traffic is an approved exception. When Microsoft IT deploys a gateway, it adds that gateway to the IPsec exception list. In managing its environment, Microsoft IT discovered that configuration drift sometimes removed exceptions, which then had to be added again.
  • Load balancing   The classic issue concerning load balancing is maintaining TLS sessions for session persistence and configuring routing to support bidirectional traffic flow with the media gateways. Microsoft IT uses a standardized approach to configurations, yet during audits it discovered that some gateway configurations had changed over time.
  • Microsoft Forefront scanning   Real-time voice traffic does not need to be scanned in the same way as data traffic. Therefore, Microsoft IT configures an exception on Forefront servers to eliminate scanning of Lync traffic. In troubleshooting the performance of each network segment, Microsoft IT discovered that configuration changes had in some cases removed the exception, which enforced scanning and introduced delay.
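
All three dependencies above failed in the same way: an expected setting disappeared over time. A minimal Python sketch of detecting that kind of drift follows. It assumes the expected exception list and the live exception list are both available as sets of gateway addresses. How those lists are obtained is outside the sketch, and the addresses and names shown are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class DriftReport:
    missing: frozenset     # expected entries absent from the live configuration
    unexpected: frozenset  # live entries that are not in the baseline


def detect_drift(baseline: Set[str], live: Set[str]) -> DriftReport:
    """Compare a live exception list against the known-good baseline."""
    return DriftReport(
        missing=frozenset(baseline - live),
        unexpected=frozenset(live - baseline),
    )


# Hypothetical gateway addresses from the documentation range 192.0.2.0/24.
baseline_exceptions = {"192.0.2.10", "192.0.2.11", "192.0.2.12"}
live_exceptions = {"192.0.2.10", "192.0.2.12", "192.0.2.99"}

report = detect_drift(baseline_exceptions, live_exceptions)
print(sorted(report.missing))     # exceptions that must be added again
print(sorted(report.unexpected))  # entries that nobody approved
```

Reporting unexpected entries alongside missing ones keeps the audit symmetric, so an unapproved exception is as visible as a lost one.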

Client Devices and Experience

Microsoft IT analyzes possible root causes of performance issues by examining each possible network device and segment that handles EV traffic. Many times, the cause rests in user configuration details or user practices and not in the underlying network. The following user practices and configurations proved to be a source of performance issues:

  • Approved devices   Microsoft IT tests and approves each Lync Certified device for internal usage. This helps ensure the best possible user experience and provides a consolidated list that users can consult when making purchasing decisions. Workers who used unapproved devices occasionally experienced performance issues.
  • Education   From the beginning of the Lync Server implementation, Microsoft IT made many online real-time, on-demand, and in-person resources available for communicating best practices across the organization. When users do not follow best practices, Microsoft IT makes sure that they are aware of the knowledge bases and learning opportunities.
  • Hardware issues   While working closely with individual users at sites to investigate performance issues, Microsoft IT found that some laptop models or network adapter chipsets caused performance issues. In these cases, Microsoft IT updated drivers or replaced hardware.

Improving User Satisfaction

Because traditional telephony services provide high levels of reliability, workers expect that same level of reliability for EV. After Microsoft IT concluded the majority of its improvement efforts, the overall percentage of calls that internal metrics and user reports deemed unacceptable in quality decreased from 2.4 percent to below 0.3 percent. Microsoft IT focused on user experience across all sites, covering EV functionality for conferences and voice calls as well as overall Lync Server usability. To that end, the performance improvement team relied on existing Microsoft IT programs, such as user training and education and the approved-devices program, in addition to individualized user outreach and information gathering to resolve performance issues.

User Education and Best Practices

As part of the Lync Server 2010 deployment, Microsoft IT organized dedicated knowledge repositories on the company intranet in the form of an adoption and training kit, available at http://lync.microsoft.com/adoption-and-training-kit/Pages/default.aspx. This kit, along with online self-guided, online instructor-led, and in-person education, formed the basis for sharing best practices around EV. As the team discovered new best practices and client configurations during the performance investigations, it updated the guidance and made it available to Helpdesk staff and to all users. The best practices for client configuration and use include the following:

  • Use wired network whenever possible   Using a wired network, even while on a Microsoft-managed site, helps to avoid any potential for wireless-related performance issues. Microsoft IT continues to collaborate with the Lync product group and wireless vendor to improve the technology, with the goal of achieving parity in performance and reliability between wireless and wired networks.
  • Implement optimized wireless settings   In situations where workers can control configuration of wireless access points for Internet access, Microsoft IT recommends using the 802.11n standard.
  • Use only approved devices   Microsoft IT tests devices and works closely with hardware manufacturers to help create headsets, phones, and other devices that perform well with Lync Server 2010. Microsoft IT evaluates devices based on supportability, return merchandise authorization (RMA) advance replacement process, uniqueness, compatibility, and performance, in addition to features that enhance usability and end-user experience. The evaluation process consists of the Lync Service Management team working closely with the Lync Server product group to conduct rigorous evaluation of the devices and ranking performance in various categories, such as quality of voice and form factor. Those that do not perform to high standards do not make the list.

Approved Devices

EV offers workers a device-agnostic way to use VoIP through Lync Server 2010 software, opening the possibility that device manufacturers can provide headsets and optional phones that connect directly to the network. To handle all these variations in preference and performance, yet provide guidance and recommendations for suggested devices, Microsoft IT runs an OEM evaluation program. In this program, manufacturers can send in devices for review, even at the prototype stage, to receive feedback from Microsoft as a Lync customer. To help users make choices, Microsoft IT maintains an approved list of recommended devices that meet its criteria and that provide an excellent user experience. This list also reflects devices that are supported by the Microsoft internal support organization.

Microsoft IT discovered that after users received devices with softphone functionality, they no longer relied on hardware phones. For example, many users who participated in conference calls took their laptops and continued a call while moving to another location, instead of terminating the call. Using a softphone approach greatly increased mobility for Microsoft workers because they could work from anywhere in the world, as long as they had an Internet connection.

Service Improvement Program

When Microsoft IT researched user experiences and performance issues as part of its service improvement program, it sought to interact with users in real time to capture data related to voice quality. The team took the following approaches to gathering data:

  • Site-specific surveys   Microsoft IT sought volunteers whose performance issues had been identified through escalations or internal metrics, and who were willing to provide detailed feedback and statistics to help resolve persistent issues.
  • Deep analysis of performance issues   Microsoft IT carefully examined the case of each volunteer to determine and remediate performance issues.

When performance issues happened consistently at a site, Microsoft IT pursued the following approach:

  1. Triage all the tickets to better understand underlying behavior and identify patterns that may contribute to performance issues.
  2. Recruit volunteers to answer surveys about specifics of performance issues, such as incidence rate during peer-to-peer calls or conferencing.
  3. Closely work with volunteers to obtain data for low-voice-quality events.

To facilitate data gathering as part of the service improvement program, Microsoft IT developed a custom script to run on client computers. This script automates gathering of Lync Server logs, Network Monitor data, and Performance Monitor data associated with low-voice-quality events, and uploads a .cab file to a file server. Typically, the script generates 5 to 10 files each day, which the performance improvement team investigates.
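
The paper does not include the script itself. The following Python sketch only illustrates the general shape of such a collector under stated assumptions: it bundles whichever diagnostic files exist into a single timestamped archive and copies it to a collection directory. It writes a .zip archive through the Python standard library instead of the .cab format that the Microsoft IT script produces, and every path, file name, and function name in it is hypothetical.

```python
import shutil
import tempfile
import zipfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable


def bundle_diagnostics(sources: Iterable[Path], staging_dir: Path, label: str) -> Path:
    """Compress the given files into one timestamped archive and return its path."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = staging_dir / f"{label}-{stamp}.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        for source in sources:
            if source.is_file():  # skip logs that were not produced this time
                bundle.write(source, arcname=source.name)
    return archive


def upload(archive: Path, share_dir: Path) -> Path:
    """Copy the archive to the collection share (any mounted directory)."""
    share_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(archive, share_dir))


# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "client.log").write_text("sample log line\n")
    archive = bundle_diagnostics(
        [root / "client.log", root / "capture.cap"],  # the capture does not exist
        root / "staging",
        "host01",
    )
    uploaded = upload(archive, root / "share")
    with zipfile.ZipFile(uploaded) as bundle:
        archived_names = bundle.namelist()

print(archived_names)
```

Skipping missing files rather than failing lets the collector still deliver a partial bundle when one of the capture tools was not running during the low-voice-quality event.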

Lessons Learned and Best Practices

In the course of optimizing EV performance, Microsoft IT developed the following lessons learned and best practices:

  • Be aware that users do not always call Helpdesk   Microsoft culture has a high tolerance for continuous improvement, with associated temporary periods of instability. Users often wait for services to stabilize and underreport issues. This cultural phenomenon makes it crucial for Microsoft IT to proactively monitor for performance issues and establish baseline standards and metrics.
  • Audit for configuration drift   Microsoft IT has, in previous efforts, implemented many improvements to monitor servers and IP-based devices. Yet before the EV performance optimization, no equivalent standard monitoring program existed for EV. With a monitoring and audit program in place, staff know when configuration changes result in a noncompliant state.
  • Standardize configuration and hardware   Another source of performance issues was the diversity of hardware implemented over the years. By standardizing hardware to a common set of devices as much as possible, Microsoft IT enables its staff to proactively manage the environment in a more centralized and unified way.
  • Replace end-of-life devices   Because manufacturers designed older network devices for historical networking needs, these devices are sometimes not optimized with QoS or EV in mind. In such cases, Microsoft IT updates firmware or replaces the devices.
  • Verify QoS at each network segment   For every persistent quality issue, Microsoft IT captured packets and examined QoS settings to ensure that each network device maintained DSCP tagging end to end.
  • Educate users   User configuration accounted for 20–30 percent of all performance issues. Educating users proved vital in helping to reduce reported tickets.
  • Use approved devices   Through its certification program, Microsoft tests devices and offers users a wide selection of devices that are proven to work with Lync.
  • Create performance baseline   In evaluating EV quality issues, it is helpful to establish a baseline of a known good configuration and expected performance, and to monitor the environment continuously and proactively to identify any devices that fall outside that known good configuration.
  • Establish voice quality ownership   With so many contributors and dependencies, Microsoft IT must ensure that the product group, operations, and its own engineers all collectively own voice quality to foster close collaboration and resolve issues.
  • Collect critical log data   Historical data and trending are useful to identify performance issues, but real-time information is necessary to troubleshoot real-time protocols and services such as EV. Microsoft IT developed a custom tool to facilitate data gathering when clients experience performance issues.
  • Clarify client experience   User behavior and client configuration were the root cause of a majority of tickets. Microsoft IT examines clients and user device configurations for details such as make and model, driver, type of conference or call, and wired and wireless networks.
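
The performance-baseline practice above can be illustrated with a small Python sketch that compares measured network metrics against known-good ceilings and lists the devices that fall outside them. The metric names, the threshold values, and the device names are all hypothetical placeholders. A real baseline would come from measurements of a known good configuration, not from the figures shown here.

```python
from typing import Dict, List

# Hypothetical known-good ceilings; a real baseline would come from measurement.
BASELINE_LIMITS: Dict[str, float] = {
    "latency_ms": 150.0,
    "jitter_ms": 30.0,
    "packet_loss_pct": 1.0,
}


def out_of_baseline(
    samples: Dict[str, Dict[str, float]],
    limits: Dict[str, float] = BASELINE_LIMITS,
) -> Dict[str, List[str]]:
    """Map each device to the sorted list of metrics that exceed the baseline.

    A metric that is missing from a device's sample is not flagged.
    """
    findings: Dict[str, List[str]] = {}
    for device, metrics in samples.items():
        exceeded = sorted(
            name for name, limit in limits.items()
            if metrics.get(name, 0.0) > limit
        )
        if exceeded:
            findings[device] = exceeded
    return findings


samples = {
    "gateway-a": {"latency_ms": 40.0, "jitter_ms": 5.0, "packet_loss_pct": 0.1},
    "gateway-b": {"latency_ms": 210.0, "jitter_ms": 12.0, "packet_loss_pct": 3.5},
}
print(out_of_baseline(samples))
```

Devices within the baseline are omitted from the result, so an empty dictionary means that nothing needs attention.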

Conclusion

In 2011, Microsoft IT implemented Lync Server 2010 EV as the latest service offering that enables workers to make and receive voice calls to their client devices free from traditional location-based telephony. As user count increased over the year, performance and quality issues began to arise periodically in specific sites and sporadically across the entire organization.

To support high EV service reliability and quality in such a diverse environment, Microsoft IT engaged in a process of auditing its entire infrastructure to isolate and remedy root causes of performance issues. As a result of the effort, the Lync operations team developed a set of lessons learned and best practices that customers may find beneficial to their own Lync environments.

By creating an audio-quality triage team, Microsoft IT discovered improvement opportunities for replacing hardware, updating firmware on network devices, making configuration changes, informing users about best practices, and changing the monitoring and operational processes to help ensure high voice quality.

To determine the root causes of performance issues, Microsoft IT established minimalist configuration baselines, investigated upstream and downstream causes, verified QoS for each segment, categorized possible causes, and remediated them.

The triage team reviewed configuration and traffic at each EV gateway, standardized gateway configurations, created best-practices guidance for users, analyzed traffic, and resolved common network issues such as QoS tagging persistence, high latencies, packet loss, and jitter. To help ensure high conversation quality, the team also enforced IPsec exceptions for voice traffic. QoS configuration proved to be a major root cause of performance issues. To resolve that issue, the team conducted a worldwide audit of every network device associated with EV-enabled sites to ensure that QoS was enabled and configured correctly. The team did not implement CAC in this case, though customers may find that feature beneficial in their environments.

Ultimately, user satisfaction improved. Operational best practices will continue to support user satisfaction as more than 100,000 users depend on the service as a critical tool for performing their work across the company.

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to:

http://www.microsoft.com

http://www.microsoft.com/technet/itshowcase

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred.

© 2013 Microsoft Corporation. All rights reserved.

Microsoft, Active Directory, Forefront, Lync, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

All other trademarks are property of their respective owners.
