Building Reliable Background Processes for a Business Critical Application with Windows Azure
Technical Case Study
Published: July 2012
The following content may no longer reflect Microsoft’s current position or infrastructure. This content should be viewed as reference documentation only, to inform IT business decisions within your own company or organization.
Microsoft IT (MSIT) has migrated its pricing exemption portal (BCWeb) to the Windows Azure platform, to leverage the scalability, extensibility, and cost savings that Windows Azure can provide. As part of the migration, Microsoft IT reused preexisting background processes to enable an expedited migration and reduce development time. During the second part of the migration process, Microsoft IT rearchitected the BCWeb background processes to optimize them for Windows Azure, increasing performance, reliability, and cost savings.
After migrating BCWeb to the Windows Azure platform, Microsoft IT recognized several areas where it could modify BCWeb to provide greater performance and reliability on Windows Azure, most notably in background processes.
The development team took advantage of the tools and methods available in Windows Azure to refactor application code and data-layer components in the BCWeb background services. These changes provided a more effective set of background processes for BCWeb, improving the application's performance and reliability.
BCWeb is an internal, web-based application that Microsoft uses to create business cases for product pricing exemptions. BCWeb is composed of three distinct application components: the core BCWeb component, the Workflow Routing and Approval system (WRAP), and Rapport. The core BCWeb component is responsible for providing a user interface (UI), and for the underlying functionality that enables users to generate business cases for pricing exceptions. WRAP routes the pricing exception requests for approval within the Microsoft corporate infrastructure. Rapport provides a user interface for the WRAP approval process. BCWeb has a user base of 2,500 internal Microsoft employees. In 2011, Microsoft used BCWeb to process almost 40,000 pricing exception requests.
BCWeb was migrated to Windows Azure™ as a pilot project to develop and capture best practices for migrating enterprise applications to Windows Azure.
BCWeb is implemented as a hybrid application. While the core BCWeb components are hosted on the Windows Azure platform, BCWeb also contains several components that are hosted on the MSIT corporate network, and external to the Windows Azure platform.
BCWeb is divided into three distinct Windows Azure Services, which in turn house the main application components: BCWeb, WRAP, and Rapport. The three applications are separated by design, to enable a modular approach to application updates and refactoring.
Windows Azure Components
The first component application—the BCWeb core—is implemented as a Windows Azure Web role that hosts the UI for generating business-case documents. BCWeb uses two worker roles. The first worker role hosts the core BCWeb Service and other services based on Windows® Communication Foundation (WCF), and the second worker role hosts background and notification processes that the BCWeb application uses. The WRAP application is implemented as a multiple instance worker role that contains all of the necessary services that are required to perform the routing and approval operations for BCWeb–generated business-case documents. The Rapport Windows Azure Service hosts the Rapport application. Rapport is composed of a web role that hosts the UI, and a worker role that hosts the Rapport WCF Service. SQL Azure™ databases host native data storage for the entire BCWeb application infrastructure.
On-Premises Distributed Components
BCWeb includes several critical components that the Windows Azure platform does not host. These components primarily provide access to external data that BCWeb functionality requires. The two primary external components are SAP (for business data), and the Microsoft corporate Active Directory® Domain Services database, for infrastructure and organizational data. Both of these components are outside the BCWeb management scope, but are critical to its functionality. Both components also are hosted on-premises within the Microsoft corporate network. An on-premises database—the Licensing Information Repository (LIR)—hosts information used for data warehousing. For reporting purposes, the BCWeb transactional SQL Azure databases perform ongoing information exports to the on-premises LIR database, which is hosted on Microsoft SQL Server®.
BCWeb Windows Azure Architecture
The following diagram represents the BCWeb logical structure, after the application was migrated to Windows Azure:
Figure 1. BCWeb Windows Azure Architecture
The primary reason for migrating BCWeb to Windows Azure was to serve as a migration pilot project. However, the original version of BCWeb also was experiencing performance and reliability issues in its previous environment.
Although the Windows Azure migration brought increased reliability and performance to BCWeb, the operations and development team noticed several areas in which it could modify BCWeb to provide greater performance on Windows Azure.
Many of the background processes and service-oriented components of BCWeb had been migrated to Windows Azure with relatively few code modifications from the previous version, which was not based on Windows Azure. While this made for a much smoother migration process, it also meant that many of the background processes were not designed for the Windows Azure platform. Therefore, many background processes did not take full advantage of the Windows Azure platform capabilities.
BCWeb Performance and Reliability Issues
MSIT identified BCWeb performance and reliability issues in several different areas, including:
Concurrent processes within the background processes were altering data and performing tasks at the same time, because multiple instances were being implemented in Windows Azure. This concurrency was not controlled within the background processes, and resulted in situations where multiple processes attempted to modify the same application data.
Several aspects of connectivity and data transfer between components did not have effective methods for retrying connections and ensuring data consistency. Much of this occurred because of the Windows Azure integration with on-premises components.
The team found that workload and traffic were not balanced effectively among the multiple instances of the worker roles that were hosting background processes. Some instances were being utilized heavily, while others experienced relatively low usage, depending on what was occurring in the application.
Data connections between application components did not have built-in methods to handle component failure. For example, when a background process failed or an instance was unavailable to process requests, the application sometimes could not determine where the failure occurred or where to start the recovery process.
- High Availability
BCWeb needed to have coding structure that enabled a more graceful process for failover between instances, and which ensured a greater level of availability for background processes.
Overall performance was a concern for the development team. They felt that the refactoring of background processes and focusing on the points listed above would increase overall BCWeb performance.
To ensure an application structure that would perform better, the development team knew that it needed to refactor the BCWeb background processes to better suit the Windows Azure environment.
The development team outlined several design goals for the refactoring of the BCWeb background processes, based on their observation of shortcomings in coding and data structures. These goals included:
- Design for concurrency
The team planned to alter BCWeb code to provide more effective control over concurrency. They knew that they needed to ensure that duplicate tasks and workflow items were not creating inconsistent data or presenting incorrect information to the application's users. They understood that the key to managing concurrency was to design the background process code so that it is aware of other instances that may be performing the same tasks. This enables the instance to respond accordingly.
- Design for increased reliability
Increased reliability was one of the design goals that the team knew would come from the overall refactoring of BCWeb background processes. However, there were some refactoring elements that would be directed specifically toward increasing BCWeb reliability.
- Provide increased recovery capability and resiliency
Recovery capability was an extremely important goal for the team. They wanted to ensure that any failures within processes were handled in such a way that it allowed the application to resume from the point of failure while simultaneously not impacting users significantly or resulting in inconsistent data.
- Design for high availability
Designing for high availability was a very specific goal the development team established. They wanted to ensure that the application processed workloads in a synchronized manner, so that the individual load on each background-process instance was not significantly smaller or larger than the load on any other instance. Additionally, the design team wanted to ensure that workloads could run in an asynchronous, parallel manner to increase performance.
- Enable reliable QA testing and models
The team also wanted to improve the capability to perform accurate and reliable quality assurance (QA) and testing within the environment. While refactoring, they planned to build this capability into the components that they were modifying.
General Design Plan
Once the development team established its design goals, they examined the BCWeb architecture to identify the key targets for refactoring, which included:
The core BCWeb service primarily is responsible for generating business cases for pricing exceptions and initiating the processes associated with each business case. There were several aspects of the BCWeb Service that required refactoring to meet the design goals.
- WRAP service
The WRAP service controls the workflow for routing requests and approvals for generated business cases.
- WCF services
BCWeb also contained several WCF services that were responsible for carrying out a variety of tasks related to BCWeb and WRAP functionality. Many of these services played a part in the aspects of the application that the team planned to modify with its design goals.
- SQL Azure databases
The team also identified several changes that were necessary in the data-access layer, including changes to the SQL Azure structure and code.
Solution Challenges and Design Refactoring
To meet their design goals, the team implemented several changes that worked in concert to increase the reliability of BCWeb and make the application perform better.
Service Design for Multiple Instances in the BCWeb Service
There are several places within the BCWeb application infrastructure in which multiple instances are performing the same processes. In Windows Azure, this enables scalability, which means that as application demand grows, IT administrators or operations staff can add instances to a Windows Azure role. Each instance contains a discrete amount of computing resources, such as processor, random access memory (RAM), and network bandwidth. When an instance is added to a role, the available pool of computing resources increases, along with the role's ability to handle an increased workload. Multiple instances also provide a means for application availability. If a process fails or has to restart in one instance, other instances in the role can take over and continue processing the request.
In the case of BCWeb, the original code for the BCWeb service, mostly reused from the old version, was not designed for an environment with multiple instances. As a result, the service's multiple instances were causing several issues, including:
- Multiple instances could be performing the same job simultaneously, which was not desired behavior for tasks that needed to be processed synchronously, such as modifying and querying AD DS domain data.
- Instances that restarted or failed in the middle of a job did not mark a resume point, so the job could not be continued correctly when the instance finished restarting or when another instance took over the job.
To overcome these shortcomings, the development team refactored the code to perform certain tasks synchronously, within a role. The result was that if a task involved synchronous data and was started in an instance, it would not be started in another instance. Furthermore, the development team altered the services to confirm and mark the completion of each step in a task, so that if the current instance running the service failed, another instance could continue the task without duplicating steps.
They also made modifications to SQL Azure code, so that the data-access layer could store the status of an incomplete task. This enabled instances to pick up an incomplete process by querying the SQL Azure database. The team also ensured that processes were maintaining consistency of application data by using database locking or more restrictive transaction-isolation levels within SQL Azure.
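The claim-and-checkpoint approach described above can be illustrated with a minimal sketch. It is not the BCWeb implementation: an in-memory SQLite database stands in for SQL Azure, and the `task` table, its columns, and the function names `claim`, `release`, and `run` are all hypothetical. The sketch assumes that a single conditional UPDATE is enough to decide which instance owns a task, and that the number of the last completed step is stored after every step.

```python
import sqlite3


def open_store():
    # In-memory SQLite stands in for the SQL Azure task table (assumption).
    db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
    db.execute(
        "CREATE TABLE task ("
        " id TEXT PRIMARY KEY,"
        " owner TEXT,"
        " last_step INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def claim(db, task_id, instance):
    """Try to take ownership of a task; return True if this instance owns it."""
    db.execute("INSERT OR IGNORE INTO task (id) VALUES (?)", (task_id,))
    # The conditional UPDATE succeeds for at most one competing instance.
    cur = db.execute(
        "UPDATE task SET owner = ? WHERE id = ? AND (owner IS NULL OR owner = ?)",
        (instance, task_id, instance),
    )
    return cur.rowcount == 1


def release(db, task_id):
    """Clear ownership, for example after the owner is found to have failed."""
    db.execute("UPDATE task SET owner = NULL WHERE id = ?", (task_id,))


def run(db, task_id, instance, steps, fail_at=None):
    """Run the steps that are not yet marked complete; return how many ran."""
    if not claim(db, task_id, instance):
        return 0  # another instance is already working on this task
    (done,) = db.execute(
        "SELECT last_step FROM task WHERE id = ?", (task_id,)
    ).fetchone()
    ran = 0
    for number, step in enumerate(steps, start=1):
        if number <= done:
            continue  # completed earlier, possibly by another instance
        if number == fail_at:
            raise RuntimeError("simulated instance failure")
        step()
        # Checkpoint: record the step as complete before moving on.
        db.execute("UPDATE task SET last_step = ? WHERE id = ?", (number, task_id))
        ran += 1
    return ran
```

In this sketch, a second instance that calls `run` for a task owned by a live instance does nothing, while an instance that takes over after `release` skips every step that was already checkpointed.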
Finally, the team incorporated retry logic into the code base. With the retry logic, processes would not fail outright, simply because an external component was unavailable. For example, if a process attempted to update data in the SQL Azure database that was locked by another process, it wouldn't fail. Instead, it would retry the update until it succeeded or reached the timeout threshold for the process.
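A retry loop of this kind might look like the following sketch. The case study does not describe the actual retry policy, so the exponential backoff, the default values, and the names `retry` and `TransientError` are assumptions made for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a locked row."""


def retry(operation, timeout=5.0, delay=0.1, max_delay=2.0,
          sleep=time.sleep, clock=time.monotonic):
    """Call operation until it succeeds or the timeout threshold is reached."""
    deadline = clock() + timeout
    while True:
        try:
            return operation()
        except TransientError:
            # Give up if waiting again would pass the deadline.
            if clock() + delay > deadline:
                raise TimeoutError("operation did not succeed before the timeout")
            sleep(delay)
            delay = min(delay * 2, max_delay)  # back off, up to a ceiling
```

Only errors that are expected to clear on their own are retried; any other exception propagates immediately so that genuine faults are not hidden behind repeated attempts.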
Throttling Processes and Maintaining Workload Balancing
The development team determined that another important aspect to modifying the BCWeb service architecture was accounting for even workload distribution between instances of the Windows Azure worker role. The way in which the BCWeb service was designed in the original BCWeb version did not account for a multiple-instance environment. Therefore, processes were being distributed in an uneven manner, which resulted in heavy workloads for some instances and virtually no workload for others.
The development team implemented a throttling method in the BCWeb service code that limited a specific instance's workload, based on the computing resources allocated to the instance. This way, an instance would not run any more processes than its computing resources could support adequately. When an instance's throttling threshold was reached, processes would be allocated to other available instances.
The development team implemented the throttling so that an instance's throttling level could be changed without modifying the code itself. This allowed the operations team to monitor and control throttling to ensure the smooth and uninterrupted operation of BCWeb.
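As a rough sketch of per-instance throttling with an adjustable limit, the following code counts active jobs and refuses new work once a limit is reached. How BCWeb actually stores and reads its throttling level is not described in the case study; the environment variable name `BCWEB_MAX_JOBS`, the default of 4, and the `Throttle` class are hypothetical.

```python
import os
import threading


class Throttle:
    """Caps the number of jobs one instance runs at the same time."""

    def __init__(self, limit):
        self._limit = limit
        self._active = 0
        self._lock = threading.Lock()

    def try_start(self):
        """Reserve a slot; return False when the instance is at its limit."""
        with self._lock:
            if self._active >= self._limit:
                return False  # caller leaves the job for another instance
            self._active += 1
            return True

    def finish(self):
        """Free the slot held by a completed job."""
        with self._lock:
            self._active -= 1


def limit_from_config(default=4):
    # BCWEB_MAX_JOBS is a hypothetical setting name; reading the limit from
    # configuration lets operations staff change it without a code change.
    return int(os.environ.get("BCWEB_MAX_JOBS", default))
```

A worker would call `try_start` before taking a job and `finish` when the job ends; a job that is refused stays available for an instance with spare capacity.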
WRAP Workflow Refactoring
Several aspects of the WRAP workflow routing service needed to be refactored to make the workflow process resilient in the event of failure, and improve overall workflow-processing performance.
Unlike the BCWeb service, WRAP required no significant concurrency controls. For the most part, the approval workflow process is a sequential task, and workflow items are processed in a synchronous manner.
However, some aspects of the changes made to the BCWeb service also carried over to WRAP. An extremely important change was refactoring to enable recovery from failure. Because the WRAP workflow is so dependent on sequential task processing, the team needed to ensure that the state of a WRAP workflow was always preserved and updated as the workflow process continued. Similar to the BCWeb service, the development team used SQL Azure to store workflow state, in case of instance failure. The WRAP service stored not only the state of the workflow process, but also any data that was being modified. The team stored workflow states by making use of SQL Azure database transactions and changes to the SQL Azure database structure.
Now, when a failure occurs, another instance identifies the unfinished workflow process in SQL Azure, and picks up where the failed instance left off. As a result, there is no data loss, and the workflow process is preserved. If the failure occurs in such a way that the workflow is not recoverable, the workflow is restarted, and the changes made to the database are rolled back by using SQL Azure transaction-log management. Because of these changes, the state of the workflow process can be recreated, and the context and data associated with the state also are stored, and then provided to the instance that resumes the workflow. This design change allows the worker-role instances in the WRAP service to recycle gracefully, without compromising application data.
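One way to keep workflow state and workflow data consistent is to write both in a single database transaction, so that a failure partway through a step leaves neither change behind. The sketch below illustrates that idea only. It uses an ordinary transaction rollback in an in-memory SQLite database as a stand-in for SQL Azure, and the `workflow` and `approval` tables and the state names are hypothetical.

```python
import sqlite3


def open_store():
    # In-memory SQLite stands in for the SQL Azure workflow tables (assumption).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE workflow (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    db.execute("CREATE TABLE approval (workflow_id TEXT, approver TEXT)")
    db.execute("INSERT INTO workflow VALUES ('wf-1', 'submitted')")
    db.commit()
    return db


def advance(db, workflow_id, approver, new_state, fail=False):
    """Record an approval and the new workflow state as a single unit."""
    with db:  # commits on success, rolls back if the block raises
        db.execute("INSERT INTO approval VALUES (?, ?)", (workflow_id, approver))
        if fail:
            raise RuntimeError("simulated failure in the middle of a step")
        db.execute(
            "UPDATE workflow SET state = ? WHERE id = ?", (new_state, workflow_id)
        )


def snapshot(db, workflow_id):
    """Return the stored state and the number of recorded approvals."""
    (state,) = db.execute(
        "SELECT state FROM workflow WHERE id = ?", (workflow_id,)
    ).fetchone()
    (count,) = db.execute(
        "SELECT COUNT(*) FROM approval WHERE workflow_id = ?", (workflow_id,)
    ).fetchone()
    return state, count
```

After a simulated failure, `snapshot` still reports the previous state with no partial approval row, so an instance that resumes the workflow can repeat the step safely.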
SQL Azure Database Modifications
As previously mentioned, several aspects of SQL Azure database structure and code were modified to fulfill the development team's design goals. Specifically, they accounted for concurrency in data modification by duplicate processes.
Some elements of concurrency management are handled at the application level in both the BCWeb and WRAP processes, but the majority are handled in SQL Azure, where concurrency and duplicate data management is critical to the integrity of the BCWeb application.
However, concurrent queries to SQL Azure may still occur, because several conditions, such as instance load-balancing timeouts or query-retry failure, are not always detectable by the service-level code changes. The team has built logic into the database structure, including enforced uniqueness across multiple table columns, so that duplicate insert and update queries cannot create duplicate data. They also have made modifications to stored procedures, for the same purpose.
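The effect of enforcing uniqueness across several columns can be sketched as follows. The table and column names are hypothetical, and SQLite stands in for SQL Azure; the point is only that the database rejects a duplicate request even when the service code fails to detect it.

```python
import sqlite3


def open_store():
    # In-memory SQLite stands in for SQL Azure (assumption).
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE notification ("
        " case_id TEXT NOT NULL,"
        " step TEXT NOT NULL,"
        " recipient TEXT NOT NULL,"
        " UNIQUE (case_id, step, recipient))"  # uniqueness across columns
    )
    return db


def record_once(db, case_id, step, recipient):
    """Insert the row; return False if an identical request was already stored."""
    try:
        with db:
            db.execute(
                "INSERT INTO notification VALUES (?, ?, ?)",
                (case_id, step, recipient),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate from a retry or a concurrent instance
```

A caller can treat a `False` result as confirmation that the work was already recorded, which makes a retried or duplicated request harmless.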
Performance Improvements to the WCF Service
Several WCF-based services provide functionality to BCWeb. A number of changes to these services, hosted in several Windows Azure worker roles, were made to fulfill design goals, including:
- Improve relay times for Service Bus
Windows Azure Service Bus was used extensively in BCWeb to manage communication between application components. The team noticed several aspects of Service Bus behavior that needed to be monitored and managed to ensure the most efficient data exchange between components.
The team made changes to keep Service Bus connections open throughout the life of a process, which avoids the extra time taken to open and close Service Bus connections. While keeping the connection open was a simple code change, the team also had to account for the Service Bus idle-timeout period by building keep-alive logic into the WCF services. This ensured that the connection would remain open.
- Design WCF custom binding
When WCF services go offline, because of failure or instance restart, it is possible that requests are sent to a service while it is offline. Typically, when the request fails, it is passed to another instance of the WCF service. However, there were cases where WCF services that were coming back online tried to process requests that were made while they were offline. The team built logic into the WCF service to ensure that concurrent loads were distributed evenly between instances of the WCF services. It also built in thresholds for authentication requests to ensure that a single instance was not causing a backlog in synchronous processing.
- Enable a graceful role-recycle termination process
In the Windows Azure environment, role instances will perform a recycling process. This ensures the reliability of instance performance and provides a failsafe should any processes within the instance cause it to stop responding to application requests.
In BCWeb, the team discovered that unmanaged role recycling could lead to data loss or other undesired application behavior. The team built logic into the application code to control the recycling process.
When an instance is ready to recycle, the logic ensures that current processes are completed before the recycle is performed, unless a timeout threshold is reached. If the timeout threshold is reached, perhaps because of an unresponsive process, the instance will recycle.
- Decrease business process latency by designing event-driven architecture
Throughout the BCWeb architecture, polling was used extensively to check for process status, and confirm task completion and various other state-related tests. To improve overall performance, and provide a more responsive and effective status management, the team redesigned several aspects of this behavior to be event-driven. In an event-driven scenario, when process "A" completes a task, it pushes notification to process "B", which is waiting for the task to complete. This results in decreased wait time and latency when compared to a pull notification, where process "B" continually polls process "A" to check for completion.
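The push model described in the last item can be illustrated with a small sketch. It uses two threads in one Python process and a `threading.Event`, which is only an analogy for the cross-component notifications in BCWeb; the function and key names are hypothetical.

```python
import threading


def run_event_driven():
    """Process B blocks until process A signals completion, with no polling."""
    done = threading.Event()
    result = {}

    def process_a():
        result["value"] = "task complete"
        done.set()  # push: wake the waiting process immediately

    def process_b():
        # Blocks without consuming CPU until A signals (or 5 seconds pass).
        done.wait(timeout=5)
        result["seen_by_b"] = result.get("value")

    b = threading.Thread(target=process_b)
    a = threading.Thread(target=process_a)
    b.start()
    a.start()
    a.join()
    b.join()
    return result
```

With polling, process "B" would instead check a status value in a loop with a sleep between checks, so on average it would learn of completion half a polling interval late.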
MSIT and the BCWeb development and operations team have observed a number of benefits from the refactoring of BCWeb background processes, including:
BCWeb is more reliable and available to the application's users. The application requires significantly less active monitoring and resolution of service-related errors. Database data remains more consistent, and communication between BCWeb components is more reliable and stable.
- Consistent user experience
The changes to BCWeb provide a more consistent user experience. Users experience fewer delays from timeouts and retries, and the application's performance has improved. Changes made to workflow and synchronous-task management also have decreased the number of restarts for workflow tasks because of component failure.
- Decreased complexity
The management of BCWeb functionality has been simplified. Application code is less complex, and the built-in retry and failover logic makes it much easier to troubleshoot potential issues with BCWeb functionality.
BCWeb is able to scale to meet application demand in a more graceful way after the code refactoring. Load balancing and workload-processing logic ensures that individual instances are not overloaded, and that they can respond appropriately to increased traffic.
- Cost Savings
While Microsoft IT did not directly gain measurable cost savings from the refactoring of BCWeb background processes, the changes in the resultant Windows Azure components did result in more efficient use of Windows Azure resources. The refactoring also was part of the complete BCWeb migration story, from on-premises to Windows Azure. Monthly costs for BCWeb on Windows Azure are $6,579.99 (U.S.), which saves approximately 30 percent compared with the on-premises monthly cost of $10,209 (U.S.).
The changes to the BCWeb background processes have resulted in improved instance recycling and more efficient process handoffs. New instances can be added quickly and effectively, and changes in the development environment can be translated to production more efficiently.
- Consistent development and test environment
The changes made to BCWeb result in a more reliable component base for subsequent changes to BCWeb functionality. Developers can focus specifically on the desired result of new additions to the application, and less on making changes to the underlying background processes to make their changes work.
The refactoring of BCWeb background processes provided a more stable application in Windows Azure for Microsoft IT. The application has shown increased reliability and performance, and requires less operations management and troubleshooting. The refactoring process has provided a more stable base on which Microsoft IT can, in the future, build additional functionality and improvements into BCWeb.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
© 2012 Microsoft Corporation. All rights reserved.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Windows, Windows Azure, SQL Azure and SQL Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.