Planning for Reliability and High Availability
Summary: This article describes ways to reduce or eliminate downtime in a Microsoft Commerce Server 2000 environment. It focuses primarily on the hardware and software needed to create a Commerce Server site with no single point of failure. (40 printed pages)
Designing a Highly Available E-Commerce Site
A Highly Available Commerce Server Architecture
Operating System Availability
Active Directory Availability
SQL Server Availability
Electronic commerce is a "mission critical" operation, and a significant source of revenue for many companies. When any part of an e-commerce site is unavailable, the company might well be losing money. This white paper describes ways to reduce or eliminate downtime in a Microsoft® Commerce Server 2000 environment.
Hardware failure, data corruption, and physical site destruction all pose threats to an e-commerce site that must be available close to 100 percent of the time. You can enhance the availability of your site by identifying services that must be available, then identifying the points at which those services can fail. Increasing availability also means reducing the probability of failure. Decisions about how far to go to prevent failures are based on a combination of your company's tolerance for service outages, the available budget, and the expertise of your staff. System availability directly depends on the hardware and software you choose, and the effectiveness of your operating procedures.
This white paper focuses primarily on the hardware and software needed to create a Commerce Server site with no single point of failure. However, operating procedures also can have a significant impact on service availability. To avoid service outages, you must carefully consider service availability for all operating procedures.
Availability is a function of whether a particular service is functioning properly. You can think of availability as a continuum, ranging from 100 percent (a completely fault-tolerant site that never goes offline) to 0 percent (a site that’s never available). All sites have some degree of availability. Many of today’s companies target "3 9s" availability (99.9 percent) for their Web sites, which means that there can be only approximately 8 hours and 45 minutes of unplanned downtime a year. Telephone companies in the United States typically target "5 9s" or 99.999 percent uptime (5 minutes and 15 seconds of unplanned downtime a year). Although any company might strive for additional uptime for its Web site, significant incremental hardware investment is required to get those extra "9s."
You can create an availability checklist to monitor the availability of your site. The availability checklist should contain the items listed in the following table.
|Bandwidth usage: per day, week, and month||
|Network availability||Network Internet Control Message Protocol (ICMP) echo pings (available from most network monitoring software).
Compare your network availability to the level agreed to in your Service Level Agreement (SLA) with your Internet service provider (ISP) or data center provider. Request improvement if network availability falls below the level agreed to in the SLA.
The formula for measuring network availability is as follows:
(Number of successful ping returns/number of total pings issued) x 100%
|Performance metrics (For more detailed information about monitoring performance, see the white paper titled Maximizing Performance.)||
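The network availability formula in the checklist above can be expressed as a short calculation. The following is a minimal sketch in Python; the function names and the sample ping counts are assumptions made for illustration and are not part of Commerce Server or any monitoring product.

```python
def network_availability(successful_pings, total_pings):
    """Return network availability as a percentage of successful ICMP echo replies."""
    if total_pings <= 0:
        raise ValueError("total_pings must be positive")
    if not 0 <= successful_pings <= total_pings:
        raise ValueError("successful_pings must be between 0 and total_pings")
    return successful_pings / total_pings * 100.0


def meets_sla(successful_pings, total_pings, sla_percent):
    """Return True if measured availability is at or above the SLA level."""
    return network_availability(successful_pings, total_pings) >= sla_percent


if __name__ == "__main__":
    # Hypothetical sample: one ping every 10 seconds for a day (8,640 pings), 3 lost.
    measured = network_availability(8637, 8640)
    print(f"Measured availability: {measured:.4f}%")
    print("Meets a 99.9% SLA:", meets_sla(8637, 8640, 99.9))
```

You would feed the counts from your network monitoring software into a calculation like this and compare the result with the level agreed to in your SLA.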
This section describes how to design a highly available e-commerce site architecture. Today’s Internet services require configurations that separate the user interface layer from business processing logic and data, for security reasons. Figure 1 shows how many e-commerce sites further isolate business logic from underlying data.
Figure 1. Multitiered e-commerce site architecture
In a multitiered configuration, such as that shown in Figure 1, client browsers access Web pages and the Web pages activate associated business logic hosted on a Web server. The persistent data that the business objects require is maintained in a separate database layer (bottom tier).
Data and state information control processing logic and client experience. The following table describes typical strategies you can use at each tier to monitor data and state information.
|User interface layer (top tier)||
|Business logic layer (middle tier)||
|Database layer (bottom tier)||
You design availability into a Web site by identifying services that must be available, determining where those services can fail, and then designing the services so that they continue to be available to customers, even if a failure occurs. There are three fundamental strategies you can use to design a highly available site:
- Ensure that operational procedures are well documented and appropriate for your goals and the capabilities of your staff.
- Ensure that your site has enough capacity to handle processing loads.
- Reduce the probability of failure.
One of the most effective means of ensuring site availability can also be inexpensive to implement: creating well-documented and accurate operational procedures.
Operational procedures should include the following:
- Change management. (For more information, see the white paper titled Developing Your Site.)
- Service-level management. (For more information, see the white paper titled Managing Your Site.)
- Problem management. (For more information, see the white paper titled Problem Management.)
- Capacity management. (For more information, see the white paper titled Maximizing Performance.)
- Security management. (For more information, see Designing Secure Web-Based Applications for Microsoft Windows 2000 by Michael Howard, located online at http://mspress.microsoft.com/prod/books/4293.htm.)
- Availability management. (This white paper discusses availability issues.)
Microsoft has created a knowledge base called the Enterprise Services frameworks (Microsoft Readiness Framework, Microsoft Solutions Framework, and Microsoft Operations Framework) to describe industry experience and best practices for such procedures.
There is also a wealth of procedural best practices available in other locations. For references to additional information, see "Related Resources" later in this white paper.
When you have a stable set of operational procedures, you can begin to explore ways of improving hardware and software availability. System availability doesn’t depend only on how redundant your hardware and software systems are. All of the elements described in the following table determine availability.
Site services can become unavailable if site traffic exceeds capacity. Site services can also become less reliable after operating for prolonged periods at peak load. You can scale your server farm to accommodate increased site traffic and to maintain site performance in a cost-effective manner. For detailed information about how to scale a site, see the white paper titled Planning for Scalability.
To design a highly available site, you must understand potential causes of failure and take steps to eliminate them. The following list contains some of the more common types of failure and the elements that can cause the failure:
- Application software: Inferior code quality, vulnerability to service attacks, and platform dependencies that aren’t met
- Climate control: Malfunctioning air-conditioning units or heating units
- Data: Data corruption
- Electrical power: Malfunctioning power-conditioning units, UPSs, or generator sets
- Hardware: Degraded memory chips, and malfunctioning CPU, disk hardware, disk controllers, or power supplies
- Network: ISPs not complying with service agreements, and malfunctioning routers, firewalls, or network cards
- Security: Firewalls, networks, and Web applications not working properly, and attacks from hackers on the Internet
The following table describes failure-reduction techniques for each common type of failure.
|Type of failure||Failure-reduction techniques|
|Application||Create a robust architecture based on redundant, load-balanced servers. (Note, however, that load-balanced clusters are different from Windows application clusters. Commerce Server components aren’t designed to be aware of application clusters.)
Review code to avoid potential buffer overflows, infinite loops, code crashes, and openings for security attacks.
|Climate control||Maintain the temperature of your hardware within the manufacturer’s specifications.
Excessive heat can cause CPU meltdown and excessive cold can cause failure of moving parts, such as fans or disk drives.
Maintain humidity control.
Excessive humidity can cause electrical short circuits from water condensing on circuit boards. Excessive dryness can cause static electricity discharges that damage components when you handle them.
|Data||Conduct regular backups. In addition to regular backups, archive backups offsite. For example, you can archive every fourth regular backup offsite, to save space.
If your data becomes corrupted, you can restore the data from backups to the last point before the corruption occurred. If you also back up transaction logs, you can then apply the transaction logs to the restored database to bring it up-to-date.
Replay transaction logs against a known valid database to maintain data. This technique is also known as "Log Shipping to a warm backup server." This technique is useful for maintaining a disaster-recovery site (also known as a "hot site").
Deploy Windows Clustering.
Commerce Server uses data stores such as SQL Server and the Active Directory™ directory service. SQL Server provides access to data and services such as catalog search. SQL Server uses Windows Clustering to provide redundancy. Active Directory provides access to profile data and can provide authentication services. Active Directory uses data replication to provide redundancy.
In general, clustering is more effective for dynamic (read/write) data, and data replication is more effective for static (read-only) data.
Minimize the probability and impact of a SQL Server failure by clustering SQL Server servers or by replicating data among SQL Server servers.
If you are using Microsoft SQL Server 7.0, the full-text search feature is available only in a non-clustered configuration, so you must use a replication strategy for the product catalog. In addition, SQL Server 7.0 is not supported for high-availability configurations due to issues with Microsoft Data Access Components (MDAC) 2.6 and Windows Clustering.
Microsoft SQL Server 2000 is fully supported for high-availability configurations.
If you use Active Directory, back up Active Directory stores. (You can do this while Active Directory is online.)
Use at least two Active Directory domain controllers, with a replication schedule appropriate to your requirements. Restoring a domain controller can be time-consuming and requires that the domain controller be offline. If you have peer domain controllers, you can minimize downtime if you must restore your site from backups.
|Electrical power||Use UPSs. Because UPSs are typically battery powered, they are useful only for outages that last for short periods of time. Be sure to use a UPS whose power rating is at least equal to the load of the equipment it protects.
Use power generators as secondary backups to the UPSs. You can use generators for an indefinite period of time because they are fuel powered (diesel or gasoline) and you can refuel them if necessary.
|Network||Implement network redundancy with any combination of the following:
Use multiple NICs, multiple routers, switches, LANs, or firewalls.
Contract with multiple ISPs, or set up identical equipment in geographically dispersed locations.
|Security||Contract an independent security audit firm to evaluate your environment.
Deploy intrusion-detection tools.
Deploy multiple firewalls.
For the latest strategies and techniques for handling security issues, see http://www.microsoft.com/windows2000/guide/server/features/securitysvcs.asp.
|Server||Deploy redundant, load-balanced servers. Single-IP solutions increase site capacity by distributing HTTP requests proportionally according to each server’s capacity for handling the required load. In addition, when you use the single-IP solution, you make sure that users are referred only to operating servers. There are many single-IP solutions available to help you load-balance your servers:
Microsoft Windows 2000 Advanced Server and Datacenter Server editions both provide a Network Load Balancing (NLB) service.
Microsoft Application Center 2000 provides NLB enhancements (Request Forwarder) to support many users sharing a single IP address.
Hardware-based load-balancing solutions.
|Hardware||Deploy redundant hardware components, such as the following:
Use redundant array of independent disks (RAID) disk arrays, disk mirroring, and dual disk controllers to minimize disk failures. There are also a number of excellent third-party solutions for reducing downtime related to disk failure. For more information about third-party solutions, see the "Related Resources" section later in this white paper.
Use a redundant disk controller.
Use redundant Fibre Channel host bus adapters and switches (for storage area network [SAN] configurations). In the event of an adapter or switch failure, the backup adapter or switch provides an alternate path to the SAN.
The following table lists a number of tools and strategies for reducing downtime due to hardware failures. An X in a column means you can use the tool to prevent the indicated type of failure.
|Tool or strategy||Application||Data||Network||Server|
|Dual disk controllers||X||X|
|Dual power supplies||X||X||X||X|
|Geographically dispersed data centers||X||X||X||X|
|RAID disk arrays||X||X|
Availability is a continuum that becomes increasingly expensive as you approach 100 percent availability. You must decide what trade-offs and compromises to make to fit your budget. The following table provides a sample framework to help you calculate the benefits of implementing the listed failure prevention strategies. The numbers used in the table are only a guideline. Use your own data and judgment to create a risk-assessment table for your site.
The table lists the types of failures that can occur and the effect of the failure, followed by a calculation of the relative probability number (RPN), using the following formula:
RPN = Likelihood of occurrence x Detectability x Severity
Likelihood of occurrence is the number of times an error is expected to occur (from 1 to 10; the higher the number, the more likely the error is to occur). In the following table, the "O" column represents this value.
Detectability is the ease with which a failure can be found (from 1 to 10; the higher the number, the harder the failure is to detect). In the following table, the "D" column represents this value.
Severity is the degree to which the failure will affect the site (from 1 to 10; the higher the number, the more serious the failure and the more severe the outage). In the following table, the "S" column represents this value.
|CPU (dual)||Application processing||Server might go offline||2||4||7||56||
|CPU (single)||Application processing||Server offline||2||4||10||80||
|Drives||Application and data storage||Server offline||5||4||10||200||
|Firewall||Protection from intrusion and hacking||Information stolen, site altered, or site made inaccessible||4||4||8||128||
|Memory||Application processing||Server offline||2||4||7||56||
|NIC||Network connectivity||Server offline||4||4||8||128||
|Power supply||Power equipment||Site offline||4||4||10||160||
|RAID controller||Data storage||Server offline||2||4||10||80||
|Router/Channel Service Unit (CSU)||Connect to partners and customers||
|SQL Server cluster||Data storage||Single-server failure, resulting in slower service||5||4||2||40||
|Switch||Connect to network||Some or all devices offline||4||4||10||160||
|Web server||Serve Web site application to customers||Site offline||5||4||10||200||
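The RPN formula above lends itself to a small script that scores each failure mode and ranks them so that the highest-risk items are addressed first. The following Python sketch uses three rows from the table as sample data; the class and function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    occurrence: int     # "O" column, 1 to 10
    detectability: int  # "D" column, 1 to 10
    severity: int       # "S" column, 1 to 10

    def __post_init__(self):
        for name in ("occurrence", "detectability", "severity"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be from 1 to 10, got {value}")

    @property
    def rpn(self):
        # RPN = Likelihood of occurrence x Detectability x Severity
        return self.occurrence * self.detectability * self.severity


def rank_by_risk(failure_modes):
    """Return failure modes sorted from highest RPN to lowest."""
    return sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)


if __name__ == "__main__":
    sample = [
        FailureMode("SQL Server cluster", 5, 4, 2),
        FailureMode("Drives", 5, 4, 10),
        FailureMode("Firewall", 4, 4, 8),
    ]
    for fm in rank_by_risk(sample):
        print(f"{fm.component}: RPN {fm.rpn}")
```

Replace the sample values with your own estimates when you build a risk-assessment table for your site.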
You can make the business logic layer (middle tier) more resilient by using load-balanced Web clusters to protect servers, services, and the network against failures. You can use load-balanced clusters to remove an unresponsive server from the cluster so that users won’t be directed to a faulty server, and so that the unresponsive server can be repaired. You can combine Round Robin Domain Name System (RRDNS) with load balancing to produce a scalable and available configuration.
All the nodes in a load-balanced cluster must be on the same LAN subnet and all the nodes should refer to the same IP address. On a site with multiple Web clusters, you must configure multiple load-balanced clusters on different subnets and configure the Domain Name System (DNS) to sequentially distribute requests across multiple load-balanced clusters to increase scalability.
You can make the database layer (bottom tier) more resilient by using a combination of disk redundancy and a sound backup and restoration strategy to protect your data. You can use any of the following methods to make database services more resilient:
- Clustering. In Windows Clustering, two servers can share common data and work together as a single system. Windows Datacenter Server supports four servers (nodes) in a cluster. Each node can operate independently of the others.
- Replication. You can use SQL Server replication to ensure synchronization among your database servers. SQL Server offers replication options such as snapshot replication and transactional replication. Active Directory also uses replication to ensure redundancy.
- Warm backups. You can use a single production server to provide read/write access to data, logging all transactions. You then use SQL Server Log Shipping to transfer files to a non-production server that is continuously updated with the transaction log files.
Data must be backed up or replicated so that it can be recovered if it is accidentally deleted or otherwise lost. The following table describes the two types of data replication.
|Type of data replication||Characteristics||Use when|
|Active||Shares a part of the workload of the primary site and is always online.||Data is extremely critical and the site must always be available.|
|Passive||Inactive until a disaster takes the primary site out of operation.||Your site can tolerate brief interruptions in data availability while the backup site comes online.|
Commerce Server uses a full range of options for protecting against network, server, and disk failures. This section describes strategies for constructing small and large highly available Commerce Server sites. You can eliminate any single point of failure by:
- Using multiple instances of servers hosting the business logic layer, fronted by a single-IP solution.
- Clustering SQL Server servers to host data.
- Replicating Active Directory (if used).
Small Commerce Server Configuration
The smallest redundant architecture for Commerce Server separates the business logic layer (middle tier) from the servers in the database layer (bottom tier). A four-server configuration has two identical servers in each tier. You put a single-IP solution in front of the servers in the business logic layer, to increase availability and to hide site complexities from users. You use SQL Server clustering for the database layer, to increase availability.
All business logic runs on both of the stateless Web servers in the business logic layer in this configuration. An architecture in which all Commerce Server business logic runs on a single server is useful because it maximizes cache usage for a user session. If a user is directed to a different Web server on the server farm, the new server merely retrieves data and the user continues shopping. Commerce Server components persist their state in back-end databases, cookies, or URL query strings, so being directed to a different physical server has minimal impact on a customer’s shopping experience.
If your site requires a Windows security context for other applications or to protect file system content, you should have at least two domain controllers for each domain on your site. It is important that both domain controllers have consistent data, so the smallest, highly available configuration that uses Active Directory for authentication requires two servers at the business logic layer (middle tier), two servers for the database layer (bottom tier), and two servers for Active Directory (a total of six servers).
Active Directory uses a replication strategy to ensure consistency among domain controllers in a single domain. The Active Directory processing load is minimal relative to the processing load that the authentication filter generates, so you can also have Active Directory on the same cluster as SQL Server, if necessary. Figure 2 shows a small Commerce Server configuration.
Figure 2. Small Commerce Server configuration
Large Commerce Server Configuration
The simplest way for Commerce Server sites to increase capacity is to add Web servers and move component databases onto separate database servers. You then place a single-IP solution in front of the Web services of the business logic layer (middle tier), to increase availability and to hide site complexities from customers. You can use SQL Server clustering, SQL Server replication, or a combination of the two to increase the availability of the database layer.
In a large Web farm, all business logic can also run on the stateless Web servers. However, in some environments, you might want to isolate some functions from the site shopping functions.
For example, you might prefer to run pipeline components on a separate Web farm, if the components are particularly CPU intensive (such as components that encrypt large files) or if they require access to secure data (such as components that compute insurance rates while processing social security number, credit history, or medical reports). An example of a CPU-intensive function that you might want to run on a separate server is the generation of analysis models.
You can distribute Commerce Server databases across multiple servers to increase capacity. You should base your availability decisions on the expected usage of each database.
The following table defines typical availability measurements, to help you decide what level of availability you need for your site.
|Availability target||Seconds of downtime per year||Downtime per incident (assuming four incidents per year)|
|99.9999%||31.536 (approximately ½ minute)||7.5 seconds|
|99.9990%||315.36 (approximately 5 minutes)||1.25 minutes|
|99.9900%||3153.6 (approximately 1 hour)||15 minutes|
|99.9000%||31536 (approximately 9 hours)||2.25 hours|
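The figures in the preceding table follow directly from the number of seconds in a year. The following Python sketch reproduces the calculation; it assumes a 365-day year, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def downtime_seconds_per_year(availability_percent):
    """Return the yearly downtime, in seconds, allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


def downtime_seconds_per_incident(availability_percent, incidents_per_year=4):
    """Split the yearly downtime budget evenly across the expected incidents."""
    if incidents_per_year <= 0:
        raise ValueError("incidents_per_year must be positive")
    return downtime_seconds_per_year(availability_percent) / incidents_per_year


if __name__ == "__main__":
    for target in (99.9999, 99.999, 99.99, 99.9):
        yearly = downtime_seconds_per_year(target)
        per_incident = downtime_seconds_per_incident(target)
        print(f"{target}%: {yearly:.3f} s per year, {per_incident:.3f} s per incident")
```

Note that the table rounds the yearly totals before dividing them by four. For example, the exact budget for a 99.9 percent target is about 8.76 hours a year, or roughly 2.19 hours per incident, which the table presents as approximately 9 hours and 2.25 hours.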
The following table describes the usage profile and impact of failure for each Commerce Server feature. Note that the availability targets shown in the table are typical examples. You might choose different availability targets for your site.
|Feature||Usage profile||Failure impacts||Availability target|
||Shopping not available||99.99%|
||Degraded browsing experience||99%|
|Data Warehouse cubes (data structures)||
||Reports not available||99.9%|
|Data Warehouse (import)||
||Data not current||99%|
||Customer authorization not available; shopping by anonymous users only||99.9%|
||Degraded shopping experience||99%|
||Shopping not available||99.999%|
|User Profile Management||
||Customer authorization not available; no personalized content; shopping by anonymous users only||99.9%|
You can use SQL Server or Active Directory for authentication on a large site. Either can support shopping by anonymous users. Figure 3 shows an example of a large site configuration.
Figure 3. Large Commerce Server configuration
Figure 3 shows a single Active Directory domain with two domain controllers. The database layer (bottom tier) is shown as being largely run on Windows Clustering. However, this limits you to four nodes in a cluster. If you require more capacity, you should consider the following strategies for the database layer:
- Catalog. Add servers to a single-IP solution. Synchronize data by using SQL Server replication.
- Order Form. Create multiple data partitions and distribute transactions across the partitions. For example, you can implement code to hash a shopper’s globally unique identifier (GUID) to one of the partitions and then store the shopper's order form on that partition.
- Profile store. Create multiple data partitions and distribute user data across the partitions by using the Profile Manager’s hashing scheme. Note that with this architecture, you should create partitions when you set up the site. If you add partitions to a site with existing user profiles, you have to unload and reload user data to be sure that it is stored in the proper partition. Partitioning is useful for both SQL Server and Active Directory profile stores.
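The hashing approach described for the Order Form and Profile store partitions can be sketched as follows. This Python example is only an illustration of the idea: it uses SHA-256 and a simple modulo, which is an assumption and not the Profile Manager's actual hashing scheme, and the function names are hypothetical. It also shows why adding partitions later forces you to unload and reload data.

```python
import hashlib
import uuid


def partition_for(shopper_guid, partition_count):
    """Map a shopper GUID to a partition index from 0 to partition_count - 1."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    # Normalize the GUID (case, braces, hyphens) so equal GUIDs hash identically.
    canonical = uuid.UUID(shopper_guid).bytes
    digest = hashlib.sha256(canonical).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def moved_after_resize(guids, old_count, new_count):
    """Count how many GUIDs map to a different partition after a resize."""
    return sum(
        1 for g in guids
        if partition_for(g, old_count) != partition_for(g, new_count)
    )


if __name__ == "__main__":
    # Hypothetical shopper GUIDs generated for the demonstration.
    shoppers = [str(uuid.uuid5(uuid.NAMESPACE_DNS, f"shopper-{i}")) for i in range(1000)]
    print("Partition for first shopper:", partition_for(shoppers[0], 4))
    print("Shoppers that move when going from 4 to 5 partitions:",
          moved_after_resize(shoppers, 4, 5))
```

With a modulo-based scheme like this one, most existing shoppers map to a different partition when the partition count changes, which is why it is simpler to create the partitions when you set up the site.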
Scaling for Active Directory requires a multi-forest design with trusts established from the domains in the user forests to the domain in the resource forest. For more information about Active Directory, see http://www.microsoft.com/windows2000/.
Commerce Server Component Design Considerations
The following table contains examples of uptime requirements for Commerce Server functions. Your requirements might be different from those shown in the table; however, you can use the examples as starting points for creating your site architecture.
|Commerce Server function||Target uptime||Failure classification|
|Authentication||99.99%||One retry acceptable|
|Basket view||99.999%||No acceptable failure rate|
|Catalog query||99.99%||Two retries acceptable|
|Purchase pipeline||99.999%||No acceptable failure rate|
|Show advertisement||99%||Don’t retry|
For examples of business logic that you can use to protect against loss of database connectivity, see "Retry Code Logic" later in this white paper.
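As a rough illustration of how the failure classifications in the preceding table might translate into code, the following Python sketch wraps an operation in a retry loop whose budget depends on the function being called. It is not the white paper's own retry logic; the names, the choice of ConnectionError as the retryable error, and the mapping of functions to retry counts are assumptions made for the example.

```python
import time

# Hypothetical retry budgets taken from the "Failure classification" column.
RETRY_BUDGET = {
    "authentication": 1,      # one retry acceptable
    "catalog_query": 2,       # two retries acceptable
    "show_advertisement": 0,  # don't retry
}


def call_with_retries(operation, retries, delay_seconds=0.0,
                      retry_on=(ConnectionError,)):
    """Call operation(), retrying up to `retries` times on a retryable error."""
    attempts = retries + 1
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # retry budget exhausted; let the caller handle the failure
            time.sleep(delay_seconds)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_catalog_query():
        # Simulated query that loses its database connection on the first call.
        state["calls"] += 1
        if state["calls"] == 1:
            raise ConnectionError("database connection lost")
        return ["product-1", "product-2"]

    print(call_with_retries(flaky_catalog_query, RETRY_BUDGET["catalog_query"]))
```

Functions with no acceptable failure rate, such as the basket view and the purchase pipeline, need redundancy at the database layer rather than a retry budget alone.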
You use the Commerce Server Administration database to store configuration information for Commerce Server resources, including global resources shared by all sites, in addition to site-specific resources. Connections from objects used to access the database are short-lived and the Global.asa file caches data for later reference. After information is cached, the database connection is no longer needed.
You can make the Administration database redundant with Windows Clustering. You can use either of the following types of configurations to provide redundancy:
- Active-active configuration. All nodes are running and hosting a database.
- Active-passive configuration. One node is running and hosting a database; all other nodes are idle (in backup mode), ready to run and host the database if the active node fails.
When the IIS application is loaded, the Application_OnStart subroutine is called to load the Administration database into the cache. The Application_OnStart subroutine loads application variables by using the SiteConfigReadOnly object, which requires a connection to the Administration database.
If the Application_OnStart subroutine fails (for example, if the SQL Server server is failing or simply is not available), you must do one of the following to reload the application variables:
- Run an ASP page that runs all initialization performed by the Global.asa file. This is the least intrusive option. You should use access control lists (ACLs) to protect the ASP page so that only authorized personnel or processes can run it.
- Unload the Commerce Server application from IIS Manager.
- Run the IISReset command; this option is intrusive because it resets all ASP applications running on the IIS server.
- Restart the Web server; this is the most intrusive option, because it affects all applications and services running on the Web server.
You can script all of these options.
You can use the Commerce Server Profiling System to aggregate data from multiple data stores, such as Active Directory (Lightweight Directory Access Protocol [LDAP] version 3.0), SQL Server, or Microsoft Site Server 3.0 Membership Directory (LDAP version 3.0). This enables you to store data in the data store that’s most appropriate for the usage profile. Directory services are usually optimized for read-only operations, because of inherent security features. SQL Server is optimized equally well for read and write operations, but has fewer inherent security features. You should use Active Directory to store data that is mostly static throughout a user’s visit to your site, or data that must be secured with ACLs. Use SQL Server to store data that is volatile throughout a user’s visit.
For example, Active Directory is based on a hierarchical LDAP model, which is excellent for fast, single-item retrieval. Attributes stored in Active Directory are easily available to other applications that might be running in the data center. Examples of attributes that might best be stored in Active Directory are user name, address, and city.
In contrast, Commerce Server–specific attributes that are volatile, such as date of last purchase, should be stored in a SQL Server database to take advantage of the capacity of SQL Server to handle high rates of updates. Data availability can vary significantly, depending on what combination of data stores you use to maintain the data.
To generate a large-scale, highly available User Profile Management back end, you should use a hash-based partition cluster of SQL Server servers based on Windows Clustering. Hashing assigns a user to a particular User Profile Management partition. Figure 4 shows how you might set up a hash-based partition cluster of SQL Server servers.
Figure 4. Hash-based partition cluster of SQL Server servers
Each User Profile Management server instance shown in Figure 4 is actually running Windows Clustering, as shown in Figure 5. Figure 5 shows Windows Clustering in an active-passive configuration with a shared disk.
Figure 5. Windows Clustering configuration
Note: If you use the partitioned architecture described in this section, create partitions before you begin. If you add partitions after your site contains user data, you will have to remove and restore user data to be sure that it is stored in the correct partition.
The Profiling System opens a database connection, reads configuration information, and builds a profile configuration cache from that information. Connections to the database are pooled and have built-in retry logic if a connection is lost.
Integrating User Data from Third-Party Data Stores
You can use the Profiling System to integrate data from third-party stores (such as data residing in an Oracle database). Depending on what you want to do, it might be cost-effective to set up a highly available configuration for the third-party data store. Be aware, however, that if a read of any of the underlying data sources fails, the entire operation fails. You can set up redundancy for third-party data stores by creating a virtual server from the multiple data stores. The data source name (DSN) that the Profiling System uses should then be established with the virtual server.
Aggregating User Data from Multiple Stores
Whenever you must place data into different data stores, there is a risk that one or more of the stores will be unavailable. Figure 6 shows an unavailable data store.
Figure 6. Unavailable data store
To protect against this situation, the Profiling System supports the notion of a loose transaction. A loose transaction is one in which writes and updates to transacted stores (such as SQL Server and Oracle) aggregated by the Profiling System are committed only if operations to non-transacted stores (Active Directory) succeed.
The following logic example describes how Commerce Server processes loose transactions in a site in which Active Directory, SQL Server, and Oracle stores are all used to store user profile attributes. Loose transactions are enabled by the isTransactioned attribute of the profile definition. The logic is as follows:
Start transaction for SQL Server Write changes Start transaction for Oracle Write changes Write changes to Active Directory_1 Commit changes_2 Commit changes_3 If 1 fails, then 2 & 3 abort. If 1 succeeds and 2 fails, then Active Directory store is inconsistent. If 1 & 2 succeed, 3 fails, then SQL Server is inconsistent.
The Profiling System does not support two-phase commits.
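The loose-transaction ordering can be illustrated with a small Python simulation. This is a minimal sketch, not Commerce Server code: the Store and NonTransactedStore classes, the loose_transaction function, and the returned status strings are all hypothetical names invented for illustration. The sketch also assumes that step 2 is the SQL Server commit and step 3 is the Oracle commit, following the order in which the transactions are started.

```python
class Store:
    """Hypothetical transacted store (stands in for SQL Server or Oracle)."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on   # "write" or "commit": operation that fails, for demonstration
        self.committed = {}
        self.pending = {}

    def _maybe_fail(self, operation):
        if self.fail_on == operation:
            raise RuntimeError(f"{self.name}: {operation} failed")

    def begin(self):
        self.pending = {}

    def write(self, changes):
        self._maybe_fail("write")
        self.pending.update(changes)

    def commit(self):
        self._maybe_fail("commit")
        self.committed.update(self.pending)
        self.pending = {}

    def rollback(self):
        self.pending = {}


class NonTransactedStore(Store):
    """Hypothetical non-transacted store (stands in for Active Directory)."""

    def write(self, changes):
        self._maybe_fail("write")
        self.committed.update(changes)   # takes effect immediately, no rollback


def loose_transaction(sql, oracle, active_directory, changes):
    """Return 'committed', 'aborted', or 'inconsistent'."""
    transacted = [sql, oracle]
    try:
        for store in transacted:
            store.begin()
            store.write(changes)
        active_directory.write(changes)          # step 1
    except RuntimeError:
        for store in transacted:                 # step 1 failed: steps 2 and 3 abort
            store.rollback()
        return "aborted"
    for position, store in enumerate(transacted):
        try:
            store.commit()                       # step 2, then step 3
        except RuntimeError:
            for remaining in transacted[position:]:
                remaining.rollback()
            return "inconsistent"                # no two-phase commit to repair this
    return "committed"
```

Because there is no two-phase commit, the "inconsistent" result cannot be undone automatically; a caller would have to log it and reconcile the stores by other means.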
Product Catalog System
In the Product Catalog System, you can make the Catalogs database redundant by installing it on multiple Web servers fronted by a single-IP solution, such as the solutions described for the following site configurations:
- Small site. Use either active-active or active-passive clustering. This configuration will support two nodes running Windows 2000 Advanced Server or four nodes running Windows 2000 Datacenter Server.
- Large site. Use NLB to cluster SQL Server servers and to provide a single virtual database name for the Catalogs database business logic to access.
Alternatively, you can use SQL Server replication (either snapshot or transactional, depending on your requirements) to maintain data consistency among SQL Server servers. Figure 7 shows how a configuration using NLB might look. In practice, NLB has been tested with a cluster containing as many as eight SQL Server servers.
Figure 7. Catalogs database cluster using Network Load Balancing
- Very large site. Put a single-IP solution in front of multiple NLB clusters. Figure 8 shows what this configuration might look like.
Figure 8. Catalogs database for a very large site
The Product Catalog System maintains a continuous connection to database tables. If the connection is lost due to network or database problems, the Product Catalog System will attempt to reestablish the connection.
The Targeting System business logic caches all of its working information, including discounts, advertisements, shipping information, and tax information. It caches the Predictor resource Binary Large Object (BLOB). A refresh timer, specified programmatically by the RefreshInterval property of the CacheManager object, controls the frequency with which CacheManager refreshes a particular cache object. A connection to the database is created each time the cache is refreshed. An existing cache is not discarded until the refresh of the cache has successfully completed. If you absolutely must have an up-to-date cache (or when you start up and there is no cache), you can implement retry logic if a database is unavailable during a refresh operation.
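The refresh behavior described above (keep the existing cache until a refresh succeeds, and retry when the database is unavailable) can be sketched in Python. This is an illustration under stated assumptions, not the CacheManager implementation: the RefreshingCache class, its loader callback, and the retries and delay_seconds parameters are hypothetical names.

```python
import time


class RefreshingCache:
    """Hypothetical cache that swaps in new data only after a successful load."""

    def __init__(self, loader, retries=3, delay_seconds=1.0, sleep=time.sleep):
        self._loader = loader          # callable that reads from the database
        self._retries = retries
        self._delay = delay_seconds
        self._sleep = sleep            # injectable so tests do not really wait
        self._data = None

    def refresh(self):
        """Try to load fresh data; return True if the cache was replaced."""
        for attempt in range(1, self._retries + 1):
            try:
                fresh = self._loader()
            except ConnectionError:
                if attempt < self._retries:
                    self._sleep(self._delay)
                continue
            self._data = fresh         # the old cache is discarded only now
            return True
        return False                   # the old cache, if any, is still in place

    def get(self):
        """Return cached data, loading it first if no cache exists yet."""
        if self._data is None and not self.refresh():
            raise LookupError("no cached data and the database is unavailable")
        return self._data
```

A failed refresh leaves readers on slightly stale data, which is usually preferable to failing the page. Only the cold-start case, in which no cache exists yet, surfaces an error to the caller.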
You can install the Targeting System on multiple Web servers and make it highly available by using a load-balanced cluster of SQL Server servers. Figure 9 shows the smallest database solution that you can use to get the clustering results you want.
Figure 9. Clustering the Targeting System
Commerce Server Direct Mailer is a batch process, so it has lower priority than the production components of Commerce Server, such as the Transactions database. Direct Mailer has a higher tolerance for unplanned outages than other Commerce Server features. However, you must manually restart Direct Mailer after a SQL Server server failure to continue from the point at which it was interrupted.
Direct Mailer depends on the SQL Server Agent to run its scheduled jobs, so you must run Direct Mailer on a server that is also running SQL Server. The SQL Server Agent runs a command line that activates a process to create and start a job. That process then runs another service that does the work. If the active Direct Mailer database server fails, the connection will be reestablished with the newly promoted backup SQL Server server when the next SQL Server Agent job starts.
The SQL Server Agent is cluster-aware, so you can install the Direct Mailer database in a Windows Clustering clustered configuration. Because of the architecture of Direct Mailer, you must use an active-passive configuration. Direct Mailer must be installed on the same server as its database, so in a clustered configuration, you must install Direct Mailer twice. Although you can have only one Direct Mailer per site (as the Direct Mailer global resource specifies), you can install Direct Mailer on multiple servers.
You must install the Direct Mailer database on a local instance of SQL Server. You can use SQL Server 2000 local instance setup to do this. After you have installed the Direct Mailer database, you can move it to the virtual cluster SQL Server instance.
Direct Mailer opens a database connection at startup to recover any interrupted jobs. It also opens a database connection for each job. If the database connection is lost, the executing job will fail. Jobs that did not run because Direct Mailer was not working must be manually restarted when the Direct Mailer restarts.
Each job tracks its progress through a mailing list. However, depending on when a failure occurs, the last message sent before the failure can be sent a second time when Direct Mailer restarts. You can configure Direct Mailer to use multiple threads, and it is possible that one duplicate message for each thread will be sent when Direct Mailer restarts. (The number of threads, and therefore the number of possible duplicate messages, usually equals the number of processors.)
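The following Python sketch models why a single duplicate can occur. It assumes, purely for illustration, that a job sends a message and then records its position in the list; the run_job function, the checkpoint dictionary, and the crash_after_send parameter are hypothetical and do not describe Direct Mailer internals.

```python
def run_job(recipients, checkpoint, send, crash_after_send=None):
    """Send to each recipient, resuming from checkpoint['next'].

    Progress is recorded after each send, so a failure between the send and
    the progress update causes that one message to be sent again on restart.
    """
    for index in range(checkpoint["next"], len(recipients)):
        send(recipients[index])
        if crash_after_send == index:
            raise RuntimeError("simulated failure before progress was saved")
        checkpoint["next"] = index + 1


# Usage: the job fails right after sending to "b", then is restarted manually.
sent = []
progress = {"next": 0}
try:
    run_job(["a", "b", "c"], progress, sent.append, crash_after_send=1)
except RuntimeError:
    pass
run_job(["a", "b", "c"], progress, sent.append)
```

With several worker threads, each thread has its own last message in flight, which is why the worst case is one duplicate per thread.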
Business Process Pipelines
You can install pipeline components on multiple Web servers and make them highly available by using a single-IP solution. With NLB, you can create a single virtual server from all the servers on which pipeline components are installed. If there is a problem, the server that is not working is automatically removed from the cluster and the load is distributed to the functioning servers in the cluster. Figure 10 shows how you might distribute pipeline components.
Figure 10. Distributed pipeline components
If a pipeline component is CPU intensive, you can move it to a separate server. There is an expense for sharing data between servers, but that expense might be acceptable if the component will do a great deal of processing. For example, a digital media site might encrypt music with a user’s public key to prevent piracy. Encryption is CPU intensive, so it could be worthwhile to move that processing to a separate server. If you did so, you could cluster the servers where the encryption component is running and activate them by using one of the Web servers running other parts of the pipeline. Figure 11 shows how you might separate CPU-intensive services from other services.
Figure 11. Separating CPU-intensive services
You use Component Object Model + (COM+) pipeline components to access the Order Processing Pipeline (OPP). Components are pooled, but there is an initialization, reset, and recycle interface that is called when the OrderForm object is recycled into the pool. You can use that interface to release connections, free memory, and so on, when the OrderForm object is returned to the pool, and to initialize database connections when components are requested.
The QueryCatalogInfo component behaves differently. QueryCatalogInfo retrieves product information by invoking a CatalogManager object passed as part of the Context object. The CatalogManager object holds a connection to the database. If the connection is lost, CatalogManager will reestablish the connection to the database.
The Data Warehouse has two components: offline and online. These two components have very different availability requirements. A very large database that contains imported raw data should be unavailable for querying (offline) while data is being imported. If necessary, it can be clustered. Alternatively, if you don’t require a high degree of availability for the database, you can back it up on a regular basis to tape or other permanent media.
Online analytical processing (OLAP) cubes are data structures that the Data Warehouse uses to contain the data you import. (For more information, see Commerce Server 2000 Help or SQL Server Books Online.) Figure 12 shows how you can make OLAP highly available by using a clustered configuration. If a node in the cluster fails, the other node will become the primary node.
Figure 12. Clustering the Data Warehouse
The Data Warehouse opens a database connection for each import and parser Data Transformation Services (DTS) task. The connection is maintained for the duration of the process and then dropped. If the database becomes unavailable during the process, the process will fail, but you can restart it. If a failure occurs during import processing, you must manually delete any data that was only partially imported and restart all import DTS tasks. If a failure occurs during cube generation, you must manually restart the post-import DTS task.
Operating System Availability
Windows 2000 is designed to be highly available. It contains numerous improvements that provide increased reliability, including Windows file protection, driver certification, application certification, kernel-mode write protection, application DLL protection, a recovery console, and safe start mode. In addition, key services like IIS and NLB have been substantially enhanced to support highly available architectures.
Several types of failures, such as "system hangs," "no-logon," and "network dead," can be attributed to resources running consistently above healthy performance thresholds. "Blue screens" and application failures have also been experienced in certain disk-full conditions. To mitigate these failures on a highly available Windows 2000 platform, you should monitor the system thresholds listed in the following table.
|Metric||30-day average||Four-hour average||Peak|
|Average available bytes||10 MB||5 MB||5 MB|
|Disk queue length||1 per non-redundant spindle||2 per non-redundant spindle||3 per non-redundant spindle|
|Network interface using |
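As a rough illustration, the memory and disk rows of the table can be turned into simple alert checks. The sketch below is not a Windows monitoring API: the function and dictionary names are hypothetical, and it assumes that available memory is a floor (alert when the value drops below it) and that disk queue length is a ceiling that scales with the number of non-redundant spindles.

```python
# Thresholds taken from the memory and disk rows of the table.
AVAILABLE_MB_FLOOR = {"30-day": 10, "four-hour": 5, "peak": 5}
DISK_QUEUE_CEILING_PER_SPINDLE = {"30-day": 1, "four-hour": 2, "peak": 3}


def memory_alert(window, available_mb):
    """True when available memory is below the floor for the given window."""
    return available_mb < AVAILABLE_MB_FLOOR[window]


def disk_queue_alert(window, queue_length, non_redundant_spindles):
    """True when the disk queue exceeds the per-spindle ceiling for the window."""
    ceiling = DISK_QUEUE_CEILING_PER_SPINDLE[window] * non_redundant_spindles
    return queue_length > ceiling
```

In practice the measured values would come from performance counters collected by your monitoring tools; this sketch covers only the comparison step.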
Network Load Balancing
NLB increases availability by redirecting incoming network traffic to working cluster hosts if a host fails or is offline. As a result, even when existing connections to an offline host are lost, Internet services remain available. In most cases, client software automatically retries the failed connection, and the clients experience only a few seconds’ delay in receiving a response. NLB is a feature of Windows 2000 Advanced Server and Windows 2000 Datacenter Server.
Application Center provides enhancements to NLB by enabling NLB to run on Windows 2000 Server, by providing operations management and monitoring tools, and by providing the Request Forwarder component to facilitate load balancing of customers behind proxy server farms.
When a user accesses a site through a proxy server, the user might appear to NLB to have different IP addresses; even when session affinity is enabled, this might cause the user to be directed to different servers. If your site uses session state management at the physical server, this might cause problems. If you use the Request Forwarder in Application Center, the user connection can be forwarded to the server where the user is recognized, regardless of where it is assigned in the server farm.
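The proxy problem can be made concrete with a short Python sketch. It does not reproduce the NLB algorithm or the Request Forwarder implementation: hashing the client IP address with SHA-256 is only a stand-in for affinity-based host selection, and the host_for_ip and route functions and the session_owner argument are hypothetical.

```python
import hashlib


def host_for_ip(client_ip, hosts):
    """Pick a host from the source IP address (stand-in for client affinity)."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return hosts[int.from_bytes(digest[:4], "big") % len(hosts)]


def route(client_ip, session_owner, hosts):
    """Return (arrival_host, serving_host) for one request.

    session_owner is the host that holds the user's session state, for
    example as recorded in a cookie, or None for a new user. When the request
    arrives at a different host, it is forwarded to the owner.
    """
    arrival = host_for_ip(client_ip, hosts)
    serving = session_owner if session_owner in hosts else arrival
    return arrival, serving
```

The same source address always maps to the same host, but a user whose requests leave a proxy farm from several addresses can arrive at several hosts. Forwarding on a session identifier rather than the address keeps that user on one server.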
Web Farm/Active Directory Authentication
When a user accesses a site with multiple Web servers, the request is directed to a particular server, based on some method of load balancing or a round-robin algorithm. When the request arrives at the server, the user is asked to log in.
If you are using Active Directory for authentication, the login is cached by the Internet Server Application Programming Interface (ISAPI) filter and is specific to that server. Commerce Server also places a ticket cookie containing the user ID on the client computer. To enable users to be seamlessly redirected to other Web servers in the farm, you must ensure that sufficient information can be passed to the ISAPI filter on the other servers so that the user can be logged in. You must write custom code to hide this process from the user. To write the custom code, do the following:
- Extend the Profiles store to store the password.
- Capture the password in the Profiles store during login or site registration.
- Modify the site registration or login page to check for the presence of the ticket cookie.
If the cookie exists, use the MSCSAuthManager object to get the user ID from the cookie and retrieve the username:password from the Profiles store, then pass the username:password back to the ISAPI filter.
To operate a secure data center, you must store the passwords with reversible encryption. Because passwords are captured at the login page and written to the Profiles store, the login page also captures password changes. (However, if Active Directory tools are used to change a password during a browser session, the user will be prompted for login and the new password will be captured.)
You can address the previous authentication issue by deploying Application Center Request Forwarder, which can forward the request prior to authentication being assessed and can then forward the user to the server that will recognize the user.
Active Directory Availability
Although you need only a single domain controller for each domain, a single domain controller is a single point of failure. To increase availability, add domain controllers to the domain. Active Directory uses two-way, multi-master replication to synchronize data on each domain controller and to keep information consistent among all domain controllers in a single domain.
The information stored in Active Directory on every domain controller (whether or not it is a global catalog server) is partitioned into three categories: domain, schema, and configuration data. Each of these categories is in a separate directory partition (also called a Naming Context). The directory partitions are the units of replication. The domain data partition contains all of the objects in the directory for the domain. Data in each domain is replicated to every controller in the domain, but not beyond.
If the domain controller is a global catalog server, it also contains a fourth category of information: a partial replica of the domain data directory partition for all domains. This partial replica contains a subset of the properties for all objects in all domains. (A partial replica is read-only; a complete replica is read/write.)
By default, the partial set of attributes stored in the global catalog includes those attributes most frequently used in search operations, because one of the primary functions of the global catalog is to support clients querying the directory. Using global catalogs to perform partial domain replication instead of doing full domain replication reduces WAN traffic.
Figure 13 shows replication within a site. Three domain controllers (one of which is a global catalog) replicate schema data and configuration data, in addition to all directory objects (with a complete set of each object’s attributes).
Figure 13. Replication within site domains
Active Directory attempts to establish a topology that allows at least two connections to every domain controller so that if a domain controller becomes unavailable, directory information can still reach all online domain controllers through the other connection. Active Directory also automatically evaluates and adjusts for changes in the state of the network. For example, when a domain controller is added to a site, the replication topology is adjusted to efficiently incorporate the new addition.
Replication Between Sites
You can also use Active Directory to optimize both server-to-server and client-to-server traffic over WAN links. Having multiple sites can provide redundancy in the event of a geographical disaster. Best practices for setting up multiple sites include the following:
- Set up a site in every geographic area that requires fast access to the latest Active Directory information.
- Place at least one domain controller at every site and make at least one domain controller in each site a global catalog. Sites that do not have their own domain controllers and at least one global catalog depend on other sites for directory information and are less efficient.
Every disaster recovery plan must include backup and restoration strategies. When a domain controller fails, either due to environmental hazards or due to equipment malfunction, you should first repair the domain controller itself and then recover the data. Active Directory is able to recover lost data because:
- The database uses log files to recover lost data.
- The directory service uses replication to recover data from other servers in the domain.
There are a variety of tools that you can use to repair the domain controller and recover Active Directory. For more information about Windows 2000 disaster protection, including backups, restores, and repairs, see the Windows 2000 Server Operations Guide. Also, see Windows 2000 Server Help.
You can back up directory data and configuration files to traditional media, such as magnetic tape, or to a separate local or network disk drive. Directories are frequently replicated to provide load balancing, high availability, and localized access in a distributed environment.
It’s important to understand the implications of restoring a replica of a directory from tape. Restoring from tape takes a long time, and backed-up data is only current as of the time the backup was created.
In most cases, it is better to rebuild a directory from peer replicas, because the data in the peer replicas is already online and the data is always current.
There are two methods of restoring Active Directory:
- Non-authoritative restore (the default). Use when at least one other domain controller in the domain is available and working. After a non-authoritative restore, Active Directory replication automatically begins propagating any changes from other domain controllers that occurred after the time of the backup.
- Authoritative restore. Use only when you have accidentally deleted critical data from the local domain controller and the delete has propagated out to other domain controllers.
Active Directory Monitoring Tools
Active Directory provides the monitoring tools described in the following table.
|Active Directory monitoring tool||Use to monitor|
|Event Log||Information about hardware, software, and system problems. You can use Event Logs to aggregate system uptime by monitoring a combination of system startup, "clean" shutdown, and "dirty" shutdown events.|
|Performance Logs and Alerts||Performance counters contained in categories of objects. You can configure Performance Logs and Alerts to alert you when designated thresholds have been exceeded. You choose the criteria you want reported and the manner in which you want it reported to you.|
|Replication Monitor||Low-level status and performance of replication between Active Directory domain controllers. Replication Monitor is available in the Windows 2000 Resource Kit.|
SQL Server Availability
This section describes three strategies that you can consider using to create a highly available database layer (bottom tier):
- Clustering
- Replication
- Warm backup
This section also provides examples of retry code logic. For more information about SQL Server availability, see SQL Server Books Online.
Two or more SQL Server servers sharing common data form a cluster of servers that can work together as a single server. Each server is called a node and each node can operate independently of the other nodes in the cluster. Each node has its own memory, system disk, operating system, and subset of the cluster’s resources. If one node fails, another one takes ownership of the failed node’s resources. The cluster service then registers the network address for the resource on the new node so that client traffic is routed to the new server. When the failed server is brought back online, the cluster service can be configured to redistribute resources and client requests.
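The ownership transfer described above can be modeled in a few lines of Python. This is a toy model rather than the cluster service: the fail_over function and the resource and node names are hypothetical, and resources are simply handed to the surviving nodes in rotation.

```python
def fail_over(ownership, failed_node, nodes):
    """Return a new resource-to-node map after failed_node goes offline."""
    survivors = [node for node in nodes if node != failed_node]
    if not survivors:
        raise RuntimeError("no surviving node can take ownership")
    reassigned = dict(ownership)
    moved = sorted(name for name, owner in ownership.items() if owner == failed_node)
    for index, resource in enumerate(moved):
        # Hand the failed node's resources to the survivors in rotation.
        reassigned[resource] = survivors[index % len(survivors)]
    return reassigned


# Usage: node1 fails in a two-node cluster, so node2 takes over its resources.
before = {"sql-instance": "node1", "data-disk": "node1", "virtual-ip": "node2"}
after = fail_over(before, "node1", ["node1", "node2"])
```

Clients keep using the same network name because the address moves with the resource. Only the owning node changes.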
The following table lists three techniques you can use to make disk data available to more than one server.
|Technique||Description|
|Shared disk||Although no longer requiring expensive cabling and switches, the shared-disk technique, in which multiple servers share the same disk, still requires specially modified applications using software called a Distributed Lock Manager (DLM).|
|Mirrored disk||More flexible than the shared-disk technique, the mirrored-disk technique is based on each server having its own disks, and using software that "mirrors" every write from one server to a copy of the data on another server. This technique is very useful for keeping data at a disaster-recovery site synchronized with a primary server.|
|Shared nothing||In a shared-nothing architecture configuration, each server owns its own disk resources. If a server fails, a shared-nothing cluster has software that can transfer ownership of a disk from one server to another. This technique provides the same high level of availability as shared-disk clusters, with potentially higher scalability, because it does not have the inherent bottleneck of a DLM.|
SQL Server 2000 provides improved interoperability with Windows Clustering. Prior to SQL Server 2000, SQL Clustering used three rebinding virtual DLLs interjected between every SQL Server or client component and the corresponding kernel DLLs. SQL Server 2000 also contains the following improvements over previous versions:
- Clustering is much easier to manage. It is implemented through the use of SQL Instancing, which removes the need for the virtual mapping layer so that you can directly manage individual components in the cluster environment.
- All SQL Server tools are cluster-aware, which makes the environment significantly more robust. The tools are now integrated, which eliminates the need to coordinate the use of multiple tools. Previously, when you had to use several tools to manage a cluster, any misuse or lack of coordination in using the tools could cause the cluster to fail.
- Clusters can contain up to four nodes (instead of two in previous versions).
- You can now perform rolling upgrades on nodes in a Windows Clustering cluster. Prior to SQL Server 2000, you had to take the entire cluster offline to perform an upgrade. With SQL Server 2000, you can take just one node offline at a time to perform the upgrade, leaving the other nodes online.
- Full-text search is now cluster-aware.
- OLAP can be clustered when you use it with SQL Server 2000. OLAP by itself is not cluster-aware.
You can use SQL Server replication to synchronize your site’s database servers. SQL Server replication options include snapshot replication and transactional replication. Consider using SQL Server replication to do batch updates if your application logic is aware of the nature of database access.
For example, you can update the Product Catalog System in Commerce Server primarily in batch mode and make all client access read-only. Another example is the OrderForm object, which is created and changed only a few times by known pipeline components, but is read frequently by other pipeline components during check-out for operations such as tax and shipping calculations. In both examples, consider directing updates to a small number of stores and directing reads to a bank of read-only servers.
You should also consider using SQL Server replication to keep geographically distributed servers synchronized. Figure 14 shows how you might cluster SQL Server servers in the database layer.
Figure 14. Clustering SQL Server servers
If you place NLB in front of a group of read-only SQL Server servers, you can use the same server name for the entire cluster. The benefit of doing this is that you can then remove any server in the cluster from service with minimal impact to users. You can also enable replication from the online transaction processing (OLTP) environment to the cluster. Use transactional replication if you require higher transactional consistency and need to minimize latency between the OLTP server and the load-balanced, read-only servers.
Consider the following for installations of NLB query clusters:
- With any installation of NLB, there are at least two dedicated IP addresses (one per server) and one virtual IP address. For replication between the OLTP environment and the cluster to work correctly, replication must always be made directly to the dedicated IP addresses (or individual server names) and never be made directly to the virtual IP address or to the cluster name. Because you use NLB to balance the load across the cluster, replicating directly to the virtual IP address (or cluster name) would cause transactional inconsistency between servers in the cluster.
- Although you can use either the push or pull metaphor for this configuration, the best practice is to push all replication to servers in the cluster. The reason for using the push metaphor is that you can manage the distribution jobs centrally and be alerted if one of the servers in the cluster stops responding. (The replication job will fail for the server that stops responding.)
- Because replication is occurring between the OLTP environment and the cluster, replication must be able to write to the databases on servers in the cluster, so you can't set those databases to read-only. As a result, you must use login permissions to maintain read-only security. All client connections to servers in the cluster must connect through the virtual IP address and have read-only permissions. It is also necessary to ensure that all servers in the read-only cluster include the same subset of logins.
If you want to apply a snapshot to subscribers in the cluster, the best practice is to disconnect servers individually from the cluster, apply the snapshot, and then reconnect each server to the cluster. This produces maximum availability and maintains transactional consistency between servers in the cluster.
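The one-server-at-a-time procedure can be sketched as a small Python routine. The disconnect, apply_snapshot, and reconnect callbacks are hypothetical placeholders for whatever NLB and replication tooling you use; nothing here calls a real NLB or SQL Server API.

```python
def apply_snapshot_rolling(servers, disconnect, apply_snapshot, reconnect):
    """Apply a snapshot to each server in turn, with at most one server offline.

    If apply_snapshot raises, the affected server stays out of the cluster and
    the remaining servers are left untouched and still serving requests.
    """
    refreshed = []
    for server in servers:
        disconnect(server)       # stop sending client traffic to this server
        apply_snapshot(server)   # bring its data up to date
        reconnect(server)        # return it to the load-balanced cluster
        refreshed.append(server)
    return refreshed


# Usage with callbacks that only record what would happen.
events = []
done = apply_snapshot_rolling(
    ["sql1", "sql2"],
    lambda server: events.append(("disconnect", server)),
    lambda server: events.append(("apply", server)),
    lambda server: events.append(("reconnect", server)),
)
```

Leaving a server out of the cluster when its snapshot fails is a deliberate choice in this sketch: a server with stale or partial data should not serve reads alongside up-to-date servers.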
If a single production server provides both read and write access to data, you should log all transactions. You can use SQL Server Log Shipping to transfer files to a non-production server that is continuously updated with the transaction log files. This technique is called warm backup because the backup database is always "warmed up" by continuous application of the transaction logs. This option is inexpensive and easy to manage, and provides a strategy for availability in environments where there is some tolerance for downtime.
You can use this method in combination with other strategies described in this section to recover from disaster when availability is particularly critical. You can also use warm backups to keep geographically dispersed data centers reasonably synchronized. In fact, this is an inexpensive way to maintain a synchronized database in a separate data center for site disaster recovery.
Combining NLB and SQL Server Log Shipping provides excellent warm failover capability. In this scenario, you can use NLB to provide the same IP address to at least two servers. This makes it easier to fail over to the secondary SQL Server server if the primary SQL Server server fails. You direct client applications through the NLB cluster to one of the SQL Server servers. If a failure occurs, or when it is time for scheduled downtime, you can use NLB to transfer access to another SQL Server server in the cluster. Log Shipping synchronizes the servers, based on the log’s shipping frequency.
There are some things you need to consider when you set up NLB and Log Shipping:
- To set up Log Shipping within an NLB cluster, the SQL Server servers must communicate over a private network that is isolated from the NLB network. To create the private network, you must install a second NIC in each of the SQL Server servers.
- You must set up Log Shipping with a hard-coded IP address instead of using Universal Naming Convention (UNC) server names, or do one of the following:
- Create an Lmhosts file to resolve the IP addresses to a UNC server name.
- Register the IP addresses in Windows Internet Naming Service (WINS) to resolve the UNC name, if a WINS server exists on the private network.
- When you set up the SQL Server servers in the NLB cluster, the primary SQL Server server is connected to the cluster. You must then disconnect the secondary Log Shipping SQL Server server from the cluster. The servers stay synchronized by having the primary server "log ship" the data to the secondary server.
- If you don’t isolate the communications between the SQL Server servers, NLB will attempt to communicate directly with the IP address on the second NIC, causing looped traffic and other problems.
If the primary SQL Server server fails, NLB can automatically fail over to a secondary server. Depending on the Log Shipping frequency, however, the best practice is to make the decision to fail over to the secondary server manually. For example, if you have set Log Shipping to run every five minutes, the secondary server could be as much as five minutes behind (assuming all processes are functioning properly). By failing over the system manually, you can check to see if the latest transaction log can be applied so that more data can be saved, instead of failing over automatically and perhaps losing up to five minutes of data.
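The trade-off between automatic and manual failover is simple arithmetic, shown here as a Python sketch. The data_at_risk function and its parameter names are hypothetical; the only inputs are the time covered by the last log applied on the secondary server, the time of the failure, and whether the final transaction log could be recovered and applied before failing over.

```python
from datetime import datetime, timedelta


def data_at_risk(last_log_applied_at, failed_at, tail_log_recovered):
    """Return the window of transactions lost if you fail over now."""
    if tail_log_recovered:
        return timedelta(0)   # the latest log was applied first; nothing is lost
    return max(failed_at - last_log_applied_at, timedelta(0))


# Usage: logs ship every five minutes; the primary fails four minutes after
# the last log was applied on the secondary server.
interval = timedelta(minutes=5)
applied = datetime(2001, 1, 1, 12, 0)
failed = datetime(2001, 1, 1, 12, 4)
automatic_loss = data_at_risk(applied, failed, tail_log_recovered=False)
manual_loss = data_at_risk(applied, failed, tail_log_recovered=True)
```

Here an automatic failover loses four minutes of transactions, which is within the five-minute shipping interval, while a manual failover that first applies the latest log loses none.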
You install stored procedures to facilitate failover when you install SQL Server Log Shipping. For more information about Log Shipping, see SQL Server Books Online.
Retry Code Logic
This section contains examples of retry logic to protect against the loss of database connectivity:
- One retry level
If (getAuth == error)
    Sleep 1 second
    GetAuth again
    If (failed again)
        Response.write ("Sorry, logins are disabled. Please e-mail questions to mailto:helpdesk.")
- Two retry levels
If (getcatalog == error)
    Sleep 1 second
    Getcatalog again
    If (failed again)
        Response.write ("I'm retrying your query; please wait.")
        Sleep 10 seconds
        Getcatalog again
        If (failed again)
            Response.write ("Sorry, our catalog is temporarily unavailable. Please e-mail questions to mailto:helpdesk.")
- Don’t retry
if (getDiscount == error)
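The retry policies in this list can be expressed with one small Python helper. This is a sketch under stated assumptions rather than Commerce Server code: call_with_retries, its delays and notify parameters, and the use of ConnectionError to signal a lost database connection are all hypothetical. The "Don't retry" example above is incomplete, so what the page does after a failed call is left to the caller here.

```python
import time


def call_with_retries(operation, delays=(), sleep=time.sleep, notify=None):
    """Call operation; after each failure wait the next delay and try again.

    delays=()       -> don't retry
    delays=(1,)     -> one retry level
    delays=(1, 10)  -> two retry levels
    notify, if given, is called with a message before the second and later
    retries. The last error is re-raised when every attempt has failed.
    """
    try:
        return operation()
    except ConnectionError as error:
        last_error = error
    for level, delay in enumerate(delays, start=1):
        if notify is not None and level > 1:
            notify("I'm retrying your query; please wait.")
        sleep(delay)
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
    raise last_error


def flaky(failures):
    """Build a test operation that fails the given number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("database unavailable")
        return "ok"

    return operation, state
```

For example, call_with_retries(get_catalog, delays=(1, 10)) mirrors the two-level case, where get_catalog stands for your own data-access function. When the final exception propagates, the page can display the "temporarily unavailable" message.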
Information about highly available solutions is available on many third-party Web sites. For more information, you can search those sites for "high availability" or "high-availability."
Information about fault-tolerant systems is also available on some third-party Web sites. For more information, you can search those sites for "fault tolerant" or "fault-tolerant."
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
The example companies, organizations, products, people and events depicted herein are fictitious. No association with any real company, organization, product, person or event is intended or should be inferred.
© 2001 Microsoft Corporation. All rights reserved.
Microsoft, Active Directory, MSN, and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.