Protecting Privacy in the Development and Testing of Human Resources Applications
Published: November 2010
Internal policies and governmental regulations dictate how Microsoft must handle and store personally identifiable information (PII) about employees. The Human Resources Information Technology (HR IT) team at Microsoft created a strategy for using live data feeds from applications that, by function, contain PII. With this strategy, HR IT can test the solutions that it is developing under real-world conditions without risking inadvertent disclosure of PII.
This paper is intended for technical decision makers and IT professionals who are interested in learning how to help keep sensitive data safe but still allow it to be used during development and testing.
HR IT must access live HR systems to develop and test the custom applications that it builds. To help ensure that the data that HR IT uses in the development and test environment is as safe as it is in the production environment, HR IT developed a strategy for sanitization (removal) of PII from this data. This article details the strategy, discusses how the team piloted and tested the data sanitization solution, and describes techniques that the team uses to help ensure the success of the sanitization process.
To begin the data sanitization project, HR IT needed to answer the following questions:
What data is PII or may be associated with PII?
What is the correct tool or solution to sanitize data when multiple applications rely on specific data points and key data elements?
Will the sanitization strategy affect development or disrupt the testing framework?
Will the sanitization strategy create overhead in terms of time and performance?
Identifying Personal Data
Identifying data that contains PII can be complex. Although some data values are obviously PII, such as last names, Social Security numbers (SSNs), driver’s license numbers, and financial information, the attributes of other data points may be less clear. Information such as hire dates, job titles, and leave dates may not appear to be inherently PII when taken as a single value, but when used in combination, they could reveal an employee’s identity.
For example, if a malicious user of Microsoft systems knows that a leave of absence occurred on certain days, he or she could use the date for the beginning of the leave and the date for the end of the leave, together with items such as profession, discipline, and job title, to identify the individual. The malicious user could then use this information to associate information like stock level, salary, and other PII. HR IT had to determine what combination of data points it should sanitize to make sure that a malicious user cannot infer an identity.
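One way to look for such combinations is to count how many records share each combination of values and flag the combinations that occur too rarely. The following is a minimal sketch of that idea, not the method that HR IT used; the function name, the column names, the threshold k, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter

def risky_combinations(records, columns, k=2):
    """Return the value combinations of `columns` shared by fewer than k records."""
    counts = Counter(tuple(record[col] for col in columns) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Invented sample rows; none of this is real employee data.
employees = [
    {"job_title": "Engineer", "leave_start": "2010-03-01", "leave_end": "2010-03-15"},
    {"job_title": "Engineer", "leave_start": "2010-03-01", "leave_end": "2010-03-15"},
    {"job_title": "Recruiter", "leave_start": "2010-05-10", "leave_end": "2010-06-01"},
    {"job_title": "Recruiter", "leave_start": "2010-07-01", "leave_end": "2010-07-09"},
]

# Job title alone does not single anyone out in this sample.
by_title = risky_combinations(employees, ["job_title"])

# Job title combined with leave dates isolates two individual records.
by_title_and_leave = risky_combinations(
    employees, ["job_title", "leave_start", "leave_end"]
)
```

In this sample, no job title is unique on its own, but adding the leave dates isolates both recruiters, which is the kind of combination that would need to be sanitized.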
Choosing the Sanitization Strategy
The HR IT team reviewed three strategies for sanitizing data.
Sanitize PII and non-PII. This approach includes sanitizing all data points (PII data points and non-PII data points that might lead to identification of an individual). However, many business rules may be associated with non-PII points, and the sanitization of them might make testing the business scenarios impossible. For example, an employee’s job title is used in several stages of HR IT applications, such as applications for recruiting, staffing, and performance management. If this information is sanitized, it will cause tests that rely on it to fail.
Sanitize only PII data. This approach sanitizes only high business impact (HBI) PII data points such as SSN, personnel number, license number, and e-mail address, but does not sanitize non-PII data points such as job title, work phone, and salary. Sanitizing HBI and PII data points reduces the chance of making an association of personal information with the database record values. However, this approach still carries risks. For example, if the chief executive officer (CEO) of the company earns the highest salary, a malicious user can simply look for the highest salary or job title to identify the record of the CEO and determine other details. An organization can mitigate this risk by running scripts to eliminate these special cases from the data.
Sanitize non-PII data values. This approach works when business logic exists on the primary key that contains PII data points. With this approach, all data points are sanitized, except the primary-key PII data point. For example, an application may have business logic related to SSNs, so the SSN cannot be sanitized. In this scenario, the remaining data points like first name, full name, and last name can be sanitized, so the SSN cannot be connected to an identifiable person. The drawback to this approach is that the high volume of data that must be sanitized may affect performance.
HR IT chose the strategy of sanitizing non-PII data values for relevant applications. Advantages to this approach include the preservation of data relationships and constraints. Additionally, because the data points used for external systems are not sanitized, external application interfaces such as SAP continue to work as expected.
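The chosen strategy can be illustrated with a short sketch that scrambles every column except the primary key, so that a lookup in an external system on that key still succeeds. This is an illustration only: the column names, the hashing key, the external_feed dictionary, and the use of a truncated keyed SHA-1 digest are assumptions, not details of the Microsoft tool.

```python
import hashlib
import hmac

# Hypothetical column names and hashing key; not taken from a real HR schema.
KEY_COLUMNS = {"personnel_number"}
SECRET = b"example-hashing-key"

def scramble(value, secret):
    """Return the first 12 hex characters of a keyed SHA-1 digest of the value."""
    digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha1)
    return digest.hexdigest()[:12]

def sanitize_record(record, key_columns=KEY_COLUMNS, secret=SECRET):
    """Scramble every column except the key columns that other systems join on."""
    return {
        col: value if col in key_columns else scramble(value, secret)
        for col, value in record.items()
    }

hr_row = {
    "personnel_number": "100234",
    "first_name": "Ana",
    "last_name": "Smith",
    "stock_level": "62",
}
# Stand-in for an external interface that is keyed on the personnel number.
external_feed = {"100234": {"cost_center": "CC-17"}}

sanitized = sanitize_record(hr_row)
linked = external_feed[sanitized["personnel_number"]]  # the lookup still works
```

Because the key column passes through untouched, the lookup against the external feed behaves exactly as it does with unsanitized data, while the remaining columns no longer carry their original values.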
Choosing the Sanitization Technology
There are many ways to sanitize PII, such as scrambling data values, using data substitution methods, masking data, clearing data, shuffling records, and employing encryption and decryption. Hardware encryption is another method for securing data. Unfortunately, hardware encryption cannot occur on virtual servers.
Multiple third-party tools are available to sanitize data, and custom tools can be developed through application programming interfaces (APIs) for hashing algorithms. HR IT chose to use a Microsoft internal tool for several reasons, including the ability to use the stronger SHA-1 hashing algorithm and to customize the workflow of the sanitization process.
The Microsoft Data Sanitization Tool is an internally used application built on the Windows Server® 2003 operating system, Internet Information Services (IIS), and the Microsoft® .NET Framework. The Data Sanitization Tool uses a three-part workflow with pre-sanitization, sanitization, and post-sanitization phases. A three-part process helps ensure that the administrator can control the fields to be sanitized and can rebuild data relationships and constraints.
In the pre-sanitization phase, the Data Sanitization Tool scans the database schema and creates XML files that represent the schemas for each database to be included in the sanitization. The administrator then chooses the data points to be sanitized and begins the sanitization phase. The sanitization phase removes all indexes, primary and foreign keys, triggers, and default constraints from the tables being sanitized. In the post-sanitization phase, the administrator rebuilds the indexes, keys, triggers, and constraints.
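The three phases can be sketched on a small scale with the sqlite3 module from the Python standard library. The sketch below is not the Data Sanitization Tool: it uses SQLite as a stand-in database, handles only explicitly created indexes (not keys, triggers, or default constraints), and the table name, column name, and placeholder scramble function are assumptions for illustration.

```python
import sqlite3

def sanitize_table(conn, table, column, scramble):
    """Drop a table's explicit indexes, rewrite one column, then rebuild the indexes."""
    # Pre-sanitization: record the definitions of the explicitly created indexes.
    indexes = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()
    # Sanitization: remove the indexes, then rewrite the chosen column.
    for name, _ in indexes:
        conn.execute(f'DROP INDEX "{name}"')
    rows = conn.execute(f'SELECT rowid, "{column}" FROM "{table}"').fetchall()
    conn.executemany(
        f'UPDATE "{table}" SET "{column}" = ? WHERE rowid = ?',
        [(scramble(value), rowid) for rowid, value in rows],
    )
    # Post-sanitization: rebuild the indexes from the saved definitions.
    for _, sql in indexes:
        conn.execute(sql)
    conn.commit()
    return [name for name, _ in indexes]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (personnel_number TEXT PRIMARY KEY, last_name TEXT)")
conn.execute("CREATE INDEX idx_last_name ON employee (last_name)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)", [("1001", "SMITH"), ("1002", "JONES")]
)

# Placeholder scrambler: replaces each value with X characters of the same length.
rebuilt = sanitize_table(conn, "employee", "last_name", lambda v: "X" * len(v))
names = [row[0] for row in conn.execute("SELECT last_name FROM employee")]
```

Saving the index definitions before dropping them is what makes the rebuild step mechanical; a production process would need the same treatment for keys, triggers, and default constraints.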
Implementing and Testing the Solution
The HR IT team used a four-phased process in implementing its data sanitization solution: envisioning, triage, implementation, and stabilization.
Envisioning. In this phase, a technical privacy manager (TPM) provided the Data Sanitization Tool to the adoption team. The TPM also explained what information would be required to start the process. For example, the adoption team needed to supply the names of the databases to be sanitized, the frequency of sanitization, cross-database dependencies, and target location.
Triage. In this phase, the TPM and one lead person from each discipline (development, testing, and project management) met and decided on the PII data points to be sanitized, as well as the hashing methods to be applied on the PII data points.
Implementation. This phase began after the PII points, hashing method, and cross-database dependencies were identified. The same hashing key must be used to sanitize across environments. The HR IT team used the Data Sanitization Tool and then developed an application wrapper that calls the Data Sanitization Tool executable files and completely automates the sanitization process. The team used two internal applications, a leave-of-absence reporting tool and a career management roadmap, to pilot the data sanitization solution before applying it to all relevant HR applications at Microsoft.
Stabilization. In this phase, the TPM or an operations team member who had permissions to view unsanitized data compared the sanitized and unsanitized data by using analysis tools. These tools validated that the data sanitization had been properly implemented and that all the identified PII data points had been sanitized. Then, the test team received the sanitized database for more rigorous testing to make sure that data integrity and the user interface (UI) were maintained after sanitization. If the test team found that test scenarios failed, they consulted with the TPM to triage the failures and evaluate another way to sanitize the data that would not disrupt the testing scenarios.
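The comparison performed during stabilization can be pictured as a check that no identified data point survived sanitization unchanged. The sketch below is an assumption about what such a check might look like, not a description of the tools that the team used; the function name, column names, and rows are invented.

```python
def unsanitized_leaks(original_rows, sanitized_rows, sanitized_columns):
    """Return (row_index, column) pairs whose value is unchanged after sanitization."""
    leaks = []
    for index, (before, after) in enumerate(zip(original_rows, sanitized_rows)):
        for col in sanitized_columns:
            # Null values carry no information, so they are not reported as leaks.
            if before[col] is not None and before[col] == after[col]:
                leaks.append((index, col))
    return leaks

# Invented rows in matching order; the second row was missed by the sanitization run.
original = [
    {"personnel_number": "1001", "last_name": "SMITH", "stock_level": "62"},
    {"personnel_number": "1002", "last_name": "JONES", "stock_level": "59"},
]
sanitized = [
    {"personnel_number": "1001", "last_name": "QWERT", "stock_level": "17"},
    {"personnel_number": "1002", "last_name": "JONES", "stock_level": "44"},
]

leaks = unsanitized_leaks(original, sanitized, ["last_name", "stock_level"])
```

A check like this must run with access to the unsanitized data, which is why the source restricts it to people who already have permission to view that data.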
Testing the Results
HR IT found that several generic high-priority test cases should be part of the testing kit for the sanitization process. The HR IT team is working on eventually automating the testing application process to cover all of the following scenarios:
Field overflow. Data type and data length should be preserved before and after sanitization.
User interface. The UI should not look distorted after the data is sanitized, and string spaces must be preserved for the values that are sanitized.
Domain integrity at the column level. If a value in a column such as first name appears many times in a table, scrambling should produce the same sanitized value for every occurrence. For example, if "SMITH" is scrambled to "SSSKKKK," every row of that table that contains "SMITH" should produce "SSSKKKK."
Entity integrity at the row level. Integrity must be maintained at the row level—for example, a first name, last name, and full name. If the first name is scrambled to “XXX” and the last name is scrambled to “YYY,” the full name should be scrambled to “XXX YYY.”
Referential integrity. A self-referential key must be maintained when the data value references itself in another column of the same table. Passing the same hashing key while running the sanitization process helps maintain data integrity when primary-key and foreign-key relationships exist across tables on a particular data column. Sanitizing across databases also requires use of the same hashing key.
Issues related to sanitizing e-mail fields. In many applications, an e-mail address may be used for performing Windows® authentication or as part of distribution security groups on which role-based permissions are granted. If the e-mail address is sanitized through random string values, the application authentication logic will fail. The solution is to replace e-mail addresses with valid test-only e-mail accounts and make them part of the security groups.
Repeatable sanitization. Each import and run of the sanitization process should yield the same sanitized values to facilitate successful and repeatable test cases.
Sequential data sanitization. The data sanitization tool must respect the sanitization order that the administrator chooses, if a workflow depends on the database. For example, first name and last name may need to be sanitized before full name is sanitized.
Verification of constraints and indexes. Indexes and constraints must be intact before and after sanitization of the database.
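Several of the scenarios above (field overflow, preserved spaces, domain integrity, entity integrity, repeatability, and sanitization order) follow from using a deterministic, keyed scrambling function. The following minimal sketch shows one such function together with the properties it is meant to satisfy. It is not the algorithm of the Data Sanitization Tool: the use of HMAC with SHA-1, the mapping to uppercase letters, the function names, and the hashing key are all assumptions for illustration.

```python
import hashlib
import hmac
import string

def scramble(value: str, key: bytes) -> str:
    """Map a value to uppercase letters of the same length, keeping spaces in place."""
    digest = b""
    counter = 0
    # Extend the keyed digest until there is one byte per input character.
    while len(digest) < len(value):
        message = f"{counter}:{value}".encode("utf-8")
        digest += hmac.new(key, message, hashlib.sha1).digest()
        counter += 1
    return "".join(
        " " if ch == " " else string.ascii_uppercase[byte % 26]
        for ch, byte in zip(value, digest)
    )

def sanitize_person(first: str, last: str, key: bytes) -> dict:
    """Sanitize first and last name first, then build full name from the results."""
    first_s = scramble(first, key)
    last_s = scramble(last, key)
    return {"first_name": first_s, "last_name": last_s, "full_name": f"{first_s} {last_s}"}

KEY = b"example-hashing-key"  # hypothetical; the same key must be reused on every run

smith_once = scramble("SMITH", KEY)
smith_again = scramble("SMITH", KEY)
spaced = scramble("ANA MARIA", KEY)
person = sanitize_person("ANA", "SMITH", KEY)
```

Because the output depends only on the input value and the key, the same key yields the same sanitized value in every row, table, database, and run, which is what keeps primary-key and foreign-key relationships aligned. Building the full name from the already sanitized parts is one way to honor the required sanitization order.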
During the course of planning and implementing the data sanitization project, the HR IT team learned lessons that yielded the following best practices.
Use an Appropriate Strategy for Data Sanitization
Choosing the right strategy is based largely on the nature of the data elements and how applications use that data. The HR IT team found that the primary key within the data was frequently necessary to integrate with other applications. In most cases, HR IT strongly advocates not sanitizing the primary key, because leaving it intact eliminates the need to perform additional testing on important business rules that rely on the primary key. For example, a primary key of e-mail address might be used as a unique identifier across applications. This data cannot successfully be sanitized without losing the key.
Use a Proof of Concept
Using a proof of concept was important for the success of the project. By using only two applications, the HR IT team limited the scope of the effort while gaining insight into best practices for sanitization. The approach that HR IT used for the proof of concept was to sanitize non-PII data values. HR IT sanitized the data values associated with the employee and the data that the employee provided. It would still be possible to identify the person by his or her name, but it would not be possible to match that employee with the data associated with the employee, because the data itself would be scrambled. For example, employee career tracking, competency assessments, stock level, and similar data points are now sanitized. Employee names are known and available from other systems for regular business use. The method described, however, does not apply to customer-related PII.
The strategy of sanitizing non-PII data values works well for HR-related applications. Data points like personnel number and e-mail address are the key elements used within the business logic for such applications. Leaving these primary-key data points intact enabled the HR IT applications to continue operating as usual.
Create Small Test Samples for Process Testing
Some applications are tightly integrated with other applications. HR IT recommends creating a small sample of data that originates from internal databases for process testing when sanitizing all non-PII values is not feasible. Using samples of production data as input for the data sanitization is also an alternative if the cost and time for sanitizing data are unacceptable because the database that must be sanitized is large (100 gigabytes or larger).
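A small sample is only useful for process testing if related rows stay together. The sketch below draws a repeatable random sample of parent rows and keeps only the child rows that reference them. It is one possible approach under stated assumptions: the function name, the seed, the key column, and the sample data are invented for illustration.

```python
import random

def sample_with_children(parents, children, key, size, seed=2010):
    """Pick a repeatable random sample of parents and the children that reference them."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    chosen = rng.sample(parents, min(size, len(parents)))
    chosen_keys = {parent[key] for parent in chosen}
    related = [child for child in children if child[key] in chosen_keys]
    return chosen, related

# Invented data: ten employees, each with one leave record.
employees = [{"personnel_number": str(1000 + i)} for i in range(10)]
leaves = [
    {"personnel_number": str(1000 + i), "leave_start": f"2010-01-{i + 1:02d}"}
    for i in range(10)
]

sample_employees, sample_leaves = sample_with_children(
    employees, leaves, key="personnel_number", size=3
)
```

The sample can then be passed through the same sanitization process as a full database, at a fraction of the cost and time.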
Many Microsoft HR IT applications use common access points to data that previously contained PII. When developers create applications, they frequently need access to relevant, realistic test data. HR IT undertook an effort to provide data that was free of PII but still remained useful for developers.
As part of the process, HR IT examined three approaches for data sanitization and chose to sanitize information that was not personally identifiable while still maintaining the important key relationships that many applications need. HR IT used an internal application for data sanitization and enhanced the application by automating the sanitization process. HR IT identified several testing scenarios as high priority to help ensure that the sanitization was successful and did not affect any applications.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the World Wide Web, go to:
© 2010 Microsoft Corporation. All rights reserved.
This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Microsoft, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.