Design Concepts for Correlating Digital Identities

10/05/2010

Applies To: Windows Server 2003 with SP1

The success of an identity integration solution is closely tied to the quality of the planning you do up front. To properly design and plan an identity integration solution, you must have a complete understanding of how the design decisions you make affect the data flow in and out of the Microsoft® Identity Integration Server (MIIS) 2003 environment. The key to making good design decisions is to understand the entire process (global picture) of designing a solution as well as the nuances of how the different components in an MIIS environment work together. To create the most efficient solution, you must have a good idea of how to design your implementation to take advantage of those nuances. The pertinent information that will assist you in your solution-development effort is covered in the MIIS Design and Planning Guides.

In this set of documents, you will find detailed discussions of specific challenges that are often encountered during the design of MIIS solutions. These documents present some of the most common design issues that are discussed in newsgroups and in e-mail discussion groups. In each document, you will find the following:

Detailed explanations of particular design challenges.
Possible solutions with best recommendations.
Discussions of the pros and cons of potential solutions.

These challenges, and their proposed solutions, have been discovered and documented through numerous discussions with MIIS deployment experts. Once documented, each solution went through further review cycles conducted by several MIIS deployment experts from both within and outside Microsoft.

While these guides present information about specific design challenges, anyone doing any type of MIIS design work might find them interesting. These solutions can provide insight into design issues not specifically addressed in these documents.

We hope you find these documents useful. If you would like to discuss the content of a document or if you have any questions, feel free to post a message on the MIIS newsgroup on the Microsoft Web site (https://go.microsoft.com/fwlink/?linkid=45219).

About This Guide

MIIS provides a central point of administration for identity data that is distributed across a wide variety of data repositories in an enterprise. To implement a central point of administration, all available identity data parts that belong to the same physical identity have to be aggregated into one logical view. From the technical perspective, the process of aggregating identities consists of two parts – projection and join. Projection initiates the creation of an aggregation point within the system. After the aggregation point is established, other identity data parts that are representations of the same physical identity have to be joined to it.

You can automate the join process by configuring a join-synchronization rule specific to that situation. This rule is based on attributes of the available identity data parts. In many cases, the connected data sources in an organization are deployed independently by different departments. This constellation does not always allow join rules to be defined in a way that guarantees a high success rate of aggregation of identity data parts in an environment.

This document discusses considerations for mapping multiple identities and how you can deploy a Correlation ID to establish strong object relationships in your identity integration solution.

Challenges of Aggregating Identity Data Parts

Most areas in today's computing environments require individualized processing of digital data. Therefore, computer tasks typically run in the context of a digital identity. Computer networks store information about digital identities in the form of identity data. A digital identity is a collection of attributes that make up an electronic representation of who or what an identity is. Examples of digital identities are security principals, such as user objects in Active Directory, and employee records stored in an HR database. The type of information that is stored in a digital identity depends on the context in which it is used. For example, to log on to a directory service, a logon ID is an indispensable piece of information that must be stored with each user object. However, there is no need to store this attribute in a human resources (HR) database.

In general, each connected data source has a specific set of required and optional attributes that are stored with a digital identity. The absence of a global, all-encompassing schema for the storage of digital identities is a challenge for managing them from a central point. To work around this limitation, all available identity data parts that belong to the same physical identity have to be aggregated into one logical view.

In MIIS, you can automate the process of aggregating identity data parts by defining a join-synchronization rule that is based on attributes of the managed identities.

Since digital identities are specific to the connected data sources where they reside, it can be challenging to find attributes that can be used to unambiguously join identity data parts across various connected data sources into one logical view.

In the following sections, you will find more details on what these challenges are and how to address them in your solution design. Before we look at the cross-connected data sources scenario, we should first take a look at a solution used to uniquely identify digital identities within a single connected data source.

Understanding Digital Identifiers

Most attributes that are stored with a digital identity describe real-world characteristics of a physical identity. Examples of real-world characteristics are attributes like first name, last name, or date of birth.

In a collection of digital identities, real-world characteristics can have the same values. Hence, real-world characteristics are typically not a sufficient mechanism to uniquely identify a digital identity. It is therefore important that each digital identity have at least one attribute that is used to uniquely differentiate digital identities from each other. Such an attribute is known as a digital identifier.

The scope of a digital identifier

Digital identities are usually defined within a certain context. For example, each user object within Active Directory is created to control access to resources managed by the directory service on an individual basis. You can use each user account to securely access resources within the forest in which the account is defined, and in all trusted domains. However, an Active Directory user account is meaningless outside of an Active Directory environment.

This scope limitation of digital identities also applies to the associated digital identifiers. For example, an object’s SID is guaranteed to be unique only on a per-domain basis in Active Directory. There is no guarantee that a given SID is unique across all networks. Another example is the Social Security Number (SSN). An SSN is a digital identifier that identifies a person within the context of the US Social Security Administration and the Internal Revenue Service (IRS).

In an identity integration scenario, the context of a digital identity is typically scoped on the basis of the configured connected data sources.

Value control

Digital identifiers are typically under the exclusive control of a connected data source. This means that only the connected data source can set the value. For example, each object in Active Directory is identified by a GUID that is issued by the directory service. This value cannot be changed and is associated with the object throughout its lifecycle. This is the most common case for digital identifiers. However, there are also some digital identifiers that can be set by external processes, for example samAccountName in Active Directory. In such cases, the external process must provide the identifier value, while the system maintaining the identifier must ensure its uniqueness.

Types of digital identifiers

You can find various implementations of digital identifiers in a computer network. It is also possible that a given digital identity has several types of associated digital identifiers in the same context. This is because there is no single “right” format for a digital identifier. Although different types of digital identifiers share the same goal, which is to uniquely identify a digital identity, there are different ways to accomplish this. For example, you can use a counter or an algorithm to calculate unique values for digital identifiers. In addition to uniquely identifying digital identities, some digital identifiers contain supplementary information about the environment. Active Directory is an example of an environment in which different types of digital identifiers are associated with an identity. For example, all security principals, which are objects recognized by the security subsystem, have, among others, the following three digital identifiers:

A GUID
A SID
A samAccountName

GUIDs

A GUID is a 128-bit random number with a predefined storage and presentation structure. An example of a user-friendly string representation of a GUID is as follows:

936DA01F-9ABD-4d9d-80C7-02AF85C822A8

The GUID is an example of a unique identifier that is calculated by a specific algorithm that generates values without requiring any knowledge about already issued identifiers. The algorithm used to calculate a GUID value assures statistical uniqueness. This means that, although it is theoretically possible, it is very unlikely that the same GUID value is calculated twice. A GUID does not contain any additional information about an object. Therefore, there is no need to change its value throughout the lifecycle of an object, even if a characteristic of the object changes.

The benefit of using GUIDs instead of unique counters as digital identifiers is that you do not need to track or back up already issued identifiers in an environment to calculate a new value. An implementation that is based on unique counters requires that all issued values are available in a central location to determine a guaranteed unique value for a new identity.
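The difference between the two approaches can be illustrated with a minimal Python sketch. It uses the standard uuid module purely for illustration; it is not the mechanism Active Directory uses internally, and the helper names are assumptions made for this example.

```python
import itertools
import uuid

# Counter-based identifiers: the issuer must keep (and back up) its state,
# because the next value depends on every value issued so far.
_counter = itertools.count(start=1)

def next_counter_id() -> int:
    return next(_counter)

# GUID-based identifiers: a random (version 4) value is generated without
# consulting previously issued values. Uniqueness is statistical only.
def next_guid() -> uuid.UUID:
    return uuid.uuid4()

# The example string from this guide parses into a 16-byte (128-bit) value.
example = uuid.UUID("936DA01F-9ABD-4d9d-80C7-02AF85C822A8")
print(len(example.bytes) * 8)    # size of the value in bits
print(str(next_guid()).upper())  # a freshly generated GUID
```

The counter only works while its state is available in one place, whereas next_guid can be called on any machine at any time.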

In Active Directory, an object’s GUID is under the exclusive control of the directory service. This means that you cannot specify this value during the creation of an object and you cannot change it throughout the lifecycle of an object.

SIDs

Another example of a unique identifier in Active Directory is the security identifier (SID). A SID is a variable-length data structure used to uniquely identify security principals. The following is an example of the user-friendly string representation of a SID:

S-1-5-21-972711368-3517537916-1357243200-1108

As with GUIDs, each SID value is under the exclusive control of the directory service. The major difference between a GUID and a SID is that each SID contains additional information about the object. Each SID contains information about the issuing authority, which ties the SID to the domain in which it was created.

In the example above, “972711368-3517537916-1357243200” represents the identifier of the issuing authority and “1108” is the relative identifier (RID) of the security principal to which the SID is assigned. You can always determine the issuing authority with a tool or a script by using the SID without the RID as the input parameter.

If you already know the domain identifier of a SID, you can determine the issuing authority of a SID without knowing anything else about the security principal to which a SID is assigned.
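As a rough illustration of this structure, the following Python sketch splits the string form of a domain-issued SID into the issuing domain's identifier and the RID. The function name split_sid is an assumption made for this example, and the sketch only handles SIDs of the shape shown above, where the last component is the RID.

```python
def split_sid(sid: str) -> tuple[str, int]:
    """Split a SID string into (domain identifier, RID)."""
    domain_part, _, rid = sid.rpartition("-")
    if not domain_part.startswith("S-") or not rid.isdigit():
        raise ValueError(f"not a SID string: {sid!r}")
    return domain_part, int(rid)


def same_issuing_domain(sid_a: str, sid_b: str) -> bool:
    """Two SIDs come from the same issuing authority if the domain parts match."""
    return split_sid(sid_a)[0] == split_sid(sid_b)[0]


domain_sid, rid = split_sid("S-1-5-21-972711368-3517537916-1357243200-1108")
print(domain_sid)  # S-1-5-21-972711368-3517537916-1357243200
print(rid)         # 1108
```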

Incorporating additional information into a digital identifier adds a significant constraint to it: the value may have to be changed within an object’s lifecycle. If you move a security principal across a domain boundary in an Active Directory forest, the object’s GUID remains the same. However, the value of the SID will change, because a SID contains the identifier of the issuing domain and a RID that is relative to that domain.

samAccountNames

The third example of a unique identifier in Active Directory is the samAccountName. For user accounts, a samAccountName is a string identifier of up to 20 characters. Unlike the other digital identifiers discussed so far, a samAccountName is not under the exclusive control of the directory service. You can specify the value during the creation of a security principal, and the value is guaranteed to be unique only in the domain in which it is issued. The samAccountName was introduced to cover a key aspect of digital identifiers missing from GUIDs and SIDs: user-friendliness. A value like “936DA01F-9ABD-4d9d-80C7-02AF85C822A8” or “S-1-5-21-972711368-3517537916-1357243200-1108” is hard to memorize and looks like a string of random numbers. While this is not an issue for a computer, it can be an issue when a person interacts with a computer, for example, when a user logs on to a system. In this scenario, having a digital identifier that is user-friendly and that has a value that is easy to memorize can significantly simplify the process of a user interacting with a computer system.

One challenge of using user-friendly digital identifiers is that it becomes cumbersome to automate the calculation of the values. This is because the definition of user-friendliness is subjective. A given value can be easy to memorize for one person and very difficult to memorize for somebody else. This becomes an issue especially if the digital identifier is used in an organization that operates across cultural boundaries.
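To make the difficulty concrete, here is a minimal Python sketch of one possible naming scheme: first initial plus last name, with a numeric suffix when the value is already taken. The scheme, the function name, and the 20-character default are assumptions made for this illustration; they are not a rule defined by Active Directory or MIIS.

```python
def propose_account_name(first: str, last: str, taken: set[str],
                         max_len: int = 20) -> str:
    """Propose a readable, unique account name (illustrative scheme only)."""
    # Keep letters and digits only, lowercase, and respect the length limit.
    base = "".join(ch for ch in (first[:1] + last).lower() if ch.isalnum())
    base = base[:max_len]
    if not base:
        raise ValueError("name contains no usable characters")
    if base not in taken:
        return base
    # On a collision, append the first free numeric suffix.
    n = 2
    while True:
        suffix = str(n)
        candidate = base[: max_len - len(suffix)] + suffix
        if candidate not in taken:
            return candidate
        n += 1


print(propose_account_name("Dan", "Park", set()))      # dpark
print(propose_account_name("Dan", "Park", {"dpark"}))  # dpark2
```

Even this simple scheme shows the trade-off: whether a value like dpark2 is still easy to memorize, or appropriate for names from every culture, is a judgment the algorithm cannot make.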

We have already stated that there is no definition for the “right” format of a digital identifier. Each of the formats discussed has some of the following pros and cons:

If a value contains additional information about an object, it is possible that the value may be changed within the lifecycle of the object. Ideally, a digital identifier should not change throughout an object’s lifecycle.
If a value is user-friendly, it is easy to memorize. However, it is also challenging to build such a value in an automated manner.
A good algorithm to calculate a statistically unique number should not be constrained by input requirements.
A random number is a good technical solution. However, this value is also difficult to memorize.

The right type of digital identifier depends on what best fits your business requirements. You should analyze your business requirements carefully because applying changes to digital identifiers that have been already deployed in an organization can turn out to be a very costly operation.

Considerations for Mapping Multiple Identities

One of the goals of an identity integration scenario is to link the “right” objects together across various connected data sources. In MIIS, the process of mapping multiple identity data parts into one global view of a digital identity is called a join.

Identity data parts can be joined either manually or automatically. An automated join has an associated join-synchronization rule. This synchronization rule is based on one or more attributes of an identity data object. The ideal automated join rule consists of one attribute: a digital identifier. This type of join is the most robust and unambiguous form of a join. However, if the data repositories you are dealing with have been deployed by different departments independently of each other, each of them can have different digital identifiers. In such a situation, it is not possible to base a join on the available digital identifiers. As an alternative, you can use a combination of attributes of identity objects. These attributes, known as overlapping attributes, are distributed attributes containing the same information.

The challenge of using overlapping attributes in a join-synchronization rule is to find attributes that provide a high success rate. A high success rate is characterized by a low number of ambiguous results. Ambiguous results refer to join attempts that can return multiple join candidates. In this case, you need to provide code for a join resolution that enables the system to make the right join decision. The ideal join scenario is based on one attribute and results in zero or one join candidates. Such a scenario is also known as a robust join. A join scenario that is not based on digital identifiers and can return more than one possible join result is also known as a complex join. Scenarios based on overlapping attributes are usually complex joins, whereas joins based on digital identifiers are robust joins. There are, however, exceptions for both cases.

The complexity of a join-synchronization rule is tied to the number of attributes used in it. Your goal should always be to define join-synchronization rules that are as robust as possible. This is partially related to the fact that established joins are not reevaluated. Once a join is established between two objects, it will not be further evaluated in successive synchronization runs.
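The difference between a robust join and a complex join can be sketched in a few lines of Python. This is a conceptual model only: the dictionaries, attribute names, and function names are assumptions made for this illustration and do not represent the MIIS rule configuration or its extension interfaces.

```python
def find_join_candidates(cs_entry, metaverse, attributes):
    """Return metaverse objects whose values match cs_entry on every attribute."""
    candidates = []
    for mv in metaverse:
        if all(cs_entry.get(a) is not None and mv.get(a) == cs_entry.get(a)
               for a in attributes):
            candidates.append(mv)
    return candidates


def classify_join(candidates):
    """Map the number of candidates to the outcome described in the text."""
    if not candidates:
        return "no candidate"
    if len(candidates) == 1:
        return "join"
    return "ambiguous"  # needs join-resolution logic to pick the right object


metaverse = [
    {"employeeID": "1001", "firstName": "John", "lastName": "Smith"},
    {"employeeID": "1002", "firstName": "John", "lastName": "Smith"},
]
entry = {"employeeID": "1002", "firstName": "John", "lastName": "Smith"}

by_id = find_join_candidates(entry, metaverse, ["employeeID"])
by_name = find_join_candidates(entry, metaverse, ["firstName", "lastName"])
print(classify_join(by_id))    # join
print(classify_join(by_name))  # ambiguous
```

A rule on the single identifier yields zero or one candidate, while the rule on overlapping name attributes returns two candidates for the same entry.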

Joins based on digital identifiers

A digital identifier is the most robust attribute for an automated join-synchronization rule. To uniquely distinguish between individual identities, connected data sources must have at least one digital identifier. We have already discussed that these identifiers are usually tied to the data repository. To recap, the object GUID in Active Directory is a unique identifier for all objects managed by the directory service. The employee ID of an HR system is a unique reference for all employees of a company. In the context of the HR system, the Active Directory GUID is meaningless. Even if the connected data sources in a scenario are of the same type, the already available digital identifiers cannot be used for an automated join without additional work. In Active Directory, for example, the reason for this is that GUIDs are scoped on the basis of a forest. If a user has two accounts in two separate forests, the associated user objects have different GUIDs. One option to address this problem is to distribute one of the digital identifiers across the connected data sources of a scenario. You will find more details about this approach in the 'Understanding Correlation IDs' section.

Note

A join based on a digital identifier does not always work. There are exceptions that require special considerations. One example is a digital identity having more than one account in a connected data source. You can find this instance in Active Directory. It is a best practice for personnel with administrative accounts to also have regular user accounts. These user accounts should be used if the current task does not require administrative privileges. Such special cases, however, can be addressed in the design of your solution.

Joins based on overlapping attributes

The majority of automated join-synchronization rules in today’s environments are based on overlapping attributes. Common examples of overlapping attributes are the attributes representing the first name and last name of a digital identity. One of the advantages of overlapping attributes is that this approach does not require any changes to the connected data sources. However, this solution requires an extensive analysis of the available source-identity data to provide satisfying results. There are three considerations you should factor into such an analysis:

The values for the same attribute can be different across multiple connected data sources.
The values for a specific identity can conflict with other objects.
The data in a given connected data source can contain duplicates for a given identity.

In certain cases, it is very likely that even the values for the same physical identity are not the same across various data repositories. For example, it is possible that a user, Dan Park, appears as Dan Park in the HR system and as Daniel Park in Active Directory. If your join-synchronization rule is based on these attributes, it is possible that matching objects cannot be joined because the same attribute value is spelled differently. In our current example, Dan Park and Daniel Park would not result in a matching join pair although both identity data parts represent the same digital identity. Since attributes like first name and last name do not represent digital identifiers, there is usually no consistency check that enforces the same value across various connected data sources. Such a consistency check, however, can be done by MIIS after the right identity data parts have been joined.

Typos or different spellings of a name are not the only issues related to names. Most countries/regions have first name, last name combinations that are very commonly used. An example of this is "John Smith" in the USA. It is very likely, especially in larger environments, to find several digital identities with the same first name, last name combination, which can result in several possible join candidates. In this case, these two attributes are either not sufficient to join the right objects together or they require an extensive conflict-resolution solution.

Another factor for an automated-join process is the degree to which the identity data in the connected data source is joinable. Duplications are also caused by different identity records with the same values. Especially in larger data sources, it is not unlikely to find duplicated records for the same physical identity. In such cases, you are confronted not only with duplicated names assigned to different individuals, but also with duplicate records for the same entity.

As part of your solution design, you should carefully analyze the already available data and determine the likelihood of strong join results. If the results of your analysis indicate that the possibilities of generating strong join results are low, you should consider a different approach to establish link relationships between your identities.
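One way to approach such an analysis is to profile the existing data before committing to a join rule. The following Python sketch counts, for a proposed set of overlapping attributes, how many source records would find exactly one match, several matches, or none. The record layout, the attribute names, and the sample data (other than Dan Park and John Smith, which come from the examples above) are hypothetical.

```python
from collections import Counter


def profile_join_key(source, target, attributes):
    """Count source records with a unique, ambiguous, or missing match in target."""
    def key(record):
        # Normalize case and surrounding whitespace before comparing.
        return tuple((record.get(a) or "").strip().lower() for a in attributes)

    target_counts = Counter(key(r) for r in target)
    outcome = Counter()
    for record in source:
        matches = target_counts[key(record)]
        if matches == 0:
            outcome["unmatched"] += 1
        elif matches == 1:
            outcome["unique"] += 1
        else:
            outcome["ambiguous"] += 1
    return dict(outcome)


hr = [
    {"firstName": "Dan", "lastName": "Park"},
    {"firstName": "John", "lastName": "Smith"},
    {"firstName": "Kim", "lastName": "Akers"},
]
ad = [
    {"firstName": "Daniel", "lastName": "Park"},  # same person, different spelling
    {"firstName": "John", "lastName": "Smith"},   # two different people ...
    {"firstName": "John", "lastName": "Smith"},   # ... or a duplicated record
    {"firstName": "Kim", "lastName": "Akers"},
]

print(profile_join_key(hr, ad, ["firstName", "lastName"]))
```

In this sample, only one of three HR records would join cleanly, which is the kind of result that suggests a different linking approach.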

Understanding Correlation IDs

The ideal join implementation is based on a single unique attribute that can be used to produce reliable join results. If such an attribute does not exist in your environment, you should determine whether introducing a new attribute into your environment can solve this issue. This is when you should consider a Correlation ID.

A Correlation ID is an attribute:

That uniquely links various identity data parts together across different connected data sources
That does not change its value throughout the lifecycle of the identity

Correlation IDs significantly reduce the complexity of your join rules and guarantee a robust join.

The following section discusses the requirements for a Correlation ID and will help you decide whether you should introduce a Correlation ID into your environment.

Requirements for Correlation IDs

A Correlation ID must fulfill the following requirements:

The attribute value must not change throughout an object’s lifecycle.
The attribute value should not contain real-world data.
The attribute value should not be under the exclusive authority of a connected data source.

Correlation ID values must be constant throughout the lifecycle

The first requirement for a Correlation ID is that its value must not change throughout the lifecycle of an identity. In an identity integration scenario, the lifecycle of an identity is defined by your business requirements for a physical identity. An employee in an HR system is a physical identity whose lifecycle is determined by the period for which a record of the employee exists in the HR database.

It is very important that you carefully analyze your business requirements and then decide on the attributes and joins for your solution. For example, a classic identity integration scenario is an HR department using Active Directory, in which the HR system represents the authoritative data source. In this scenario, your synchronization rules may be implemented in a way that a metaverse object is deleted when an object in the HR system is deleted. Before you implement a Correlation ID into your solution, you must consider the following:

Do all objects that are managed by your identity integration solution have a representation in the HR system? - Most HR systems maintain records only for employees. The corporate relationship of a person can change over time. After a person is hired, the person can change roles to be a contractor or vendor, and can at some point be rehired. A person does not necessarily change function in a business environment when the corporate status changes. The question in this scenario is whether only employees are maintained in the HR system. This is true for some corporations, but not true for all of them.
What are the types of identities - The next question is whether all objects managed by your identity integration solution are people. An identity integration scenario can manage different types of real-world identities, from people to printers or conference rooms. In most cases, it is very unlikely that nonhuman identities are maintained in an HR system.
What is the complete set of identities and which is the authoritative data source - If your business requirements dictate that a person whose corporate relationship changed from employee to vendor still requires access to the corporate network, and vendors are not maintained in the HR database, deleting a metaverse object because the record in the HR database has been deleted would break the identity lifecycle. You should determine the complete set of different identities in a scenario before you declare the HR system in an identity integration scenario to be the single authority of the identity data in your environment.
What are the effects when an identity's access to a data source changes - The account requirements of an identity might change even if the employee status does not change. People can change positions in an organization over time and with these changes, the access requirements for connected data sources may change. Access requirements to data source X may be required for position A. They might not be required after a change to position B, but may be required again after the person moves back to position A.

You may ask yourself how these scenarios relate to the first requirement for the Correlation ID, which dictates that the value of the attribute must not change throughout an object’s lifecycle. You might be inclined to pick one of the already available digital identifiers in your system as the Correlation ID. However, you should also keep in mind that the digital identifiers in a connected data source are mostly specific to the context of the data source.

If the HR database does not contain a record for all the identities managed by MIIS, you cannot use employee ID as source for the Correlation ID. If network access to the corporate network is only required for certain functions, you cannot use an object’s Active Directory GUID, because the GUID value is gone forever after an Active Directory object is deleted. When you recreate an Active Directory object for a physical identity, the object is assigned a different GUID. You could also come across an accidental deletion, which would also require the recreation of an account.

Note

The question of whether an already existing digital identifier is a potential candidate for becoming a Correlation ID is tied to whether there is a single connected data source available that hosts a digital identifier for all objects maintained in a system throughout an object’s lifecycle. In an optimally configured solution, the identity lifecycle is closely related to the lifecycle of the metaverse object representing the aggregated view of a physical identity. Note that each metaverse object represents a central point of administration. You need a metaverse object if you have at least one connected data source with identity data that has to be managed by your business rule implementation in MIIS. You should delete the metaverse object if there are no contributing identity data parts anymore.

The value of the Correlation ID should not change throughout an object’s lifecycle, because a change would require that all data sources storing that value must be updated. In a larger environment this can lead to significant operational overhead as the value might have to be updated in several locations.

All additional requirements for a Correlation ID can be directly derived from the first requirement.

Correlation ID values should not contain real-world data

The second requirement for a Correlation ID is that the Correlation ID value should not contain any information that is related to real-world data. An example for such an identifier is an object’s SID. Each SID created for an object contains the identifier of the issuing authority. In Active Directory, the issuing authorities are scoped on the basis of domains. If you need to move an object across a domain boundary, you will automatically also change the object’s SID. Another example of this scenario is the distinguished name (also known as DN) of an object. The distinguished name of an object is tied to its location. This example has the same constraints as the object’s SID – if you move an object to a different container, the digital identifier changes. There is usually no attribute in an identity integration scenario with a real-world relationship that does not change throughout the lifecycle of the maintained objects.

If real-world data is incorporated into a Correlation ID and that data changes, you also need to update the Correlation ID.

Correlation ID values should not be under the exclusive authority of a connected data source

The third requirement for a Correlation ID is that its value should not be under the exclusive control of a connected data source. Such an attribute should allow write access if necessary. We have already discussed the requirements for recreating accounts in Active Directory. The Active Directory GUID is under the exclusive control of the directory service, which means that only the directory service controls its value.

Note

If it is necessary to update the value of a Correlation ID, you have to programmatically set this value.
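Taken together, the three requirements can be sketched in Python as follows. The sketch assumes an opaque random value generated once and written to an ordinary writable attribute in every connected data source. The attribute name correlationId, the dictionaries, and both function names are hypothetical; in an MIIS deployment this logic would live in your synchronization rules.

```python
import uuid


def ensure_correlation_id(mv_object: dict) -> str:
    """Assign an opaque Correlation ID once and never regenerate it."""
    if not mv_object.get("correlationId"):
        # Random value: contains no real-world data that could change later.
        mv_object["correlationId"] = str(uuid.uuid4())
    return mv_object["correlationId"]


def stamp_connected_object(cs_object: dict, correlation_id: str) -> None:
    """Write the Correlation ID to a writable attribute of a connected object."""
    existing = cs_object.get("correlationId")
    if existing and existing != correlation_id:
        raise ValueError("object already carries a different Correlation ID")
    cs_object["correlationId"] = correlation_id


person = {"displayName": "Dan Park"}
cid = ensure_correlation_id(person)

hr_record = {"employeeID": "1001"}
ad_account = {"objectGUID": str(uuid.uuid4())}
stamp_connected_object(hr_record, cid)
stamp_connected_object(ad_account, cid)

# A recreated account receives a new GUID from the directory service,
# but the same Correlation ID can be written to it again.
recreated_account = {"objectGUID": str(uuid.uuid4())}
stamp_connected_object(recreated_account, cid)
print(recreated_account["correlationId"] == hr_record["correlationId"])  # True
```

Because the value is set by your own process rather than issued by a data source, it survives the deletion and recreation of any single connected object.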

Other Considerations for Correlation IDs

In addition to these requirements, there are other considerations that may be part of your decision-making process. The most common Correlation ID-related considerations are:

User-friendliness of the value
Value recreation through a repeatable process

User-friendliness of Correlation IDs

User-friendliness is a common talking point in conjunction with digital identifiers.

While random numbers are a perfect format for computers, they are not easy to memorize for people. In the previous section, we have already discussed samAccountName as an example of a user-friendly digital identifier. One problem with user-friendliness is that the definition of user-friendliness is very subjective. It is difficult to guess when a value is really user-friendly. The other dilemma is that the Correlation ID has to be unique and should not contain any relationship to real-world data. It is presumably pretty hard to find an algorithm that satisfies all these requirements.

The primary question that you must answer based on your business requirements in the context of a Correlation ID is whether the value has to be user-friendly at all. As stated before, the most common argument for user-friendliness is retention. The Correlation ID does not necessarily have to be exposed to users. It is possible to maintain the value only as a link mechanism used by computer systems. In this case, the question of user-friendliness is moot.

Due to the subjectivity of this problem, automated algorithms generally have flaws in the sense that they can only provide user-friendly values under certain circumstances. For example, in some countries/regions it is common to find relatively simple formats for first names. However, this is not true for all countries/regions. Names that are common in some Asian countries/regions may be hard to memorize for Europeans or Americans. User-friendliness is connected to two aspects:

Whether a value is exposed to a user
How often will an exposed value be used

If a Correlation ID value is used only once in a while, it is not really necessary for that value to be user-friendly. In this case, it might be acceptable to look up the value in a personal notebook. However, a value like the user logon ID certainly has to be user-friendly because it is used frequently. We have already discussed the fact that even a single connected data source like Active Directory has several digital identifiers with different formats, because one format cannot cover all the different requirements for a digital identifier.

The following two considerations are tightly related to each other:

Should it be possible to create an identifier by a repeatable process?
Where should the Correlation ID be stored?

The following sections discuss these considerations in greater detail.

Value recreation through a repeatable process

As discussed earlier, a GUID is an example of a value that cannot be recreated if it is lost or deleted. A repeatable process is related to the idea of having a simple backup mechanism: if the attribute value is lost, it can be recreated by running the process again. For a repeatable process, the input to the formula that generates a Correlation ID must contain real-world identity data such as first name, last name, or date of birth, which can hardly be lost.

However, creating a value through a repeatable process has several significant flaws. You may not have a single connected data source that has the information for all identities maintained in a scenario. If an attribute that you use for generating a Correlation ID has multiple sources, you need to implement another property or attribute to track the authoritative source repository for a given identity. Since this information must also be stored and backed up, the fact that you use a repeatable process to generate the value is not really useful in a backup scenario. If the source identifiers come from different source repositories, it is also likely that the repositories do not share the same attribute set. In this case, you also need to store, somewhere else, the specification of the source data that was used as input for the calculation of a given value.

Note

In general, you should consider Correlation IDs generated by a repeatable process only if you have a single source data repository for the generation of the value. However, a repeatable process is not a requirement: a good backup solution for your Correlation IDs addresses the same problem that this type of identifier is meant to solve.

In addition, you must determine whether the process needs at least one input attribute that uniquely identifies the physical identity in order to work. Since these attributes typically contain sensitive information, as in the case of a social security number (SSN), it is unlikely that you will get access to this kind of information to calculate a Correlation ID.

The following discussion about the storage location of the Correlation ID provides a solution that handles the scenario where a value is generated with a repeatable process.

Maintaining Correlation IDs

After addressing the basic requirements for building a Correlation ID, you have to determine a proper storage location for the value. The decision for the storage location is somewhat of an exception to the other requirements and considerations, because it is not exclusively driven by business requirements. The storage location of an identifier also depends on the technical capabilities of your environment. There are basically three approaches for storing Correlation IDs: centralized, decentralized, or a combination of both.

Centralized Storage

The centralized approach for storing Correlation ID values does not require an extension of the schema in the connected data source. It is usually implemented on the basis of one central connected data source that retains all digital identifiers that a digital identity has in the various connected data sources. In this approach, the necessary join parameters for the identity data parts are provided by the central digital identity repository. Technically, this approach also requires the availability of one central department in your organization that is in charge of defining all identity object relationships. From this central identity data repository, you can analyze and update all the different aspects of your identity integration scenario.
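The central repository described above can be pictured as a record per Correlation ID that lists the identifier the identity has in each connected data source. The following Python sketch is only an illustration of that data structure, not MIIS code; the class names, method names, and sample identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class CorrelationRecord:
    """One central record: a Correlation ID plus the identifiers it links."""
    correlation_id: str
    # Maps a connected data source name to the identifier used in that source.
    identifiers: Dict[str, str] = field(default_factory=dict)


class CentralRepository:
    """Hypothetical central store that provides the join parameters."""

    def __init__(self) -> None:
        self._records: Dict[str, CorrelationRecord] = {}
        self._index: Dict[Tuple[str, str], str] = {}

    def register(self, correlation_id: str, source: str, local_id: str) -> None:
        key = (source, local_id)
        owner = self._index.get(key)
        if owner is not None and owner != correlation_id:
            raise ValueError(f"{source}:{local_id} is already linked to {owner}")
        record = self._records.setdefault(
            correlation_id, CorrelationRecord(correlation_id)
        )
        # Drop the index entry for an identifier that is being replaced.
        previous = record.identifiers.get(source)
        if previous is not None:
            self._index.pop((source, previous), None)
        record.identifiers[source] = local_id
        self._index[key] = correlation_id

    def find(self, source: str, local_id: str) -> Optional[str]:
        """Return the Correlation ID for a source-local identifier, if known."""
        return self._index.get((source, local_id))


repo = CentralRepository()
repo.register("c-1", "HR", "E12345")
repo.register("c-1", "AD", "jsmith")
```

Because every relationship lives in one place, a lookup such as repo.find("AD", "jsmith") does not require any change to the schema of the connected data sources.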

Decentralized Storage

In the decentralized approach, the Correlation ID values are stored with all connected data sources. This approach requires that each connected data source is extensible, and not all connected data sources have an extensible schema. For example, you cannot extend the directory schema of Windows NT with additional attributes. Even if a connected data source is extensible, this implementation adds some effort to the deployment phase of your solution. After the data source schema has been extended, the Correlation ID has to be deployed to the objects in the environment. There are two major benefits in a decentralized solution:

You do not need an extensive backup solution. As long as at least one object in your identity data network carries the Correlation ID value, it can be distributed to the connected objects.
You can get information about the solution state by looking at an object’s attributes in the connected data source. If an object has a value assigned, it has been mapped by the correlation process.
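The first benefit, restoring a lost value from any object that still carries it, can be sketched as follows. This Python example is a simplified illustration under stated assumptions: the objects are plain dictionaries that are already known to represent the same identity, and the attribute name correlationId and the function name are hypothetical.

```python
from typing import Dict, Optional


def redistribute_correlation_id(
    linked_objects: Dict[str, Dict[str, Optional[str]]],
    attribute: str = "correlationId",
) -> Optional[str]:
    """Copy the Correlation ID from any object that still has it to the others.

    Returns the restored value, or None if no object carries a value or if
    the objects carry contradictory values (which needs manual review).
    """
    surviving = {obj.get(attribute) for obj in linked_objects.values()}
    surviving.discard(None)
    if len(surviving) != 1:
        return None
    value = surviving.pop()
    for obj in linked_objects.values():
        obj[attribute] = value
    return value


# Keys are connected data source names; only the HR object kept its value.
linked = {
    "HR": {"correlationId": "c-42"},
    "AD": {"correlationId": None},
    "Mail": {},
}
restored = redistribute_correlation_id(linked)
```

The same check also illustrates the second benefit: an object without a value in this attribute has not yet been mapped by the correlation process.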

The benefits of the decentralized approach depend on your personal preferences. It might also be desirable to look up the meaning of an identity from one central point instead of collecting it from the various connected data sources. In this scenario, the central storage of a Correlation ID and a related Correlation record might be the better solution. In a centralized solution, you also do not need to worry about the backup policy of the individual connected data sources. The business requirements can be rebuilt based on the information stored in the central location.

Combination Storage

Depending on your individual needs, you can also deploy a combination of both approaches. In our discussion of the decentralized approach, we have already outlined the technical limitations of that approach in the case of nonextensible data source schemas. Besides technical limitations, there are also political boundaries. The owners of a connected data source might be very reluctant to allow an extension of the local data repository schema and to introduce a related attribute value that is owned by a remote system. However, this might not be true for all owners of identity data repositories in your environment.

Calculating Correlation IDs

The previous sections introduced the general requirements for a Correlation ID. This section discusses possible implementation options.

There are four main strategies for creating unique values:

Using an existing digital identifier
Using existing identity data
Using an algorithm that independently generates statistically unique values
Using a counter in a database that guarantees uniqueness by calculating a new value after comparing existing values

Each of these strategies has associated pros and cons which are discussed in the following sections.

Using an existing digital identifier to generate Correlation IDs

The simplest implementation of a Correlation ID is based on using an existing identifier. However, this requires a connected data source that has a record of all the identities managed in your identity integration scenario. If you think that such a connected data source, for example an HR database, exists in your environment, make sure that your assessment is really accurate. As explained earlier in this document, the employee status of a person can change while the person's function within the organization remains the same. It is also possible that a data repository such as an HR system does not have all the required information. Also, determine whether your solution has to manage inanimate objects that are not recorded in the HR database, such as printers and conference rooms.

Using existing identity data to generate Correlation IDs

Another approach to building a Correlation ID is to calculate a value that is based on data from the original object. This solution might be useful in cases where you have a single authoritative data source with a digital identifier that is security sensitive. For example, the employee ID might cover your requirements for a Correlation ID, but you might be concerned that exposing the employee ID introduces a security risk. To counter this, you can generate a hash value that is based on the employee ID.
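The hashing idea can be sketched as follows. This is a minimal illustration in Python rather than MIIS rules-extension code. The function name, the normalization step, the sample employee ID, and the choice of a keyed SHA-256 hash (HMAC) instead of a plain hash are all assumptions made for the example.

```python
import hashlib
import hmac


def correlation_id_from_employee_id(employee_id: str, secret_key: bytes) -> str:
    """Derive a repeatable Correlation ID from an employee ID (hypothetical helper).

    A keyed hash is used so that a short, guessable employee ID is hard to
    recover from the Correlation ID without knowing the key.
    """
    # Normalize the input so that the same ID always yields the same value.
    normalized = employee_id.strip().upper().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Assumption: the key is stored securely, outside the connected data sources.
key = b"example-key-kept-outside-the-directory"
first = correlation_id_from_employee_id("E12345", key)
second = correlation_id_from_employee_id(" e12345 ", key)
```

Because the calculation is deterministic, first and second are identical, which is what makes the value recreatable. Note that the key now becomes something that has to be backed up as well.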

The benefit of calculating a Correlation ID based on information from the source object is that you can always recreate the identifier if the value is lost and a backup is not available. However, this method is constrained by the same limitations outlined in the previous section. It works only if you already have an identifier that can be used, or if an authoritative object for all identities of your scenario is available in one authoritative data source. It might be challenging to find a common set of input data that can be used to calculate such a value. In this scenario, one challenge could be to find a set of attributes that works for people and inanimate objects alike. Also remember that one of the requirements for a Correlation ID is to avoid any kind of real-world data in the value.

Note

Business requirements can change very quickly in today’s economy. Even if you have a single authoritative connected data source with all identities today, the next acquisition can change this. You should carefully analyze whether introducing environmental dependencies into a Correlation ID represents a good long-term solution for your environment.

Using an algorithm to generate Correlation IDs

There is no technical requirement that a Correlation ID be based on data of the correlated identities. It is also possible to deploy a Correlation ID that has no direct relationship to the digital identity the value is assigned to. Because today’s organizational networks consist of highly distributed data islands, a complete set of information is typically not available in a single location. One approach to address this issue is to calculate values with an algorithm that assures statistical uniqueness. For example, GUIDs, the digital identifiers in Active Directory, are generated by such an algorithm. GUIDs and the method for creating them can also be used to calculate unique Correlation IDs.

A GUID does not have any built-in relationship to the object it is associated with. Its value is unique and as such there is no need to change it throughout an object’s lifecycle. Generating GUIDs is very simple. The Microsoft .NET Framework provides a convenient method (“System.Guid.NewGuid”) to calculate a GUID value.
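As a minimal sketch of the same idea outside the .NET Framework, the Python standard library offers uuid.uuid4(), which plays a role comparable to System.Guid.NewGuid. The function name new_correlation_id is hypothetical and only wraps that call.

```python
import uuid


def new_correlation_id() -> str:
    """Return a randomly generated (version 4) GUID in canonical string form."""
    return str(uuid.uuid4())


# Two independently generated values; no coordination between callers is needed.
a = new_correlation_id()
b = new_correlation_id()
```

The values carry no information about the object they are assigned to, so they have to be stored and backed up rather than recalculated.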

Like every solution discussed in this document, the GUID-based Correlation ID also comes with some considerations. GUID values are neither user-friendly nor intuitive. Since they do not have any relationship to the associated data record, they require a solid backup strategy, because the value cannot be recreated in case of data loss. However, as stated earlier, these considerations are not necessarily disadvantages.

Note

If a GUID is used as a Correlation ID, the attribute that holds the value should also allow write access, in case it becomes necessary to update the value. In other words, the GUID value should not be under the exclusive authority of the process that provides the value.

Using a counter to generate Correlation IDs

The last implementation strategy for Correlation IDs discussed in this document is a simple counter from a database table column that has been defined as “Identity”. In this case, each new row in the table produces an auto-generated unique value that can be used as a Correlation ID for the managed identities. This solution provides guaranteed uniqueness since all counters are maintained in one database. However, it adds complexity to the scenario, because the database table requires maintenance, which increases the operational costs.
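The counter approach can be sketched with any database that offers an auto-incrementing column. The following Python example uses an in-memory SQLite database as a stand-in for a table with an Identity column; the table name, column names, and helper function are assumptions made for the illustration.

```python
import sqlite3

# In-memory database used only for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE correlation ("
    " correlation_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " display_name TEXT NOT NULL)"
)


def allocate_correlation_id(display_name: str) -> int:
    """Insert a row and return the counter value the database assigned to it."""
    cursor = conn.execute(
        "INSERT INTO correlation (display_name) VALUES (?)", (display_name,)
    )
    conn.commit()
    return cursor.lastrowid


# People and inanimate objects draw from the same counter.
id_user = allocate_correlation_id("Example User")
id_printer = allocate_correlation_id("Printer, Building 1")
```

Uniqueness is guaranteed here only because every value is allocated by the same table, which is also why this table becomes a component that has to be maintained and backed up.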

Understanding Conflict Management

Whenever there is a requirement for uniqueness, conflict management becomes a concern. In the previous section, we discussed requirements for Correlation IDs in an identity integration solution. In this section, we discuss the uniqueness requirements of connected data source objects and how to address them from within your solution.

Uniqueness requirements of connected data sources become a concern if your solution is in charge of specifying attribute values that have to be unique in a connected data source. This is required in scenarios where a digital identifier is not under the exclusive control of a connected data source. For example, you can specify the samAccountName of a security principal during the creation of the object.

It is not always possible or desired to solve conflicting situations from within MIIS. In situations where a conflict is related to an attribute that has to be user-friendly, it is possible that user interaction is required to resolve the conflicting situation. It can be challenging to determine a user-friendly value with an algorithm because the determination of what “user-friendliness” means is subjective.

A complete conflict management solution consists of three components:

Conflict Detection
Conflict Mitigation
Conflict Resolution

Conflict Detection

Detecting a conflict in an identity integration scenario is not always straightforward. Each connector space object stores different views of the related real-world objects in the connected data source. Since there is an indefinite number of possible connected data source attributes available, the attributes of a given connected data source object that are staged on a connector space object are encapsulated in a single binary attribute. Because of this architecture, the connector space is not directly searchable: a search operation would have to evaluate the values stored in this binary attribute on each connector space object.

For example, if you provision a new user object to a connector space and you want to avoid a conflict of the samAccountName attribute with objects that are already staged in the target connector space, you have to check each staged object to determine whether it has this attribute. If it does, compare the object's value with the value of the provisioned object. The only conflict you can directly detect during provisioning is a distinguished name conflict. If you attempt to provision an object with a distinguished name that has already been assigned to an object in the connected data source, an exception is raised.
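The object-by-object comparison described above amounts to a linear scan. The following Python sketch shows only the logic. Plain dictionaries stand in for staged connector space objects, the function name is hypothetical, and the case-insensitive comparison is an assumption about how the attribute should be compared. It does not use the MIIS object model.

```python
from typing import Iterable, Mapping, Optional


def find_conflict(
    staged_objects: Iterable[Mapping[str, str]],
    attribute: str,
    candidate_value: str,
) -> Optional[Mapping[str, str]]:
    """Return the first staged object whose attribute equals the candidate value."""
    wanted = candidate_value.lower()
    for obj in staged_objects:
        # Step 1: determine whether the attribute exists on this object.
        if attribute not in obj:
            continue
        # Step 2: compare the staged value with the value to be provisioned.
        if obj[attribute].lower() == wanted:
            return obj
    return None


staged = [
    {"dn": "CN=Printer 1,OU=Devices", "location": "Building 1"},
    {"dn": "CN=John Smith,OU=Users", "samAccountName": "JSmith"},
]
conflict = find_conflict(staged, "samAccountName", "jsmith")
```

Every staged object has to be visited, which is why this kind of check becomes expensive as the connector space grows.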

Another approach to detecting conflicts is to issue a search request to a connected data source. However, this technique comes with some significant caveats. The search result is only valid at the time the search request was sent. Search requests to a connected data source are typically sent during provisioning, and the goal of such a search is to create an object in the connected data source that has no conflicting attribute values. However, there is a time gap between provisioning and the export run. An attribute value that was not in conflict during provisioning can still collide with an object in the connected data source when the object is exported.

One important design goal of MIIS is to enable the synchronization process to always function correctly with the latest information received from a connected data source. In other words, the synchronization process should be independent of the current status of the data stored in a connected data source. Each callout to services outside MIIS during provisioning violates this design goal. If your synchronization logic depends on information requested from a connected data source, a synchronization cycle cannot be completed in case of network problems or other issues with the remote computer to which a request was sent. You should only use this method if these considerations do not represent a significant blocking issue for your solution.

The best method for detecting conflicts in MIIS is the metaverse search. The metaverse search is an integral feature of MIIS and operates independently of connected data sources. However, this requires that the majority of the available identity information is processed towards the metaverse. If you are blocking a significant amount of identity data in the connector space by filtering it with a connector filter, you are hiding important data from the synchronization process. It is recommended that you keep the amount of identity data that has not been processed towards the metaverse to a minimum.

Conflict Mitigation

It is not always possible to immediately resolve a conflict situation. For example, you might detect that a samAccountName conflict results because the identity object that you are trying to provision already exists in a target connector space. In this scenario, it is not necessary to calculate nonconflicting attribute values, but it is necessary to make sure that the object in the target connector space will later be joined during the identity integration process. Another situation to consider is conflicts in the context of user-friendly attributes. As discussed before, user-friendliness of an attribute value is subjective, and it might not always be possible to calculate such a value with a software algorithm. Depending on your user-friendliness requirements, it might even be necessary to resolve such a conflict through manual user interaction.

In such a scenario, it is important to provide at least one appropriate conflict-mitigation strategy. The simplest implementation of such a strategy is a flag that indicates a conflict situation and that, when set, prevents further processing of the object. It might also be necessary to implement a communication solution to inform administrative personnel about the occurrence of a conflict. In your solution design, you should always analyze each possible conflict scenario and determine if a direct conflict resolution is possible and necessary.
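The flag-based mitigation strategy can be sketched as follows. This Python example is only a simplified illustration: the attribute names conflictPending and conflictReason, the function names, and the list that stands in for a notification channel are all hypothetical.

```python
from typing import Dict, List


def mitigate_conflict(obj: Dict[str, object], reason: str,
                      notifications: List[str]) -> None:
    """Flag an object as conflicting and queue a message for administrators."""
    obj["conflictPending"] = True
    obj["conflictReason"] = reason
    notifications.append(f"Conflict on {obj.get('displayName', '?')}: {reason}")


def should_process(obj: Dict[str, object]) -> bool:
    """Objects with a pending conflict are skipped by further processing."""
    return not obj.get("conflictPending", False)


pending_messages: List[str] = []
user = {"displayName": "John Smith", "samAccountName": "jsmith"}
mitigate_conflict(user, "samAccountName already in use", pending_messages)
```

Clearing the flag after the conflict has been resolved, manually or automatically, lets the object take part in the next processing cycle again.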

Conflict Resolution

The last step in the conflict management process is conflict resolution. The conflict resolution analysis needs to determine where a conflict has to be handled and whether the resolution process requires manual interaction. It is not always possible to resolve conflicts from within MIIS. For example, if you are synchronizing objects that include reference values with a connected data source that does not provide referential integrity, a reference to a nonexistent object can cause a situation that cannot be fixed from within MIIS. MIIS creates a representation for each object reference in the connected data source. If the referenced object has not been imported from the connected data source, a placeholder is created in the connector space. Determining why an object reference is represented in the form of a placeholder, and whether the placeholder is the result of a reference to a nonexistent object, requires contacting the related connected data source. This is the reason why, by design, it is not possible to provision an object "over" an existing placeholder. In this scenario, the system does not have enough information to determine whether such an operation is appropriate. The placeholder could, for example, be the result of an import operation that was left incomplete by a network error.

If a placeholder was created due to a reference to a nonexistent object, this conflict has to be resolved in the connected data source by deleting the incorrect reference. If a conflict can be resolved from within MIIS, you need to determine whether it can be done with an automated routine. For example, a very simple method for resolving samAccountName conflicts is to add an incremental suffix to the name, such as name001, name002, and so on, in a loop, and to search for each candidate until the name no longer conflicts with other objects. If a conflict cannot be resolved automatically, you need to implement an appropriate solution to fix this issue. One possibility is to provision these objects into the connector space of a management agent that was created to store objects requiring manual conflict resolution. The personnel in charge of manually fixing such a situation can update the attribute values in the connected data source and feed the updated information back into the system.
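The incremental-suffix method can be sketched as a short loop. In this Python illustration, the is_taken callback stands in for whatever search your solution uses to detect a conflict (for example, a metaverse search). The function name, the three-digit suffix format, and the attempt limit are assumptions made for the example.

```python
from typing import Callable


def resolve_name_conflict(
    base_name: str,
    is_taken: Callable[[str], bool],
    max_attempts: int = 999,
) -> str:
    """Return base_name, or base_name plus the first free three-digit suffix."""
    if not is_taken(base_name):
        return base_name
    for counter in range(1, max_attempts + 1):
        candidate = f"{base_name}{counter:03d}"  # name001, name002, ...
        if not is_taken(candidate):
            return candidate
    # Give up and hand the object over to manual conflict resolution.
    raise RuntimeError(f"No free name found for {base_name!r}")


existing_names = {"jsmith", "jsmith001"}
resolved = resolve_name_conflict("jsmith", lambda name: name in existing_names)
```

The attempt limit keeps the loop from running indefinitely and gives the solution a defined point at which to fall back to manual resolution.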

Summary

MIIS 2003 provides a very powerful and flexible way to automate the correlation of digital identities across various connected data sources in the form of a join-synchronization rule. This rule is based on the attributes of managed identity data parts in a scenario.

There are two attribute types you can use in join-synchronization rules:

Digital identifiers
Overlapping attributes

Joining objects based on digital identifiers is the most robust and unambiguous solution. There is no need to define a conflict resolution since a join attempt that is based on digital identifiers cannot result in multiple join candidates. However, in most cases, this solution requires the deployment of a Correlation ID.

Joins that are based on overlapping attributes can be implemented on the basis of the available attributes. However, it can be challenging in a given scenario to find attributes that provide a high success rate.

Although a join based on digital identifiers may seem to require more work than a solution based on overlapping attributes, deploying a Correlation ID not only provides a robust way to join identity data parts, but also increases the overall robustness of your identity integration solution.
