Data Warehouse Import Best Practices

The following import best practices apply to the Data Warehouse.

Always run the Data Transformation Services (DTS) tasks in sequence so fact data is imported first, and then dimension data.

Always run the DTS import tasks in the following order:

  1. Web server log import (fact data)

  2. Transaction data import (fact data)

  3. Product catalog data import (dimension data)

  4. Profile data import (dimension data)

  5. Campaign data import (dimension data)

For example, if you are importing only new data for campaign reports, you would run the Web server log import DTS task first, and then run the Campaign data import task.

Never run two DTS tasks (or multiple instances of one task) at the same time against the same Data Warehouse

If two DTS tasks run at the same time, the data in the Data Warehouse will be corrupted.

You must run the Configuration synchronization DTS task and the Report preparation DTS task, regardless of which other import DTS tasks you decide to run.

For example, if you plan to import only profile data so that you can run reports about users, you must run the Configuration synchronization task, the Profile data import task, and the Report preparation task. You do not have to run the other DTS tasks.

If an import DTS task fails, or if you cancel an import DTS task that is in process, run the Data deletion DTS task.

You must run the Data deletion DTS task to delete the data that was successfully imported, and to return the aggregation class tables to their pre-import state. If you do not perform this task, the aggregations may be incorrect.

Do not run the Web server log import DTS task on a Web server that is in production. Run this DTS task on a separate computer dedicated for this purpose.

The Web server log import DTS task causes much activity for the disk, the computer that is running SQL Server, the CPU, and possibly the network. To minimize the effect on other resources, run the Web server log import DTS task on a computer that is not running SQL Server. Running the import task on the computer that is running SQL Server will affect the performance of other tasks on the computer.

To optimize the Web server log import process, set the StoreLogUser environment variable to F (false).

Set the StoreLogUser variable to F (false) if you do not have to run the IP resolution DTS task, and you do not create any custom reports that use data in the LogUser table.

By setting the StoreLogUser variable to F (false), you can improve processing speed by as much as 50 percent. The LogUser table in the Data Warehouse is not populated during the import process.

To set the StoreLogUser environment variable

  1. On the Data Warehouse server, right-click My Computer, and then click Properties.

  2. In the System Properties dialog box, on the Advanced tab, click Environment Variables.

  3. In the Environment Variables dialog box, in the System variable section, click New.

  4. In the New System Variable dialog box, in the Variable Name box, type StoreLogUser, and then in the Variable Value box type F.

  5. Click OK three times.

If you are running the Web server log import DTS task using a script, you must specify a 3 GB switch in the Windows Boot.ini file on the computer that is hosting the Data Warehouse (SQL Server) and the computer that is hosting Analysis Server.

To enable the /3GB switch

  1. Modify the Boot.ini file. For more information about the 3 GB switch, see Windows Server Help.

  2. Modify the executable running the log import DLL. Depending on how you import data into the Data Warehouse, you must modify the appropriate executable file.

Batch one day of your log files before you import them.

If you batch your log files daily and then import them, the import process will take less time than if you import your log files every hour.

Perform a Web log import just before you run the Report preparation DTS task.

For example, if you import your Web log files hourly, import only a few log files at a time. As the size of your Web log files increase, the performance as measured by gigabyte per hour will decrease.

After you import or delete data from the Data Warehouse, run the Report preparation DTS task.

You use the Report preparation DTS task to populate the online analytical processing (OLAP) cubes with the most recent site and user data that is stored in the Data Warehouse. Commerce Server 2009 uses OLAP cubes to organize summarized data that is stored in the Data Warehouse. This lets reports that are run from the Analysis modules to complete more quickly than they would without this additional step.

If multiple sites are sharing a Data Warehouse, run the Report preparation DTS task one time to process data for all the sites.

When you run the Report preparation DTS task, it processes all the sites in a Data Warehouse. Do not run the task separately for each site. It will affect the performance of the Data Warehouse.

To prevent error messages when you run the Report preparation DTS task, disable the processing of objects for which there will be no data.

The Report preparation DTS task may report errors when it processes OLAP dimensions and cubes if the information that must populate those objects has not been imported into the Data Warehouse. The error message is displayed in the following format:

     Processing error [Source data contains no rows] <object name>

To disable the processing of Analysis Services objects, set the ProcessingEnabled flag to 0 in the CubeProcInfo and DimProcInfo tables for the cubes and dimensions you do not want to process.

If you have a Web farm, collect Web server log files on a local computer before importing them into the Data Warehouse.

By collecting the files on the local computer that performs the Web server log import DTS task, you reduce the traffic on your network. We recommend that you implement the following procedures:

  1. Set up each Web server so that it compresses its own Web log files.

  2. Write a Microsoft Visual Basic Scripting Edition (VBScript) script to map drives, copy the compressed files to a central location, and then uncompress the files.

  3. Write a second VBScript that cleans the Web log files.

  4. Create a DTS "execute process" task that runs the first VBScript to copy and uncompress the log files. The task should fail if any one of the log files is unavailable or uncompressed incorrectly.

  5. Create a second DTS task that executes the VBScript to clean the uncompressed log files.

  6. Run the Configuration synchronization DTS task and then the Web server log import DTS task.

If your deployment includes multiple Web servers, you can improve performance of the import process by tuning the environment variable MaxHitPerBuffer.

The MaxHitPerBuffer variable controls some of the buffer size in the multilog-datasource parser. The default for MaxHitPerBuffer is 6. You can adjust this setting to improve performance, depending on the number of servers in your deployment and the traffic on your site.

For example, if you have 120 Web servers and you receive many unique visitors to your site, it may be optimal to set MaxHitPerBufferto 5. Any more would consume too much memory. If you have five Web servers, you may be able to set MaxHitPerBuffer as high as 10.

We do not recommend that you set MaxHitPerBuffer to a number more than 10 as it may use more than 1 GB of physical memory. If you do set MaxHitPerBuffer to more than 10, you should also enable the /3GB switch in the Boot.ini file, and use the 3 GB enabled executables to do log import.

To determine the optimal value for MaxHitPerBufferin your scenario, monitor the memory usage (private bytes and virtual bytes).

To perform a Web server log import from a Network Load Balancing Cluster, set up the server topology that will describe your cluster configuration to Commerce Server Core Systems, and then run only one Web server log import DTS task.

In this scenario, you want log files from http://<site> to appear in only one report in Commerce Server 2009. However, the Web server log files are in two or more locations, for example, \\server1\d$\iislogfiles\w3svc4 and \\server2\d$\iislogfiles\w3svc4. To set up the server topology, you can name the log files the same in each location as long as they are in the correct folder that corresponds to the Web server.

After you set up your non-Commerce Server 2009 application and Web servers, configure the properties in the Web server log import DTS task so each newly created Web server points to the log files. The log files should be in separate folders, one for each server. (You can also do this when you create the servers.) For instructions, see How to Run the Web Server Log Import DTS Task.

If you deploy a new or updated catalog, campaign, or profile definition to your Web site, and then delete it the same day, the operation may affect the completeness of data imported into the Data Warehouse.

For example, you add the product "Green Book" to your catalog. This product obtains the product ID of "X" in the catalog runtime database. Then you publish the updated catalog to your Web site. A user shopping on your Web site adds one "Green Book" to the basket. Therefore, the Commerce Server 2009 event, Add to basket**,** is logged to the IIS Web log file. One of the values in the Commerce Server 2009 event is a product ID X. The product ID references a row in a table in the catalog runtime database.

Typically, the Data Warehouse import process resolves this product ID to a full description, the Commerce Server 2009 event that had product ID "X" resolves as a foreign key to a catalog table. However, if you remove the item from the catalog database before the Data Warehouse imports it, the Web server log import task cannot resolve product ID "X" in the Web log file.

This kind of "child data" in the Data Warehouse database that is separated from the "parent data" required to give it meaning is known as "orphan data." For information about how to delete orphan data, see Deleting Orphan Data from the Data Warehouse.

To improve profile data import speed, configure the start date and query interval values.

The Data Warehouse wizard uses the default start date, today's date, and the default query interval of 15 minutes. If you import profile data that uses these default values, the Data Warehouse will try to read the data from today's date 00:00 – 00:15 and then today's date 00:15 – 00:30 and so on. Change the start date and query interval to values appropriate to the data that you are importing.

Verify the Application log in the Event Viewer after importing data.

Check the Application log in the Event Viewer after importing data into the Data Warehouse. You use the Event Viewer, Application log to confirm that an import completed successfully, or to view errors that occurred during the import process.

See Also

Other Resources

Best Practices for the Data Warehouse