Inside Microsoft.com: Analyze Web Stats with Log Parser

Article
09/08/2016

Inside Microsoft.comAnalyze Web Stats with Log Parser

Jeffrey A. Johnson

Download the code for this article: InsideMSCOM2006_08.exe (151KB)

When it comes to mining data there are few tools more versatile than Log Parser. First released as part of the IIS Resource Kit, it quickly found its way into many IT pros’ toolboxes. Log parser’s main strength is that it provides universal query access to text-based data such as log files, XML files, and CSV files, as well as a number of data sources in Microsoft® Windows® including the Event Log, the Registry, the file system, and Active Directory®. Using Log Parser you can execute queries to mine data that you wouldn’t easily be able to manipulate otherwise. In this column I will explain how you issue queries using Log Parser, how it outputs data, and finally how to pivot the data you have mined.

Log Parser lets you run SQL-like commands against a large number of different input types and send the results to a variety of outputs. Out of the box, Log Parser supports the input types shown in Figure 1.

While the list of directly supported input types is impressive, you can also add your own input types. The output types supported are also shown in Figure 1. While it might not be considered an output type, the fact that Log Parser exposes a COM interface to execute queries means that you can create an application that can output pretty much any format you want.

With all of this, there are still a few things that Log Parser can’t do. It currently doesn’t support the SQL Join operation (though rumor has it that future versions will). Log Parser also doesn’t currently have a way to pivot data so I’ve decided to add that functionality here.

Figure 1: Input and Output Types

Input Types

All IIS and https.sys log formats
NCSA
URLscan logs
Comma delimited
Tab delimited
Xml files
Event viewer logs (including current data)
Event tracing for Windows logs
Network monitor capture files
The Windows registry
ADS sources
Generic text files
The file system

Output Types

Comma- or tab-delimited files
XML
Datagrid
GIF files using the Microsoft Office chart Web component
Syslog
W3C
SQL (allowing the creation of new tables)
A template format that allows you to define a text template to format the output

Building Queries

Now let’s first get familiar with Log Parser by implementing a few simple queries that will illustrate how to extend the functionality and usefulness of the tool. I won’t go into too much depth about how to write queries; for more information on that subject, download Log Parser and read through the help file or go to the unofficial homepage.

I’ll be writing the queries in a format that can be copied into a .SQL query file for Log Parser to execute. Assuming you save the query as a file called query.sql, the command line to execute it would be similar to this:

logparser file:query.sql?inputfile=ex*.log

The ex*.log would be replaced with the logs you want to use. I’ll be writing the scripts to parse from the IISW3C formatted input; however, modifying these to work with other inputs is as simple as changing the field names.

The first example looks at the log file to determine which HTTP status codes were sent most frequently for a given time period. The algorithm to determine this is simple: get all the status codes from the log file, group them by the value of the returned code, count the number of items in each group, and sort them in descending order by the number of items. The actual query looks like this:

SELECT sc-status, COUNT(*) 
FROM %inputfile%
GROUP BY sc-status 
ORDER BY COUNT(*) DESC

While this query will work fine, you may want to refine it a little, reading the various results into the variables StatusCode and NumberOfHits:

SELECT sc-status AS StatusCode, 
COUNT(*) AS NumberOfHits 
FROM %inputfile%
GROUP BY StatusCode 
ORDER BY NumberOfHits DESC

Take a look at the output from the query, which is very useful information (see Figure 2). You could filter it a bit more by using functions such as PROPCOUNT to get the percentage of hits, or you could try to determine which URLs seem to be most problematic by selecting both cs-uri-stem and sc-status. However, I’m going to focus on a slightly different question: do my servers behave differently at different times of day? The first attempt at this simply adds the time component. The new Log Parser query looks like this:

SELECT 
TO_LOCALTIME(TO_TIMESTAMP(date,time)) 
as LocalTime, sc-status AS StatusCode, 
COUNT(*) AS NumberOfHits 
FROM %inputfile%
GROUP BY LocalTime, StatusCode 
ORDER BY LocalTime, StatusCode

Take a look at the output in Figure 3. While the data is all there, it is a bit hard to read as you have to match up the status codes, and they are not guaranteed to be in the same order (for example, the 301 responses pushed the other status codes down a row during the 10:00:00 time period). To solve this problem, you need to pivot the data—take row values and turn them into columns.

Figure 2 Query Output

StatusCode NumberOfHits
---------- ------------
200        167262
304        6527
302        2998
404        2980
301        993
500        53
403        27
501        27
206        18
400        1

Pivoting the Data

As I mentioned, pivoting is not currently supported by Log Parser. However with a bit of ingenuity, you can do it yourself. The trick is to use the CASE statement in Log Parser. The Log Parser help file explains the CASE syntax like so:

CASE <field_expression>
  WHEN <field_expression> 
  THEN <field_expression>
  [ ... ]
  [ ELSE <field_expression> ]
 END

As an example, I could write a badly implemented version of the COUNT functionality using the CASE statement, like I have done in the top of Figure 4 (the resulting output is shown in the bottom of the figure).

Figure 4 Query Using CASE Statement

SELECT 
TO_LOCALTIME(TO_TIMESTAMP(date,time)) AS LocalTime, 
SUM(_200) AS 200s
USING 
CASE sc-status WHEN 200 THEN 1 ELSE 0 END AS _200
FROM 
%inputfile%
GROUP BY 
LocalTime 
ORDER BY 
LocalTime

Output
LocalTime           200s
------------------- -----
2005-11-16 09:59:59 68
2005-11-16 10:00:00 153
2005-11-16 10:00:01 157
2005-11-16 10:00:02 184
2005-11-16 10:00:03 189
2005-11-16 10:00:04 172

By extrapolation, I add a few more CASE statements (one for each sc-status I want to support). The query in Figure 5 implements this for the sc-status codes you saw before, and the output is shown in Figure 6 in a format that is much easier for a human to analyze.

Figure 6 East-to-Read Format

LocalTime           200s 206s 301s 302s 304s 400s 403s 404s 500s 501s
------------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
2005-11-16 09:59:59 68   0    0    1    3    0    0    2    0    0
2005-11-16 10:00:00 153  0    2    2    7    0    0    2    0    0
2005-11-16 10:00:01 157  0    0    6    8    0    0    5    0    0
2005-11-16 10:00:02 184  0    3    3    8    0    0    2    0    0
2005-11-16 10:00:03 189  0    1    4    5    0    0    4    0    0
2005-11-16 10:00:04 172  0    3    5    3    0    1    5    0    0
2005-11-16 10:00:05 182  0    2    2    6    0    0    1    0    0
2005-11-16 10:00:06 214  0    0    1    10   0    0    3    0    1
2005-11-16 10:00:07 192  0    1    1    6    0    0    2    0    0
2005-11-16 10:00:08 189  0    0    6    4    0    0    1    0    0

Figure 5 New CASE Query

SELECT 
TO_LOCALTIME(TO_TIMESTAMP(date,time)) AS LocalTime, 
SUM(_200) AS 200s, 
SUM(_206) AS 206s, 
SUM(_301) AS 301s, 
SUM(_302) AS 302s, 
SUM(_304) AS 304s, 
SUM(_400) AS 400s, 
SUM(_403) AS 403s, 
SUM(_404) AS 404s, 
SUM(_500) AS 500s, 
SUM(_501) AS 501s 
USING 
CASE sc-status WHEN 200 THEN 1 ELSE 0 END AS _200,
CASE sc-status WHEN 206 THEN 1 ELSE 0 END AS _206,
CASE sc-status WHEN 301 THEN 1 ELSE 0 END AS _301,
CASE sc-status WHEN 302 THEN 1 ELSE 0 END AS _302,
CASE sc-status WHEN 304 THEN 1 ELSE 0 END AS _304,
CASE sc-status WHEN 400 THEN 1 ELSE 0 END AS _400,
CASE sc-status WHEN 403 THEN 1 ELSE 0 END AS _403,
CASE sc-status WHEN 404 THEN 1 ELSE 0 END AS _404,
CASE sc-status WHEN 500 THEN 1 ELSE 0 END AS _500,
CASE sc-status WHEN 501 THEN 1 ELSE 0 END AS _501
FROM 
%inputfile%
GROUP BY 
LocalTime 
ORDER BY 
LocalTime

Now that you can pivot the data and get it into the structure that is most meaningful for your purposes, it is not terribly difficult to apply this technique to other data problems, like finding the most frequently used virtual directories over a given period of time. Given that you’re doing nothing more than looking for column names within rows of a data set, Log Parser should certainly be able to help automate this task. While it is possible to use Log Parser’s template output format to get close (you need to edit a couple of characters manually), the most hands-off approach would be to use a script.

Script-Generated Queries

The logic for a script to automatically generate these pivot queries is fairly simple. You need a few pieces of input: the column you want to pivot on and the series you want to pivot across. In the example, I want to pivot on the sc-status column, while the series is the LocalTime. I then use Log Parser’s scripting interface to parse the input log (or data you want pivoted) and build the new query based on the values in that input. This requires two parses of the input to get the pivoted data, though the resulting query can be saved and used at a later date. Figure 7 shows an example called PivotBuilder.js that does exactly that, though it is slightly expanded to allow you to add TOP and INTO statements to the query that is produced.

Figure 7 PivotBuilder.js

var Matches = new Array();
var querySnippet = null;
var seriesQuery = null;
var seriesTitle = null;
//Parse arguments first
var szInputLogFilename=null;
var szArgs = WScript.Arguments;
var topVal = null;
var intoVal = null;


if(szArgs.length<1)
{
    WScript.Echo("Usage:");
    WScript.Echo(" -log:<log_filename> -query:<queryToUse> 
-series:<seriesQuery> -sTitle:<SeriesTitle> [-top:<MaxNumColumns> 
-into:<Value to set as INTO in Generated query>]");
    WScript.Quit(-2);
}

for (var i=0; i < szArgs.length; i++)
{
    if(szArgs(i).substr(0,5).toLowerCase()=="-log:")
    {
        szInputLogFilename=szArgs(i).substr(5);
    }
    else if(szArgs(i).substr(0,7).toLowerCase()=="-query:")
    {
        querySnippet=szArgs(i).substr(7);
    }
    else if(szArgs(i).substr(0,5).toLowerCase()=="-top:")
    {
        topVal=szArgs(i).substr(5);
    }
    else if(szArgs(i).substr(0,6).toLowerCase()=="-into:")
    {
        intoVal=szArgs(i).substr(6);
    }
    else if(szArgs(i).substr(0,8).toLowerCase()=="-series:")
    {
        seriesQuery=szArgs(i).substr(8);
    }
    else if(szArgs(i).substr(0,8).toLowerCase()=="-stitle:")
    {
        seriesTitle=szArgs(i).substr(8);
    }
    else
    {
        WScript.Echo("Unrecognized argument: " + szArgs(i));
        WScript.Quit(-1);
    }
}

if( szInputLogFilename==null || querySnippet == null || seriesTitle == null)
{
    WScript.Echo("Not all the required parameters have been specified");
    WScript.Quit(-1);
}
//Create the main Log Parser Query object
var myQuery=new ActiveXObject("MSUtil.LogQuery");

//Create the text of the query
//Note we assume that querySnippet is a TO_STRING-able thing, if not this 
//will fail

var szQuery =     "SELECT "+((topVal==null)?"":"TOP "+topVal)+ " "+((casesensative)?"":"TO_LOWERCASE(") + "TO_STRING("+querySnippet+")" + ((casesensative)?"":")") +" AS columnQuery, COUNT(*) FROM " + szInputLogFilename +" GROUP BY columnQuery ORDER BY COUNT(*) DESC";

outputQuery = "";
var recordSet=myQuery.Execute(szQuery);

for(; !recordSet.atEnd(); recordSet.moveNext())
    {
    record=recordSet.getRecord();
    Matches.push(record.Getvalue(0));
    }

outputQuery = outputQuery +"SELECT "+seriesQuery+" AS "+seriesTitle+", "+ "\r\n";
for(i=0;i<Matches.length;i++)
{
    outputQuery = outputQuery +"    SUM([_"+Matches[i]+"]) AS ["+Matches[i]+"s]" + (((i<Matches.length - 1)?",":"")+ "\r\n");
}
outputQuery = outputQuery +"USING "+ "\r\n";
for(i=0;i<Matches.length;i++)
{
    outputQuery = outputQuery +"    CASE "+((casesensative)?"":"TO_LOWERCASE(") + "TO_STRING("+querySnippet+")"+((casesensative)?"":")") + " WHEN ‘"+Matches[i]+"’ THEN 1 ELSE 0 END AS [_"+Matches[i]+"]"+((i<Matches.length - 1)?",":"") + "\r\n";
}
outputQuery = outputQuery + ((intoVal==null)?"":" INTO "+intoVal) + " FROM %inputfile% GROUP BY " + seriesTitle + " ORDER BY "+ seriesTitle+ "\r\n";
WScript.Echo(outputQuery );

To run the script (PivotBuilder.js), you have to issue the following command line, which takes a number of important parameters. This particular command will cause PivotBuilder.js to run the query to produce the same output as you saw in Figure 6:

cscript /nologo PivotBuilder.js 
-log:ex*.log -query:sc-status -series:
"TO_LOCALTIME(TO_TIMESTAMP(date,time))"
-stitle:LocalTime

Figure 8 shows the Log Parser query that the script generated when run with the aforementioned parameters, which comes out to be pretty similar to the hand crafted one in Figure 5. And the output from this query looks something like Figure 9.

Figure 9 Output Generated from Scripting Query

LocalTime           200s 304s 302s 404s 301s 500s 403s 501s 206s 400s
------------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
2005-11-16 09:59:59 68   3    1    2    0    0    0    0    0    0
2005-11-16 10:00:00 153  7    2    2    2    0    0    0    0    0
2005-11-16 10:00:01 157  8    6    5    0    0    0    0    0    0
2005-11-16 10:00:02 184  8    3    2    3    0    0    0    0    0
2005-11-16 10:00:03 189  5    4    4    1    0    0    0    0    0
2005-11-16 10:00:04 172  3    5    5    3    0    1    0    0    0
2005-11-16 10:00:05 182  6    2    1    2    0    0    0    0    0
2005-11-16 10:00:06 214  10   1    3    0    0    0    1    0    0
2005-11-16 10:00:07 192  6    1    2    1    0    0    0    0    0
2005-11-16 10:00:08 189  4    6    1    0    0    0    0    0    0

Figure 8 Query Generated by PivotBuilder.js

SELECT TO_LOCALTIME(TO_TIMESTAMP(date,time)) AS LocalTime,
    SUM([_200]) AS [200s],
    SUM([_304]) AS [304s],
    SUM([_302]) AS [302s],
    SUM([_404]) AS [404s],
    SUM([_301]) AS [301s],
    SUM([_500]) AS [500s],
    SUM([_403]) AS [403s],
    SUM([_501]) AS [501s],
    SUM([_206]) AS [206s],
    SUM([_400]) AS [400s]
USING
    CASE TO_STRING(sc-status) WHEN ‘200’ THEN 1 ELSE 0 END AS [_200],
    CASE TO_STRING(sc-status) WHEN ‘304’ THEN 1 ELSE 0 END AS [_304],
    CASE TO_STRING(sc-status) WHEN ‘302’ THEN 1 ELSE 0 END AS [_302],
    CASE TO_STRING(sc-status) WHEN ‘404’ THEN 1 ELSE 0 END AS [_404],
    CASE TO_STRING(sc-status) WHEN ‘301’ THEN 1 ELSE 0 END AS [_301],
    CASE TO_STRING(sc-status) WHEN ‘500’ THEN 1 ELSE 0 END AS [_500],
    CASE TO_STRING(sc-status) WHEN ‘403’ THEN 1 ELSE 0 END AS [_403],
    CASE TO_STRING(sc-status) WHEN ‘501’ THEN 1 ELSE 0 END AS [_501],
    CASE TO_STRING(sc-status) WHEN ‘206’ THEN 1 ELSE 0 END AS [_206],
    CASE TO_STRING(sc-status) WHEN ‘400’ THEN 1 ELSE 0 END AS [_400]
 FROM %inputfile% GROUP BY LocalTime ORDER BY LocalTime

Obviously this is not a perfect world, since if your values change from input to input you would have to rerun this script to produce the correct query for the log, but this is a lot easier than rewriting the script for each one by hand. If you want, you could add or edit a few lines of PivotBuilder.js to have it call Log Parser with the query that you built up (instead of using WScript.Echo to let you see the query, you would save the query into a variable and then call Log Parser to execute the query). You can download a copy of this script from the TechNet Magazine site.

Wrap-Up

I have used Log Parser when working with more than a hundred values (so more than one hundred columns in the output) and it has worked quite well. For fun I took the number of columns in a particular query to 1,000 and while it worked, I didn’t completely verify all the output (it also took a while to run). However in most cases, you will only need a few columns and can use the "–top:" parameter with PivotBuilder.js.

While this works great for pivoting certain kinds of data (specifically counts), it doesn’t work as well when the content of the field you want is a string or other value. In that case you need to write a custom input parser COM object. While this does involve writing some code, I’ve found that it isn’t really a daunting task if I take advantage of the simplicity involved in using a scripting language to produce COM objects. QFE.wsc was included in the sample scripts for Log Parser. It is a good example of how to implement a custom input type using a scripting language. By using the versatility and extendibility inherent in Log Parser, it is possible to produce more readable outputs to allow for easier data mining.

Log Parser Version 2.2 Sample Commands

Here’s a sampling of LogParser functions to give you an idea of what kind of data you can extract using LogParser.

Print the 10 largest files on the C: drive:

LogParser "SELECT TOP 10 * 
FROM C:\*.* ORDER BY Size DESC" -i:FS

Create an XML report file containing logon account names and dates from the Security Event Log messages:

LogParser "SELECT TimeGenerated AS LogonDate, 
EXTRACT_TOKEN(Strings, 0, ‘|’)
AS Account INTO Report.xml FROM Security 
WHERE EventID NOT IN (541;542;543)
AND EventType = 8 AND EventCategory = 2"

Load a portion of the registry into a SQL table:

LogParser "SELECT * INTO MyTable 
FROM \HKLM" -i:REG -o:SQL -server:MyServer
-database:MyDatabase 
-driver:"SQL Server" 
-username:TestSQLUser
-password:TestSQLPassword -createTable:ON

Parse the output of a netstat command:

netstat | LogParser "SELECT * FROM STDIN"
 -i:TSV -iSeparator:space -nSep:2
-fixedSep:off -nSkipLines:3

Display users’ job title breakdown from Active Directory:

LogParser "SELECT title, 
MUL(PROPCOUNT(*), 100.0) AS Percentage INTO
DATAGRID FROM 
‘LDAP://myusername:mypassword@mydomain/CN=Users,
DC=mydomain, DC=com’ WHERE title IS NOT NULL 
GROUP BY title ORDER BY Percentage DESC"
-objClass:user

Get the distribution of EventID values for each Source:

LogParser "SELECT SourceName, 
EventID, MUL(PROPCOUNT(*) ON (SourceName),
100.0) AS Percent FROM System 
GROUP BY SourceName, EventID ORDER BY
SourceName, Percent DESC"

Display total network traffic bytes per second:

LogParser "SELECT QUANTIZE(DateTime, 1) 
AS Second, SUM(FrameBytes) INTO
DATAGRID FROM myCapture.cap GROUP BY Second"

Display the distribution of registry value types:

LogParser "SELECT ValueType, COUNT(*) 
INTO DATAGRID FROM \HKLM GROUP BY
ValueType"

Display titles of current channels on MSDN blogs:

LogParser "SELECT title INTO MyOutput.txt FROM
https://blogs.msdn.com/MainFeed
.aspx#/rss/channel/item" -i:XML 
-fMode:Tree -o:tpl -tpl:mytemplate.tpl

Jeffrey A. Johnson is a Senior Systems Engineer with Microsoft.com and spends his days analyzing and debugging Web server-related issues. As a veteran of the IIS test team, he was the unofficial tester for Log Parser and has been using it since before its public release. For more information contact mscomblg@microsoft.com.