Inside Microsoft.com: Analyzing Denial of Service Attacks

Article
06/05/2009

Inside Microsoft.com Analyzing Denial of Service Attacks

Paul Wright

Download the code for this article: Log Parser 2.2 (1,400KB)

If you’ve been following this column, you know how big Microsoft.com is. If not, you can probably imagine. On average, the site reaches over 13 million unique users per day, and about 289 million per month. Coupled with over 200 million unique users per day to Microsoft Update, plus downloads and 110+ smaller sites, you can just imagine the tremendous amount of data that must be captured and analyzed by our team every day.

With all of that data, you might wonder how we determine which traffic patterns represent an attack and which are normal activity. One approach the Microsoft.com Operations team employs is to perform security-related baselining and analysis using Log Parser and Netmon. In this column I’ll outline some of the data the team gathers with these two tools and show how we look at the data to help defend against denial of service attacks against our sites. Of course, the type of traffic, applications, and network characteristics of your environment will differ, but this should provide an initial foundation that you can leverage for your needs.

The Sample Scenario

Although Log Parser can be used for a multitude of different file formats, the examples provided here will target Netmon .cap files, the IIS W3C log file format, HTTP.SYS, and the URLScan log file format. The Log Parser examples will have the initial Log Parser command line at the top of the section with the SQL statements following for each particular view of the data. With the SQL statements in separate files, the Log Parser statement can then be updated with the new SQL file name to generate the desired output.

The output of the examples is targeted to the DataGrid option of Log Parser as this is easy to consume when first looking at the query output; however, the INTO clause in each SQL statement can be adjusted to send output to a CSV file or SQL Server database instance or removed to send data to the console as desired.

From these examples you can explore and update the SQL statements in the files to gather other information or views of the data as you see fit to baseline or analyze your environment and applications. The author of Log Parser and support community share information and helpful hints at www.logparser.com.

Baselining Current Traffic

Let's look at two scenarios in which having strong baseline data helps in determining the cause of traffic anomalies. In the first situation, the baseline data helps you set thresholds for monitoring and alerting to allow for proactive action should your systems show signs of anomalous traffic. In the second situation you find yourself being reactive rather than proactive because your site went down due to a sudden large event. The deeper the understanding of normal baselines and current trends, the more rapid and conclusive your analysis will be during any given anomalous event.

The key thing to know is at what frequency you should collect traffic data for comparison. For instance, if traffic patterns vary greatly from day to day, a weekly view sampled over multiple weeks throughout the year would be in order. If the traffic varies within the day, but those patterns generally repeat from day to day, a 24-hour view sampled throughout the year might be enough. In addition, keep in mind that normal flash traffic may also impact your baseline numbers (think holiday shopping season on an e-commerce site).

While the following techniques can be used in a number of different scenarios, we will review baselining data from the network and application layers for Web sites and services. We will also discuss the other important aspect of baseline data, which is keeping that data up to date.

Please do keep in mind that we are not attempting to cover every aspect of baselining a particular site or application’s behavior. For instance, Perfmon-related counters relevant to your particular site or application are very important to the overall baseline, but we will leave those for another discussion.

Web-Based Network Traffic

The goal of baselining network traffic is to understand the key characteristics of the client behaviors from a TCP/IP perspective. Microsoft.com Operations reduces this data to total unique IPs, TCPFlag distribution, and bandwidth utilization for the given client traffic.

We start off with a raw Netmon capture from the Internet-facing network interface card (NIC) of several of our servers for a given site or application. Remember that the default capture buffer is only 1MB so you will need to expand the buffer to gather a significant sample. For instance, we use 500MB samples from several servers to gather enough data to ensure that we can accurately baseline traffic characteristics.

The other key point to remember is that capturing traffic for a set amount of time is important in order to ensure apples-to-apples comparisons for some of the measurements. Generally we use rolling one-hour captures, but this interval could be larger or smaller as deemed appropriate.

Here is a Log Parser syntax for Netmon parsing:

Logparser.exe file:<LogParserSQLFileName>.sql?capfile=capture.cap -i:netmon

The following Log Parser example will provide the percentage of packets per each TCP Flag type. Figure 1 shows the output from this query:

SELECT
    TCPFlags,
    Count(*) AS Packets,
    TO_INT(MUL(100.0,PropCount(*))) AS 
        Percent
INTO DATAGRID
FROM %capfile%
GROUP BY TCPFlags
ORDER BY Packets DESC

This view of the data is key to understanding SYN, ACK, and other floods that might be coming your way.

Figure 1 TCPFlag Types

TCPFlags	Packets	Percent
A	11531	66
AP	2970	17
AS	798	5
S	788	4
APF	44	0

For normal traffic (no TCP-based attacks in progress), you might expect a TCPFlag distribution like that in Figure 1. As you see, during normal traffic for this particular application, the percentage of SYNs (indicated by the TCPFlag "S") is significantly lower than the percentage of ACKs (indicated by the TCPFlag "A"). As you’ll see later, this will change dramatically for a SYN attack.

Next we baseline the SYNs/SEC/Src IP to help understand the normal distribution of SYNs for IP addresses hitting the systems. This is interesting from a normal client behavior perspective as well as understanding the behavior and characteristics of clients fronted by proxy and Network Address Translation (NAT) devices where the actual source IP of each client is masked by those devices. The actual IP addresses are not relevant for the baseline. The following query shows HTTP traffic, but it can be updated to cover any relevant TCP port. Figure 2 shows the output from the query.

SELECT 
    SrcIP,
    Count(*) AS TotalSYNs,
    DIV(TO_REAL(TotalSYNs),
        SUB(
            MAX(TO_REAL(DateTime)),
            MIN(TO_REAL(DateTime))
    )
    ) As SYNsPerSec 
INTO DATAGRID
FROM %capfile%
WHERE DstPort = 80 AND TCPFlags = ‘S’
GROUP BY SrcIP 
ORDER BY SYNsPerSec DESC

Figure 2 HTTP Traffic

SrcIP	TotalSYNs	SYNsPerSec
10.1.1.1	4	0.297674
10.2.2.2	4	2.348624
10.3.3.3	4	1.79021

Figure 2 also shows that per-second SYN rates can appear more significant than reality so remember to review both the total SYNs and per-second numbers. In addition, for some applications it is interesting to gather data on the normal bandwidth consumed by a given connection. For this example, we will capture data for outbound and inbound traffic separately. Figure 3 shows the queries for both the outbound and inbound traffic.

Figure 3 Bandwidth Consumption

OutboundBandwidthPerConnection

SELECT
    DstIP,
    TO_REAL(SUM(FrameBytes)) AS TotalBytes_Out,
    DIV(
        TotalBytes_Out,
        SUB(
            MAX(TO_REAL(DateTime)),
            MIN(TO_REAL(DateTime))
        )
    ) AS BytesPerSec_Out,
    CASE SrcPort
        WHEN 80 THEN ‘HTTP’
        WHEN 443 THEN ‘SSL’        
    END AS Protocol
INTO DATAGRID
FROM %capfile%
WHERE (SrcPort=80 or SrcPort=443)
GROUP BY SrcIP,DstIP,SrcPort
ORDER BY TotalBytes_Out DESC

InboundBandwidthPerConnection

SELECT
    SrcIP,
    TO_REAL(SUM(FrameBytes)) AS TotalBytesIn,
    DIV(
        TotalBytesIn,
        SUB(
            MAX(TO_REAL(DateTime)),
            MIN(TO_REAL(DateTime))
        )
    ) AS BytesPerSec_In,
    CASE DstPort
        WHEN 80 THEN ‘HTTP’
        WHEN 443 THEN ‘SSL’
    END AS Protocol
INTO DATAGRID
FROM %capfile%
WHERE (DstPort=80 or DstPort=443)
GROUP BY SrcIP,DstIP,DstPort
ORDER BY TotalBytesIn DESC

These are the main factors that we profile from Netmon captures, but there are a number of additional options available. Dig in and explore the other views relevant to your application.

IIS Logs

In analyzing IIS logs we will baseline top request characteristics, unique client IPs, top client IPs, User Agent string characteristics, IP to User Agent characteristics, and total number of requests to understand site behaviors at the Web application layer.

The following shows the Log Parser Syntax for this section:

Logparser.exe file:sample.sql?logfile=\\
servername\wwwlog\w3svc1\ex050101*.log

Note that the asterisk in the previous line will cause Log Parser to use all logs for January 1, 2005. Changing the log file destination to ex0501*.log will cover all logs for January 2005. This is one method that can be used to cover a given time period after you determine what time frame must be used for your baseline data.

Gathering the unique request hit count gives a solid baseline for the number of hits to a given set of pages. If this number shifts significantly, it can indicate a targeted attack against a given page. The following is the query to get the count of URI hits:

SELECT 
      cs-uri-stem AS URI-Stem, 
      cs-host AS HostName, 
      COUNT(*) AS Hits
INTO DATAGRID
FROM %logfile%
GROUP BY cs-uri-stem, cs-host 
ORDER BY Hits DESC

To understand the normal pattern of requests to a site, one of the first pieces of data to look at is the normal distribution of clients to requests, which you can ascertain from the following query:

SELECT 
    c-ip AS ClientIP,
    cs-host AS HostName
    cs-uri-stem AS URIStem,
    sc-status AS Status,
    cs(User-Agent) AS UserAgent,
    count (*) as Requests
INTO DATAGRID
FROM %logfile%
GROUP BY c-ip, cs-uri-stem, cs-host, cs(User-Agent), sc-status 
ORDER BY Requests DESC

Figure 4 shows some typical output. Note that the User-Agent field has been shortened for space.

Figure 4 Gathering Host Names

ClientIP	HostName	URIStem	Status	UserAgent	Requests
10.1.1.1	microsoft.com	/default.aspx	200	Windows+XP...	289
10.1.1.2	microsoft.com	/default.aspx	200	Windows+XP...	228
10.1.1.4	microsoft.com	/default.aspx	200	Mozilla/4.0...	216

The CIP-URI.SQL query may also be updated to include the use of the reverse DNS function of Log Parser by inserting REVERSEDNS(c-ip) in the SELECT clause. This can be quite helpful in identifying things like patterns of abusing domains. However, this feature is quite slow when dealing with large log files, so it may make sense to include this as a secondary parsing of the results that are of most concern rather than an entire log file.

These numbers are useful both alone and when compared to each other. A shift in either number as compared to a baseline can help validate whether an attack is in progress and provide further details to investigate, as shown here:

SELECT 
    COUNT(DISTINCT c-ip) AS UniqueIPs,
    COUNT(ALL c-ip) AS TotalRequests
INTO DATAGRID
FROM %logfile%

The next three queries, hits by user agent string, status code, and method, generally provide good backup data when investigating an issue. It can be very beneficial to validate findings during an event against this data. For instance, if you see a relatively normal distribution of the number of requests, but there appears to be an odd number of POSTs, comparing the requests to your baseline data will indicate if something unusual is happening at the moment. The following query gathers the normal distribution of requests by method:

SELECT 
    cs-method AS Method, 
    COUNT (*) AS Hits
INTO DATAGRID
FROM %logfile% 
GROUP BY cs-method
ORDER BY Hits DESC

You can extrapolate this to any of the other fields available. We’ve done this for User Agent and Status Code in Figure 5 as they are generally interesting for our purposes. The code in Figure 5 can be expanded or contracted by modifying the 100 or removing the Top 100 altogether.

Figure 5 Hits By Method

UserAgentHits

SELECT 
    cs(User-Agent) AS UserAgent, 
    COUNT (*) AS Hits
INTO DATAGRID
FROM %logfile% 
GROUP BY UserAgent
ORDER BY Hits DESC

URIStemCountRollup

SELECT TOP 100 
    cs-uri-stem,
    sc-status,
    COUNT(*) AS Hits
INTO DATAGRID
FROM %logfile%
GROUP BY cs-uri-stem,sc-status 
ORDER BY Hits DESC

Methods

SELECT 
    cs-method AS Method,
    COUNT (*) AS Hits
INTO DATAGRID
FROM %logfile% 
GROUP BY cs-method
ORDER BY Hits DESC

HTTP.SYS Error Logs

Baselining the HTTP.SYS error log is essential to ensuring any inbound attack traffic does not fly under your radar. For instance, should an attacker send a large number of requests that are rejected by HTTP.SYS, the IIS log parsing mentioned earlier would not show the issues during an investigation.

See Knowledge Base article 832975 ("Additional properties are now available for logging in the Httperr#.log file in IIS 6.0"). HTTP.sys Reason Phrases can also be found at "HTTP.sys Error Log Files (IIS 6.0)".

The next metric we collect, invalid requests per client IP, is very similar to the number of normal requests per client IP as gathered from the IIS logs. These just represent the normal errors for your application or system at a given time. The following query will give you the errors and Figure 6 is an example of the output you’ll get.

SELECT 
    c-ip AS ClientIP,
    Count(*) as Errors
INTO DATAGRID
FROM %logfile% 
GROUP BY c-ip
ORDER BY Errors DESC

The following is the query for invalid requests per failure reason and Figure 7 shows the output:

SELECT 
    s-reason AS ErrorReason,
    Count(*) as Errors
INTO DATAGRID
FROM %logfile% 
GROUP BY s-reason
ORDER BY Errors DESC

Because the HTTP.SYS logs are set to roll based on size rather than time, the following baseline data provides distribution by hour that would otherwise be gathered by looking at a single log (assuming logs are set to roll once per hour). Figure 8 shows the output from this query:

SELECT 
    QUANTIZE(
             TO_TIMESTAMP(Date,Time),
             3600) 
    as Hour,
    Count(*) as Errors
INTO DATAGRID
FROM %logfile% 
GROUP BY Hour

Figure 8 Errors by Hour

Hour	Errors
10/25/2005 8:00	62771
10/25/2005 9:00	141144
10/25/2005 15:00	62007

Figure 7 Failure Reason

ErrorReason	Errors
Timer_ConnectionIdle	1131657
Timer_MinBytesPerSecond	5227
Hostname	3

Figure 6 Errors

ClientIP	Errors
10.1.1.1	5470
10.2.2.2	1769
10.6.6.6	544

URLScan Logs

URLScan logs should be baselined for the same reasons you baseline HTTP.SYS error logs. Following is the query for total rejected requests per client IP:

SELECT 
    c-ip as ClientIP,
    Count (*) as RejectedRequests
INTO DATAGRID
FROM %logfile% 
GROUP BY ClientIP
ORDER BY RejectedRequests DESC

The following query will show you the total rejected requests by URL. The output is shown in Figure 9.

Figure 9 Rejected URLs

URL	RejectedRequests
/_vti_bin/shtml.exe/_vti_rpc	92
/money/headlines/8.0/us/suite/qperm.exe	37
/setup.exe	24
/money/headlines/8.0/us/suite/npcupdt.exe	23
/windows/ie/download/.css	10

SELECT 
    cs-uri as URL,
    Count (*) as RejectedRequests
INTO DATAGRID
FROM %logfile% 
GROUP BY URL
ORDER BY RejectedRequests DESC

Analyzing an Attack or Event

Analysis starts with gathering the equivalent data related to the baselines discussed here. To this end, rerun the data gathering queries I've outlined and compare the data. Saving the example queries as well as any new or modified queries and processes you have come up with makes this process more repeatable and is highly recommended. The more straightforward and repeatable the processes, the more likely the analysis will prove useful.

Comparison of anomalous data against baselines generally involves comparing the associated numbers against each other for the given queries and then digging in from there. For instance, the TCPFlag distribution query output example showed a 4 percent SYN distribution for normal traffic periods. Doing this same data collection during an event might show something like you see in Figure 10. Given the wide shift of SYN and SYN-ACK packets, we would begin defending based on this being a SYN flood. But what if the data isn’t so clear?

Figure 10 TCPFlags

TCPFlags	Packets	Percent
TCPFlags	Packets	Percent
S	170286	56
AS	118508	39
R	13799	5
AR	207	0
AP	50	0
A	43	0
AF	8	0
APF	4	0
APR	2	0

Digging in Deeper

Based on the data from a given query, you may want to use more specific queries to find more information about the client or clients making requests at anomalous rates. For instance, you have gone through the work of automating alerting on SYNs/SEC/SrcIP and the system has detected a large spike in the number of SYNs for several hosts. The output of the SYNsPerSecPerSrcIP query might look something like Figure 11.

Figure 11 SYNs per Second

SrcIP	TotalSYNs	SYNsPerSec
10.1.1.1	10	58.181818
10.2.2.2	8	64
10.3.3.3	7	44.8

Since the number of SYNs per client IP is not that high, we might be looking at a full HTTP GET attack or a distributed SYN flood where each IP does not send a significant number of SYNs. Since we have discussed the anomalous TCPFlags distribution scenario, we will assume the distribution in this new example is actually within the norms. Based on that, a review of the actual requests reaching IIS, HTTP.SYS, and URLScan logs would be in order.

By using the CIP-URI.sql query to look at the IIS logs, with a modification to grab data specific to the particular IP address that we suspect is causing issues, we can quickly determine what that client is doing that might be anomalous.

The SQL file should now contain an update with a WHERE clause for the specific IP address or addresses in question, like so:

SELECT 
    c-ip AS ClientIP, 
    cs-uri-stem AS URIStem, 
    sc-status AS Status, 
    cs-host AS HostName, 
    cs(User-Agent) AS UserAgent, 
    count (*) as Requests
INTO DATAGRID
FROM %logfile%
WHERE c-ip IN (‘10.1.1.1’;’10.2.2.2’)
GROUP BY c-ip, cs-uri-stem, cs-host, cs(User-Agent), 
   sc-status
ORDER BY Requests DESC

This will show data about the actual requests these two clients are making as well as the number of requests. From there you can make decisions about what actions to take to mitigate the issue.

Keep in mind that while you may notice a SYN flood or full HTTP GET request flood, there may be another attack happening behind the scenes. This query might show this if the requests are making it up to IIS; however, at this point it is useful to look at the traffic that the given clients are sending that is not making it to IIS.

For this we would update the URLScan and HTTP.SYS queries to zero in on the specific IPs as well as running the general comparison. As an example of one of the more specific queries, we might modify the HTTPSYSInvalidRequestsPerClientIP.sql query to include the error reason as follows:

SELECT 
    c-ip AS ClientIP, 
    s-reason AS ErrorReason, 
    Count(*) as Errors
INTO DATAGRID
FROM %logfile% 
WHERE c-ip = ‘10.1.1.1’
GROUP BY c-ip, s-reason
ORDER BY Errors DESC

Further digging will be required to determine all of the requests a particular client is making against your systems and applications, but these examples give you an idea of the approach you can take to get there.

What’s Next

Clearly, gathering the baseline data just once will not provide significant benefit and manually gathering the data multiple times can be prohibitive given the many other tasks required to keep a site up and running. The optimal solution is to automate data gathering, storage, and reporting. Depending on the resources available, the level of automation will vary greatly.

However, even with minimal resources, it is possible to automate the collection of a great deal of this baseline and event data. For instance, using the INTO clause from Log Parser to inject the data into a SQL database and running the Log Parser commands from scripts in scheduled tasks provide the rudimentary data collection automation necessary to make this task reasonable from a time investment and the return on investment is huge if there is an anomalous event you must investigate.

Keep in mind that you will also need to investigate other information sources to understand the full picture. You also must collect the right server performance data, such as CPU and memory utilization, to fully understand the baseline performance characteristics of a site. Each new application deployment will bring with it the need to baseline the new characteristics so tracking over time is essential to ensuring the correct and current outlook during an attack or event.

Wrapping It Up

I’ve covered some of the basic uses of Log Parser and Netmon for baselining normal traffic and for analyzing anomalous events. While this is by no means an exhaustive review, this information will help you expand your understanding of the traffic patterns and client behaviors on your site and help you prepare for anomalous events. Look for more information on this subject as well as other topics relating to our Microsoft.com operations team at blogs.technet.com/mscom.

The following members of the Microsoft.com team have contributed to this column: Paul Wright, Technology Architect Manager; Ivan Ivanov, Technology Architect; Guy Gordon, Program Manager; and Jeffrey A. Johnson, Senior Systems Engineer.