Inside Microsoft.com Analyzing Denial of Service Attacks
Download the code for this article: Log Parser 2.2 (1,400KB)
If you’ve been following this column, you know how big Microsoft.com is. If not, you can probably imagine. On average, the site reaches over 13 million unique users per day, and about 289 million per month. Coupled with over 200 million unique users per day to Microsoft Update, plus downloads and 110+ smaller sites, you can just imagine the tremendous amount of data that must be captured and analyzed by our team every day.
With all of that data, you might wonder how we determine which traffic patterns represent an attack and which are normal activity. One approach the Microsoft.com Operations team employs is to perform security-related baselining and analysis using Log Parser and Netmon. In this column I’ll outline some of the data the team gathers with these two tools and show how we look at the data to help defend against denial of service attacks against our sites. Of course, the type of traffic, applications, and network characteristics of your environment will differ, but this should provide an initial foundation that you can leverage for your needs.
The Sample Scenario
Although Log Parser can be used for a multitude of different file formats, the examples provided here will target Netmon .cap files, the IIS W3C log file format, HTTP.SYS, and the URLScan log file format. The Log Parser examples will have the initial Log Parser command line at the top of the section with the SQL statements following for each particular view of the data. With the SQL statements in separate files, the Log Parser statement can then be updated with the new SQL file name to generate the desired output.
The output of the examples is targeted to the DataGrid option of Log Parser as this is easy to consume when first looking at the query output; however, the INTO clause in each SQL statement can be adjusted to send output to a CSV file or SQL Server database instance or removed to send data to the console as desired.
From these examples you can explore and update the SQL statements in the files to gather other information or views of the data as you see fit to baseline or analyze your environment and applications. The author of Log Parser and support community share information and helpful hints at www.logparser.com.
Baselining Current Traffic
Let's look at two scenarios in which having strong baseline data helps in determining the cause of traffic anomalies. In the first situation, the baseline data helps you set thresholds for monitoring and alerting to allow for proactive action should your systems show signs of anomalous traffic. In the second situation you find yourself being reactive rather than proactive because your site went down due to a sudden large event. The deeper the understanding of normal baselines and current trends, the more rapid and conclusive your analysis will be during any given anomalous event.
The key thing to know is at what frequency you should collect traffic data for comparison. For instance, if traffic patterns vary greatly from day to day, a weekly view sampled over multiple weeks throughout the year would be in order. If the traffic varies within the day, but those patterns generally repeat from day to day, a 24-hour view sampled throughout the year might be enough. In addition, keep in mind that normal flash traffic may also impact your baseline numbers (think holiday shopping season on an e-commerce site).
While the following techniques can be used in a number of different scenarios, we will review baselining data from the network and application layers for Web sites and services. We will also discuss the other important aspect of baseline data, which is keeping that data up to date.
Please do keep in mind that we are not attempting to cover every aspect of baselining a particular site or application’s behavior. For instance, Perfmon-related counters relevant to your particular site or application are very important to the overall baseline, but we will leave those for another discussion.
Web-Based Network Traffic
The goal of baselining network traffic is to understand the key characteristics of the client behaviors from a TCP/IP perspective. Microsoft.com Operations reduces this data to total unique IPs, TCPFlag distribution, and bandwidth utilization for the given client traffic.
We start off with a raw Netmon capture from the Internet-facing network interface card (NIC) of several of our servers for a given site or application. Remember that the default capture buffer is only 1MB so you will need to expand the buffer to gather a significant sample. For instance, we use 500MB samples from several servers to gather enough data to ensure that we can accurately baseline traffic characteristics.
The other key point to remember is that capturing traffic for a set amount of time is important in order to ensure apples-to-apples comparisons for some of the measurements. Generally we use rolling one-hour captures, but this interval could be larger or smaller as deemed appropriate.
Here is a Log Parser syntax for Netmon parsing:The following Log Parser example will provide the percentage of packets per each TCP Flag type. Figure 1 shows the output from this query:This view of the data is key to understanding SYN, ACK, and other floods that might be coming your way.
Logparser.exe file:<LogParserSQLFileName>.sql?capfile=capture.cap -i:netmon
SELECT TCPFlags, Count(*) AS Packets, TO_INT(MUL(100.0,PropCount(*))) AS Percent INTO DATAGRID FROM %capfile% GROUP BY TCPFlags ORDER BY Packets DESC
For normal traffic (no TCP-based attacks in progress), you might expect a TCPFlag distribution like that in Figure 1. As you see, during normal traffic for this particular application, the percentage of SYNs (indicated by the TCPFlag "S") is significantly lower than the percentage of ACKs (indicated by the TCPFlag "A"). As you’ll see later, this will change dramatically for a SYN attack.
Next we baseline the SYNs/SEC/Src IP to help understand the normal distribution of SYNs for IP addresses hitting the systems. This is interesting from a normal client behavior perspective as well as understanding the behavior and characteristics of clients fronted by proxy and Network Address Translation (NAT) devices where the actual source IP of each client is masked by those devices. The actual IP addresses are not relevant for the baseline. The following query shows HTTP traffic, but it can be updated to cover any relevant TCP port. Figure 2 shows the output from the query.
SELECT SrcIP, Count(*) AS TotalSYNs, DIV(TO_REAL(TotalSYNs), SUB( MAX(TO_REAL(DateTime)), MIN(TO_REAL(DateTime)) ) ) As SYNsPerSec INTO DATAGRID FROM %capfile% WHERE DstPort = 80 AND TCPFlags = ‘S’ GROUP BY SrcIP ORDER BY SYNsPerSec DESC
Figure 2 also shows that per-second SYN rates can appear more significant than reality so remember to review both the total SYNs and per-second numbers. In addition, for some applications it is interesting to gather data on the normal bandwidth consumed by a given connection. For this example, we will capture data for outbound and inbound traffic separately. Figure 3 shows the queries for both the outbound and inbound traffic.
SELECT DstIP, TO_REAL(SUM(FrameBytes)) AS TotalBytes_Out, DIV( TotalBytes_Out, SUB( MAX(TO_REAL(DateTime)), MIN(TO_REAL(DateTime)) ) ) AS BytesPerSec_Out, CASE SrcPort WHEN 80 THEN ‘HTTP’ WHEN 443 THEN ‘SSL’ END AS Protocol INTO DATAGRID FROM %capfile% WHERE (SrcPort=80 or SrcPort=443) GROUP BY SrcIP,DstIP,SrcPort ORDER BY TotalBytes_Out DESC
SELECT SrcIP, TO_REAL(SUM(FrameBytes)) AS TotalBytesIn, DIV( TotalBytesIn, SUB( MAX(TO_REAL(DateTime)), MIN(TO_REAL(DateTime)) ) ) AS BytesPerSec_In, CASE DstPort WHEN 80 THEN ‘HTTP’ WHEN 443 THEN ‘SSL’ END AS Protocol INTO DATAGRID FROM %capfile% WHERE (DstPort=80 or DstPort=443) GROUP BY SrcIP,DstIP,DstPort ORDER BY TotalBytesIn DESC
These are the main factors that we profile from Netmon captures, but there are a number of additional options available. Dig in and explore the other views relevant to your application.
In analyzing IIS logs we will baseline top request characteristics, unique client IPs, top client IPs, User Agent string characteristics, IP to User Agent characteristics, and total number of requests to understand site behaviors at the Web application layer.
The following shows the Log Parser Syntax for this section:
Logparser.exe file:sample.sql?logfile=\\ servername\wwwlog\w3svc1\ex050101*.log
Note that the asterisk in the previous line will cause Log Parser to use all logs for January 1, 2005. Changing the log file destination to ex0501*.log will cover all logs for January 2005. This is one method that can be used to cover a given time period after you determine what time frame must be used for your baseline data.
Gathering the unique request hit count gives a solid baseline for the number of hits to a given set of pages. If this number shifts significantly, it can indicate a targeted attack against a given page. The following is the query to get the count of URI hits:
SELECT cs-uri-stem AS URI-Stem, cs-host AS HostName, COUNT(*) AS Hits INTO DATAGRID FROM %logfile% GROUP BY cs-uri-stem, cs-host ORDER BY Hits DESC
To understand the normal pattern of requests to a site, one of the first pieces of data to look at is the normal distribution of clients to requests, which you can ascertain from the following query:Figure 4 shows some typical output. Note that the User-Agent field has been shortened for space.
SELECT c-ip AS ClientIP, cs-host AS HostName cs-uri-stem AS URIStem, sc-status AS Status, cs(User-Agent) AS UserAgent, count (*) as Requests INTO DATAGRID FROM %logfile% GROUP BY c-ip, cs-uri-stem, cs-host, cs(User-Agent), sc-status ORDER BY Requests DESC
The CIP-URI.SQL query may also be updated to include the use of the reverse DNS function of Log Parser by inserting REVERSEDNS(c-ip) in the SELECT clause. This can be quite helpful in identifying things like patterns of abusing domains. However, this feature is quite slow when dealing with large log files, so it may make sense to include this as a secondary parsing of the results that are of most concern rather than an entire log file.
These numbers are useful both alone and when compared to each other. A shift in either number as compared to a baseline can help validate whether an attack is in progress and provide further details to investigate, as shown here:The next three queries, hits by user agent string, status code, and method, generally provide good backup data when investigating an issue. It can be very beneficial to validate findings during an event against this data. For instance, if you see a relatively normal distribution of the number of requests, but there appears to be an odd number of POSTs, comparing the requests to your baseline data will indicate if something unusual is happening at the moment. The following query gathers the normal distribution of requests by method:
SELECT COUNT(DISTINCT c-ip) AS UniqueIPs, COUNT(ALL c-ip) AS TotalRequests INTO DATAGRID FROM %logfile%
You can extrapolate this to any of the other fields available. We’ve done this for User Agent and Status Code in Figure 5 as they are generally interesting for our purposes. The code in Figure 5 can be expanded or contracted by modifying the 100 or removing the Top 100 altogether.
SELECT cs(User-Agent) AS UserAgent, COUNT (*) AS Hits INTO DATAGRID FROM %logfile% GROUP BY UserAgent ORDER BY Hits DESC
SELECT TOP 100 cs-uri-stem, sc-status, COUNT(*) AS Hits INTO DATAGRID FROM %logfile% GROUP BY cs-uri-stem,sc-status ORDER BY Hits DESC
HTTP.SYS Error Logs
Baselining the HTTP.SYS error log is essential to ensuring any inbound attack traffic does not fly under your radar. For instance, should an attacker send a large number of requests that are rejected by HTTP.SYS, the IIS log parsing mentioned earlier would not show the issues during an investigation.
See Knowledge Base article 832975 ("Additional properties are now available for logging in the Httperr#.log file in IIS 6.0"). HTTP.sys Reason Phrases can also be found at "HTTP.sys Error Log Files (IIS 6.0)".
The next metric we collect, invalid requests per client IP, is very similar to the number of normal requests per client IP as gathered from the IIS logs. These just represent the normal errors for your application or system at a given time. The following query will give you the errors and Figure 6 is an example of the output you’ll get.The following is the query for invalid requests per failure reason and Figure 7 shows the output:Because the HTTP.SYS logs are set to roll based on size rather than time, the following baseline data provides distribution by hour that would otherwise be gathered by looking at a single log (assuming logs are set to roll once per hour). Figure 8 shows the output from this query:
SELECT c-ip AS ClientIP, Count(*) as Errors INTO DATAGRID FROM %logfile% GROUP BY c-ip ORDER BY Errors DESC
SELECT s-reason AS ErrorReason, Count(*) as Errors INTO DATAGRID FROM %logfile% GROUP BY s-reason ORDER BY Errors DESC
SELECT QUANTIZE( TO_TIMESTAMP(Date,Time), 3600) as Hour, Count(*) as Errors INTO DATAGRID FROM %logfile% GROUP BY Hour
URLScan logs should be baselined for the same reasons you baseline HTTP.SYS error logs. Following is the query for total rejected requests per client IP:The following query will show you the total rejected requests by URL. The output is shown in Figure 9.
SELECT c-ip as ClientIP, Count (*) as RejectedRequests INTO DATAGRID FROM %logfile% GROUP BY ClientIP ORDER BY RejectedRequests DESC
SELECT cs-uri as URL, Count (*) as RejectedRequests INTO DATAGRID FROM %logfile% GROUP BY URL ORDER BY RejectedRequests DESC
Analyzing an Attack or Event
Analysis starts with gathering the equivalent data related to the baselines discussed here. To this end, rerun the data gathering queries I've outlined and compare the data. Saving the example queries as well as any new or modified queries and processes you have come up with makes this process more repeatable and is highly recommended. The more straightforward and repeatable the processes, the more likely the analysis will prove useful.
Comparison of anomalous data against baselines generally involves comparing the associated numbers against each other for the given queries and then digging in from there. For instance, the TCPFlag distribution query output example showed a 4 percent SYN distribution for normal traffic periods. Doing this same data collection during an event might show something like you see in Figure 10. Given the wide shift of SYN and SYN-ACK packets, we would begin defending based on this being a SYN flood. But what if the data isn’t so clear?
Digging in Deeper
Based on the data from a given query, you may want to use more specific queries to find more information about the client or clients making requests at anomalous rates. For instance, you have gone through the work of automating alerting on SYNs/SEC/SrcIP and the system has detected a large spike in the number of SYNs for several hosts. The output of the SYNsPerSecPerSrcIP query might look something like Figure 11.
Since the number of SYNs per client IP is not that high, we might be looking at a full HTTP GET attack or a distributed SYN flood where each IP does not send a significant number of SYNs. Since we have discussed the anomalous TCPFlags distribution scenario, we will assume the distribution in this new example is actually within the norms. Based on that, a review of the actual requests reaching IIS, HTTP.SYS, and URLScan logs would be in order.
By using the CIP-URI.sql query to look at the IIS logs, with a modification to grab data specific to the particular IP address that we suspect is causing issues, we can quickly determine what that client is doing that might be anomalous.
The SQL file should now contain an update with a WHERE clause for the specific IP address or addresses in question, like so:This will show data about the actual requests these two clients are making as well as the number of requests. From there you can make decisions about what actions to take to mitigate the issue.
SELECT c-ip AS ClientIP, cs-uri-stem AS URIStem, sc-status AS Status, cs-host AS HostName, cs(User-Agent) AS UserAgent, count (*) as Requests INTO DATAGRID FROM %logfile% WHERE c-ip IN (‘10.1.1.1’;’10.2.2.2’) GROUP BY c-ip, cs-uri-stem, cs-host, cs(User-Agent), sc-status ORDER BY Requests DESC
Keep in mind that while you may notice a SYN flood or full HTTP GET request flood, there may be another attack happening behind the scenes. This query might show this if the requests are making it up to IIS; however, at this point it is useful to look at the traffic that the given clients are sending that is not making it to IIS.
For this we would update the URLScan and HTTP.SYS queries to zero in on the specific IPs as well as running the general comparison. As an example of one of the more specific queries, we might modify the HTTPSYSInvalidRequestsPerClientIP.sql query to include the error reason as follows:Further digging will be required to determine all of the requests a particular client is making against your systems and applications, but these examples give you an idea of the approach you can take to get there.
SELECT c-ip AS ClientIP, s-reason AS ErrorReason, Count(*) as Errors INTO DATAGRID FROM %logfile% WHERE c-ip = ‘10.1.1.1’ GROUP BY c-ip, s-reason ORDER BY Errors DESC
Clearly, gathering the baseline data just once will not provide significant benefit and manually gathering the data multiple times can be prohibitive given the many other tasks required to keep a site up and running. The optimal solution is to automate data gathering, storage, and reporting. Depending on the resources available, the level of automation will vary greatly.
However, even with minimal resources, it is possible to automate the collection of a great deal of this baseline and event data. For instance, using the INTO clause from Log Parser to inject the data into a SQL database and running the Log Parser commands from scripts in scheduled tasks provide the rudimentary data collection automation necessary to make this task reasonable from a time investment and the return on investment is huge if there is an anomalous event you must investigate.
Keep in mind that you will also need to investigate other information sources to understand the full picture. You also must collect the right server performance data, such as CPU and memory utilization, to fully understand the baseline performance characteristics of a site. Each new application deployment will bring with it the need to baseline the new characteristics so tracking over time is essential to ensuring the correct and current outlook during an attack or event.
Wrapping It Up
I’ve covered some of the basic uses of Log Parser and Netmon for baselining normal traffic and for analyzing anomalous events. While this is by no means an exhaustive review, this information will help you expand your understanding of the traffic patterns and client behaviors on your site and help you prepare for anomalous events. Look for more information on this subject as well as other topics relating to our Microsoft.com operations team at blogs.technet.com/mscom.
The following members of the Microsoft.com team have contributed to this column: Paul Wright, Technology Architect Manager; Ivan Ivanov, Technology Architect; Guy Gordon, Program Manager; and Jeffrey A. Johnson, Senior Systems Engineer.© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.