Data Validation – Step One in Improving the Security of Your Web Applications


By Rudolph Araujo, Microsoft MVP – Developer Security


The lack of data validation in Web applications has gone beyond just being a problem with a single application because it now has an impact on entire organizations and the larger Internet community. For example, automated SQL injection attacks are used not only to steal data such as credit card numbers, but also, increasingly, to help spread malware to visitors of websites. A survey of common vulnerabilities affecting the security of Web applications shows that most of them result from the lack of data validation. However, security in general and data validation in particular often receive little to no attention during the software development cycle, which means vulnerabilities aren’t discovered until “penetration testing” or, worse still, when an attacker actually compromises the application in question.


Before we get too much further it would help to define what we mean by the term “data validation.” Data seems easy enough to understand, but perhaps the term is so familiar that we ignore what it encompasses. By data we mean any bits that interact with the application: inputs from the user or the system, such as configuration or data files, and outputs such as log files or content emitted as HTML, SQL, LDAP, or a variety of other protocols. Data is essentially any or all information processed, transmitted, or stored by the application.

As to the validation part of data validation, think of it as the verification of four fundamental properties of the data: length, range, format, and type.

Length refers to the size of the data—the number of characters in a string or the number of bytes about to be placed into a character buffer. It is easy to get these two confused and a lot of buffer overflows result precisely from this confusion.
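The difference between the two measures is easy to see in a short sketch. The example below is in Python, and the function name fits_in_buffer and the sample string are assumptions made purely for illustration; the point is only that the character count and the byte count of the same string can differ once it is encoded.

```python
def fits_in_buffer(text: str, buffer_size: int, encoding: str = "utf-8") -> bool:
    """Return True if the encoded form of text fits in buffer_size bytes."""
    return len(text.encode(encoding)) <= buffer_size


# "Zo\u00eb" is the three-character name "Zoe" with a diaeresis on the "e".
name = "Zo\u00eb"

char_count = len(name)                  # number of characters
byte_count = len(name.encode("utf-8"))  # number of bytes once encoded as UTF-8

print(char_count, byte_count)
print(fits_in_buffer(name, 3))  # a check based on characters alone would say yes
```

A check that compares the character count against a byte-sized buffer would accept this string even though its encoded form does not fit.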

Range refers to deciding whether a value is valid in a given context. For example, if the price of an item is set to a negative value, it should indicate a problem more often than not.

Format refers to verifying form or what the data should look like. Like range, this is usually a function of context and the real-world object the data represents. For instance, a sequence of nine digits in one context could be a telephone number, a zip code, or perhaps a social security number. Of course it could also be just a nine-digit integer or a primary key in the database. Format checks verify that the data is in the expected format and special characters—such as the hyphens in a social security number or letters in a British postal code—are in place.

Type is the raw data type associated with the underlying item. This is most often an issue in weakly typed languages, such as the scripting languages, where type mismatches can produce unpredictable results in an application. For instance, consider what happens when, instead of passing a numeric field such as your age, you pass a string containing SQL fragments.

One effective strategy for successful data validation is to define the rules for the four properties above as part of the design phase (perhaps within a data-flow diagram, since that in itself identifies all the different data elements within the application). Considering data validation during design also ensures that it is not left to developers to implement in their own way or not at all. In fact, considering validation during design allows for deploying a centralized data validation strategy. A common approach is to implement a data validation funnel, as shown in the figure below.


As the figure illustrates, a validation funnel siphons all inputs through a single validation module; it handles outputs similarly. The funnels need not be separate physical components; they can be shared classes that, for instance, filter and sanitize all inputs and outputs based on a set of rules. Ideally this set of rules should be configurable without having to rebuild the module.
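One way to picture such a funnel is as a single function that every input passes through, driven by a table of rules. The following is a minimal sketch in Python, not a definitive implementation: the Rule class, the RULES table, the field names, and the validate function are all hypothetical, and in a real system the rules would more likely be loaded from a configuration file so they can change without a rebuild. Only the input side is shown.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union


class ValidationError(ValueError):
    """Raised when a value breaks the rule registered for its field."""


@dataclass(frozen=True)
class Rule:
    kind: type                     # type check: str or int in this sketch
    max_length: int                # length check, in characters
    pattern: str                   # format check, matched against the whole value
    minimum: Optional[int] = None  # range check, numeric fields only
    maximum: Optional[int] = None


# Hypothetical rule set; field names and limits are illustrative assumptions.
RULES = {
    "username": Rule(kind=str, max_length=20, pattern=r"[A-Za-z0-9_]+"),
    "quantity": Rule(kind=int, max_length=4, pattern=r"[0-9]+",
                     minimum=1, maximum=1000),
}


def validate(field: str, raw: str) -> Union[str, int]:
    """Funnel entry point: apply length, format, type, and range checks."""
    rule = RULES.get(field)
    if rule is None:
        # Whitelist behaviour: fields without a rule are rejected outright.
        raise ValidationError(f"no rule defined for field {field!r}")
    if len(raw) > rule.max_length:
        raise ValidationError(f"{field}: value is too long")
    if re.fullmatch(rule.pattern, raw) is None:
        raise ValidationError(f"{field}: value has an unexpected format")
    value = rule.kind(raw)  # safe here because the pattern already matched
    if rule.minimum is not None and value < rule.minimum:
        raise ValidationError(f"{field}: value is below the allowed range")
    if rule.maximum is not None and value > rule.maximum:
        raise ValidationError(f"{field}: value is above the allowed range")
    return value


print(validate("username", "alice_01"))
print(validate("quantity", "12"))
```

Because unknown fields are rejected rather than passed through, adding a new input to the application forces someone to write a rule for it.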

Once the application architecture has a centralized data validation chokepoint, the next consideration is what data needs to be validated. We do want to balance both performance and security; hence, validating data at every level and within every component and function is neither desired nor required. This is where the concept of a trust boundary can be useful. A trust boundary is a logical edge at which one side does not trust the other. For instance, a trust boundary usually should exist between the client and the server, or across a remoting API that is shared by both internal applications and partner applications. Most often, the trust boundary can be defined at the location where the policies associated with a system change. A good way to identify these locations is to look for network controls such as firewalls and VLANs, or for authentication mechanisms.

Once your trust boundaries have been defined and the data to be validated has been identified, all that is left to do is to actually validate the data. This is where we go back to the four key properties we discussed above.

Most developers using C# or VB.NET do not typically have to worry about length and buffer checks because of automated memory management. However, even with these languages, validating for length can help prevent unnecessary reallocations and memory copies. Further, especially when dealing with data that may be sensitive, developers would like to avoid multiple copies being left behind in memory with only the garbage collector controlling when they would be cleared.

Range validation ensures that the data falls within the range of values the application expects. The canonical example of this is the prices of goods on e-commerce websites. A number of websites and online shopping-cart services continue to be vulnerable to “negative price / shipping cost / tax” attacks, wherein the attacker can influence the price she pays.

Similarly, when dealing with numbers, it is important to understand the range of the base numeric type being used to store the number and the difference between signed and unsigned numbers. For instance, what happens when you increment a number beyond its maximum value or decrement it below its minimum value? How does that impact application logic and security? When dealing with values returned from drop-down or list boxes, it is best to implement a data indirection pattern wherein only the option index is obtained from the client, and to return an error if the index does not fall within an acceptable range.

In general, two approaches are common with range-based validation: blacklist and whitelist data validation. As the names suggest, blacklist data validation involves creating a list of “bad” data items that are then blocked. Whitelisting, on the other hand, involves creating a list of items that are accepted based on business rules while dropping everything else. As one would expect, it is much easier to build an all-encompassing whitelist than it is to build a blacklist that is effective in blocking all current and potential future attacks. Therein is the major problem: your blacklist is only as effective as your current knowledge of attack patterns.
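The data indirection pattern and a simple range check can be sketched as follows. This is an illustrative Python example under stated assumptions: the SHIPPING_OPTIONS table, the MAX_QUANTITY limit, and both function names are hypothetical. The client submits only an index, and the server looks up the real value from its own list, which is itself a small whitelist.

```python
from decimal import Decimal

# Server-side list; the client only ever sends an index into it.
# The options and prices are illustrative assumptions.
SHIPPING_OPTIONS = [
    ("Standard", Decimal("4.99")),
    ("Express", Decimal("14.99")),
    ("Overnight", Decimal("29.99")),
]

MAX_QUANTITY = 100  # assumed business rule


def shipping_cost(option_index: int) -> Decimal:
    """Look up the cost for a client-supplied index, rejecting bad indexes."""
    # The explicit lower bound matters in Python: a negative index would
    # otherwise silently select an entry counted from the end of the list.
    if not 0 <= option_index < len(SHIPPING_OPTIONS):
        raise ValueError("shipping option index is out of range")
    return SHIPPING_OPTIONS[option_index][1]


def line_total(unit_price: Decimal, quantity: int) -> Decimal:
    """Compute a line total, refusing negative prices and odd quantities."""
    if unit_price < 0:
        raise ValueError("price must not be negative")
    if not 1 <= quantity <= MAX_QUANTITY:
        raise ValueError("quantity is out of range")
    return unit_price * quantity


print(shipping_cost(1))
print(line_total(Decimal("19.99"), 2))
```

Since the price never travels through the client, there is nothing for an attacker to tamper with except the index, and that is range-checked.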

Format is perhaps the aspect of validation that is most ingrained in the business logic of an application. In most cases, format checks entail checking whether the programmatic representation of an entity is consistent with its real-world counterpart. There are a number of effective mechanisms for performing such format validations, but perhaps the most efficient and elegant approach is to take advantage of regular expressions. In ASP.NET, for example, this can be done using the asp:RegularExpressionValidator control[i]. When dealing with XML data representations, this can be taken even further by using an XML Schema (XSD) to perform granular validation against the data elements contained within the XML document. Such validation is increasingly relevant, since a number of attacks these days attempt to compromise not the application but the XML parser running within it, for example through an XML denial-of-service (XDoS) attack.
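As a small illustration of a regular-expression format check, here is a sketch in Python that verifies a social security number is written with its hyphens in place. The pattern and the function name is_valid_ssn_format are assumptions for the example; the check covers only the shape of the value, not whether the number was ever issued.

```python
import re

# Three digits, hyphen, two digits, hyphen, four digits.
# [0-9] is used instead of \d so that only ASCII digits are accepted.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def is_valid_ssn_format(value: str) -> bool:
    """Return True only if the whole value matches the expected format."""
    # fullmatch requires the entire string to match, so trailing text or a
    # trailing newline after an otherwise valid number is rejected.
    return SSN_PATTERN.fullmatch(value) is not None


print(is_valid_ssn_format("123-45-6789"))
print(is_valid_ssn_format("123-45-6789' OR '1'='1"))
```

Matching the whole value, rather than searching for the pattern somewhere inside it, is what turns the expression into a whitelist.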

Format validation, however, has another important dimension that is often forgotten and can be the source of numerous and repeated problems. These problems arise primarily because the exact same data can be represented in multiple different formats. For instance, consider the less-than symbol, “<”, which can be represented as “&lt;” when HTML encoded, or as “&#x3c;” or even “&#60;”. Other common encoding formats on the Web include URL encoding and hexadecimal encoding. Given the multitude of encoding formats, canonicalization becomes critical, and all validation must be performed after data has been decoded into its most basic form.
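A rough sketch of canonicalizing before validating, using Python's standard library, might look like the following. The function name canonicalize and the MAX_DECODE_ROUNDS limit are assumptions for illustration, and the sketch handles only URL encoding and HTML entities. How many layers to decode is a design decision that should mirror the decoding the application itself performs, since decoding more aggressively than that can alter legitimate data.

```python
import html
from urllib.parse import unquote

MAX_DECODE_ROUNDS = 5  # assumed limit on nested encodings


def canonicalize(value: str) -> str:
    """Decode URL and HTML encodings repeatedly until the value is stable."""
    for _ in range(MAX_DECODE_ROUNDS):
        decoded = html.unescape(unquote(value))
        if decoded == value:
            return value  # nothing left to decode
        value = decoded
    # Input that is still changing after several rounds is treated as hostile.
    raise ValueError("input is nested too deeply in encodings")


# Different spellings of the same less-than character.
for sample in ["&lt;", "&#x3c;", "&#60;", "%3C", "%253C"]:
    print(repr(sample), "->", repr(canonicalize(sample)))
```

Any check for dangerous characters would then run against the value returned by canonicalize rather than against the raw input.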

A special case of format checking occurs when dealing with file uploads or downloads. In our experience, we have only very rarely found examples of applications that have actually implemented both in a secure manner. For instance, with file uploads, checking MIME types and performing selective virus scanning is regarded as a good practice. Similarly, file uploads should be throttled to avoid disk space exhaustion attacks. In the case of downloads on the other hand, developers must be concerned that arbitrary files cannot be downloaded from outside the Web root in a Web application. Such issues can be handled by using functions such as Server.MapPath[ii]. Developers must decide, for instance, whether path components and relative paths would be allowed at all. Similarly, it is important that all access control is performed on the basis of file system based access control lists rather than simply on the name. This is especially significant when 8.3 names are enabled on the system. For instance, ThisIs~1.doc is identical to the document ThisIsASecretDocument.doc when they are in the same folder. In this case, if your access control were based on the whole name matching, an attacker could trivially subvert your access control mechanism by using the 8.3 file name. Hence, all access control must be based on handles rather than names.
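To illustrate the download case, the sketch below shows one way to confirm that a requested file stays inside a designated root directory. It is written in Python as an analogy, not as a rendering of Server.MapPath; the function name resolve_download_path and the directory /srv/downloads are hypothetical. It does not replace file system access control lists, which should still govern who may read the file.

```python
import os


def resolve_download_path(root: str, requested: str) -> str:
    """Return the real path of requested, or raise if it escapes root."""
    root_real = os.path.realpath(root)
    # realpath collapses ".." components and resolves symbolic links, so the
    # comparison below is made on the canonical form of both paths.
    candidate = os.path.realpath(os.path.join(root_real, requested))
    try:
        inside = os.path.commonpath([root_real, candidate]) == root_real
    except ValueError:
        # Raised, for example, when the paths are on different Windows drives.
        inside = False
    if not inside:
        raise PermissionError("requested path is outside the download root")
    return candidate


print(resolve_download_path("/srv/downloads", "reports/q1.pdf"))
```

Requests such as "../../etc/passwd" or an absolute path resolve to a location outside the root and are refused before any file is opened.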

Type is perhaps the most often ignored and least often validated attribute of the data, especially when dealing with strongly typed languages such as C, C++, or C#. However, in weakly typed scripting languages such as Perl or JavaScript, it is important to ensure that if the application is expecting a string, then that is what is presented to the application (rather than a numeric type, for example). In the scripting languages, this is best done by requiring, as a matter of coding standards, that all variables be declared with a type before they are ever used. In VB.NET this can be enforced through the use of settings such as Option Explicit[iii] and Option Strict[iv]. In .NET, the reflection mechanism provides an effective and efficient way of querying object metadata and identifying data types, which is especially useful when dealing with dynamic code or object serialization attacks.
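In a dynamically typed language, a type check usually means converting the raw input explicitly and refusing anything that does not convert cleanly. The following Python sketch returns to the earlier example of an age field; the function name parse_age and the upper bound of 150 are assumptions for illustration.

```python
def parse_age(raw: object) -> int:
    """Convert a raw form value to an integer age, or raise on bad input."""
    if not isinstance(raw, str):
        raise TypeError("age must arrive as text from the request")
    # Accept ASCII digits only; signs, spaces, and SQL fragments are rejected.
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError("age must contain digits only")
    age = int(raw)
    if not 0 <= age <= 150:  # assumed plausible range
        raise ValueError("age is out of range")
    return age


print(parse_age("42"))

try:
    parse_age("42 OR 1=1")
except ValueError as error:
    print("rejected:", error)
```

Downstream code then works with an integer, so a string containing SQL fragments never reaches the query that uses the age.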


Data validation, or the lack thereof, is perhaps the biggest cause of security vulnerabilities in applications. From buffer overflows and LDAP injection to cross-site scripting and SQL injection, all of these vulnerabilities can be prevented or mitigated through the use of effective data validation. As with other methods of building security into applications, and with other categories of security vulnerabilities, data validation is hard if not impossible to retrofit into an application. Hence, the only effective and efficient practice is to consider data-validation strategies as an integral and central part of the application architecture. Doing so also ensures that data validation is consistently implemented and that it covers all trust boundaries and all data being processed, transmitted, or stored in the system. Once architected properly, data validation then simply comes down to basic length, range, format, and type checks.


Rudolph Araujo, a Technical Director at Foundstone, is responsible for leading the software and application security service lines. He also leads the content creation and training delivery for Foundstone’s software security classes. Rudolph’s varied experience at Foundstone includes helping secure custom operating system kernels, hardware virtualization layers, and device drivers, as well as user-mode standalone, client/server, and Web applications. Rudolph is an experienced C/C++ and C#/.NET developer and the author of Foundstone’s .NET Security Toolkit, SSLDigger, and Hacme Bank tools. He is also a regular contributor to MSDN’s webcast series and to multiple online and print journals such as MSDN and Software Magazine. Rudolph has been honored for the last three years in a row with the Microsoft Visual Developer – Security MVP Award in recognition of his thought leadership and contributions to the security and developer communities. He has also written the foreword for the Microsoft Patterns and Practices Group’s Web Services Security Guide and is a contributing author to the book Developing More-Secure Microsoft ASP.NET 2.0 Applications.