Communication & Collaboration
Fight Spam on Your Terms with Custom Weight Lists
Cam Frenette and Alexander Nikolayev
At a Glance:
- Custom Weight Lists
- The Intelligent Message Filter
- How to filter messages to stop spam
- How to search text appropriately
Spam, as we all know, is a huge problem. It clogs up your servers, aggravates your users, sucks up your bandwidth, and communicates unwanted and often inappropriate messages.
Can anything be done to stop it?
Well, if you're running Microsoft® Exchange Server 2003 you may have noticed that new features and functionality released last October in Service Pack 2 (SP2) significantly improved its ability to withstand different vectors of spam attacks. With multiple layers of anti-spam defense, Exchange Server 2003 can provide strong protection against unwanted messages. One of the most important elements in the Exchange anti-spam framework is the Intelligent Message Filter (IMF), which enables content filtering during the last stage of server anti-spam processing. Inside IMF, a little-known module called the Custom Weight List (CWL) turbocharges the anti-spam efforts of SP2. Let's take a look at what it does and how to use it.
What is a Custom Weight List?
The CWL is an XML file that contains a number of words or phrases that, if found in the body or subject of the message, can trigger modification of the final spam confidence level (SCL) assigned by the IMF. They are called Custom Weight entries and they look like this:
<?xml version="1.0" encoding="utf-8"?>
<CustomWeightEntries xmlns="http://schemas.microsoft.com/2005/CustomWeight">
  <CustomWeightEntry Type=" " Change=" " Text=" " />
  ...
</CustomWeightEntries>
As you can see, each entry has three attributes that control how an e-mail message is examined for a word or phrase match: Type, Change, and Text.
The Text attribute comes last in each entry, but it is the content the IMF will be looking for in the e-mail. It can be a single word or a phrase of up to 1000 characters, and it can contain escaped characters, upper ASCII characters, double-byte characters, or any other Unicode text. Each entry has one Text attribute, and a CWL file can contain many entries. One of the most common uses of Text attributes is to block adult-oriented advertising based on keywords in the message subject and body.
Where the IMF will look is determined by the Type attribute, which indicates the part of the e-mail your Text should be matched against. Use one of the following values to search the subject of a message, the body, or both:
Type = SUBJECT
Type = BODY
Type = BOTH
The Change attribute can be any integer value (positive or negative) or one of the keywords MIN and MAX. It defines how the SCL value of the message will be changed if the Text attribute is matched. If a match is found on an entry with an integer Change, that value is added to the original SCL value (or effectively subtracted, if the value is negative). If the combined (original and CWL) value falls outside the SCL range, the final SCL assignment is normalized to the 0–9 range. Any time a phrase with the MIN keyword is matched, the message is given an SCL of 0 regardless of any other weights. Any time a phrase with the MAX keyword is matched, the message is given an SCL of 9 regardless of any other integer weights. If there's a match for both MIN and MAX, the message is given an SCL of 0. To ensure that a mail item with matching characteristics will be blocked, use the MAX keyword.
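To make the three attributes concrete, here is a minimal Python sketch that parses a CWL document and checks each entry against the rules described above. It is not part of Exchange or the IMF; the function names (`parse_change`, `load_entries`) are hypothetical, and treating the keywords and Type values as case-sensitive is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample CWL file shown earlier.
CWL_NS = "{http://schemas.microsoft.com/2005/CustomWeight}"
VALID_TYPES = {"SUBJECT", "BODY", "BOTH"}
MAX_TEXT_LENGTH = 1000  # upper bound on Text stated in the article


def parse_change(raw):
    """Return 'MIN', 'MAX', or an int; raise ValueError for anything else."""
    value = raw.strip()
    if value in ("MIN", "MAX"):  # assumption: keywords are case-sensitive
        return value
    return int(value)  # int() raises ValueError on non-integers


def load_entries(xml_bytes):
    """Parse CWL XML and return a list of (type, change, text) tuples."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for node in root.iter(CWL_NS + "CustomWeightEntry"):
        entry_type = node.get("Type", "")
        if entry_type not in VALID_TYPES:
            raise ValueError(f"unknown Type: {entry_type!r}")
        text = node.get("Text", "")
        if not text.strip() or len(text) > MAX_TEXT_LENGTH:
            raise ValueError("Text must be 1 to 1000 characters long")
        change = parse_change(node.get("Change", ""))
        entries.append((entry_type, change, text))
    return entries


SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<CustomWeightEntries xmlns="http://schemas.microsoft.com/2005/CustomWeight">
  <CustomWeightEntry Type="SUBJECT" Change="MAX" Text="Bad Subject" />
  <CustomWeightEntry Type="BODY" Change="-2" Text="Banana" />
</CustomWeightEntries>"""

entries = load_entries(SAMPLE.encode("utf-8"))
for entry in entries:
    print(entry)
```

Running a check like this before restarting the services can catch a mistyped Type or Change value early.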
SCL values for mail items fall between 0 and 9, with 0 indicating the least likelihood that the message is spam and 9 indicating the most. There is an additional value of "-1" (reserved by Exchange Server), which implies that the mail came from a trusted, authenticated source (another authenticated Exchange Server) and as such it is exempt from the anti-spam processing. Use of this value is enabled by default; to disable it, please read the Intelligent Message Filter Operations Guide at go.microsoft.com/fwlink/?LinkId=71143. Figure 1 shows how to interpret the SCL values.
Figure 1 Interpreting SCL Values
|SCL value|Meaning|
|---|---|
|-1|Messages coming from a trusted source (authenticated Exchange Server)|
|0|Messages categorized as not spam|
|1–5|The likelihood of messages being spam is extremely low to low|
|6–9|The likelihood of messages being spam is high to extremely high|
As noted earlier, the IMF comes into play during the last stage of anti-spam processing. After a mail item passes through the earlier layers of the Exchange Server 2003 anti-spam framework (such as the Connection and Protocol Filtering layers), it is evaluated by the IMF, which assigns the appropriate SCL. The mail is then checked against the CWL to see whether it matches any Custom Weight entries. If a match is found, the Change attribute is applied to the SCL value of the message. Hence, the final SCL assignment looks like this:
Final SCL = original IMF SCL + Change values of all matched CWL entries
The Custom Weight entries are loaded into memory when the IMF is starting up; any changes to the CWL file require a restart of corresponding Exchange and SMTP services to get the new entries loaded into memory. The entries are stored as hashtables to allow fast look-up for both single word and phrase matches while only parsing messages a single time.
Using the Custom Weight List
If you're under a spam attack and the offending mail items have a consistent phrase in their subject line, you can add that part of the line to your CWL entries, as shown here:
<CustomWeightEntry Type="SUBJECT" Change="MAX" Text="Bad Subject" />
This would ensure that any time a message with the subject line "Bad Subject" is encountered, the message's SCL would be forced to 9 and the message would be blocked.
On the other hand, if IMF is incorrectly classifying a message as spam, you can rescue that message. Suppose, for example, an auto-generated build status mail with the subject line "Build System Daily Mail" is getting blocked. You can add the following CWL entry:
<CustomWeightEntry Type="SUBJECT" Change="MIN" Text="Build System Daily Mail" />
This would ensure that such mail would get an SCL of 0 and would not be blocked.
If you have a list of key words or phrases that you want to use as modifiers to increase or decrease the SCL of a message, you can simply add them to the CWL entries. The following example illustrates that the word pear should be seen as a spam indicator in the subject; orange should be a spam indicator in the subject or body; banana should indicate good mail when in the body; and strawberry should indicate good mail when in either subject or body:
<CustomWeightEntry Type="SUBJECT" Change="3" Text="Pear" />
<CustomWeightEntry Type="BOTH" Change="5" Text="Orange" />
<CustomWeightEntry Type="BODY" Change="-2" Text="Banana" />
<CustomWeightEntry Type="BOTH" Change="-4" Text="Strawberry" />
After encountering the appropriate match, the IMF will adjust the final SCL value set on the message.
Order of Precedence
What happens when you have a mix of positive and negative weights, along with some MIN and MAX weights in your Custom Weight entries? If, by mistake, you assign the MIN and MAX keywords to the same Custom Weight entry, MIN takes precedence and the final SCL assignment is 0. This is done to prevent accidental message blocking due to CWL file misconfiguration. Precedence works as follows:
- If an entry with MIN change is matched, the SCL of the message will be 0.
- If an entry with MAX change is matched and no MIN entry is matched, the SCL of the message will be 9.
- If no MIN or MAX entries are matched, then the SCL of the message will be the SCL determined by the IMF plus the change of any matched Custom Weight entries.
For example, suppose you have the following Custom Weight entries:
<CustomWeightEntry Type="BODY" Change="MIN" Text="hello" />
<CustomWeightEntry Type="BODY" Change="MAX" Text="world" />
<CustomWeightEntry Type="BODY" Change="1" Text="Internet" />
<CustomWeightEntry Type="BODY" Change="-3" Text="place" />
Let's see what happens if the incoming mail contains the following string in the message body: "Hello world and welcome to the Internet place". The IMF returns an SCL of 4 but the final SCL assignment would be 0. This is because the word "hello" has been matched and that entry has a MIN change keyword. None of the other entries are taken into consideration at this point.
Now let's see what happens if the message body contains the string "Welcome to the world and the Internet place." Regardless of the value originally produced by the IMF, the final SCL value would be set to 9 because the Text attribute value "world" was matched and the entry includes the MAX keyword. As there is no MIN entry to override, the message will be given an SCL 9 and blocked.
Here's another interesting example. Let's say the message body contains the string "Internet place" and the IMF returned an SCL of 6. What would be the final SCL assignment? It would be 4: the original SCL of 6 from the IMF, plus 1 because of the "Internet" entry match and minus 3 because of the "place" entry match. Neither the MIN nor the MAX entry is matched, so the integer weights are simply added.
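The precedence rules and the three examples above can be sketched in a few lines of Python. This is an illustration only, not the IMF's actual code: `final_scl` is a hypothetical name, and clamping out-of-range sums to 0 through 9 is an assumption about what "normalized" means.

```python
def final_scl(imf_scl, matched_changes):
    """Combine the IMF's SCL with the Change values of all matched entries.

    matched_changes holds one item per matched entry: "MIN", "MAX", or an int.
    """
    if "MIN" in matched_changes:   # MIN wins over everything, including MAX
        return 0
    if "MAX" in matched_changes:   # MAX applies only when no MIN entry matched
        return 9
    total = imf_scl + sum(matched_changes)
    return max(0, min(9, total))   # assumption: out-of-range sums are clamped


# The three examples, expressed as the Change values of the matched entries.
print(final_scl(4, ["MIN", "MAX", 1, -3]))  # "hello" (MIN) matched: 0
print(final_scl(4, ["MAX", 1, -3]))         # "world" (MAX) matched, no MIN: 9
print(final_scl(6, [1, -3]))                # only integer weights matched: 4
```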
Substrings, Phrases, and URLs
CWLs will not match an entry to a substring of a word, but will match a shorter phrase to part of a longer phrase. Say you want to block mail with "Free Watches" in the subject line, so you put the following Custom Weight entry in your CWL file:
<CustomWeightEntry Type="SUBJECT" Change="MAX" Text="watch" />
This will not work as intended. You won't hit the CWL since "watch" is only a substring of "watches". You would need to explicitly include the word "watches" or the whole phrase "Free Watches".
Now suppose, based on this example, you've changed the Custom Weight entry from "watch" to "Free Watches" and your entry looks like this:
<CustomWeightEntry Type="SUBJECT" Change="MAX" Text="Free Watches" />
If the subject of the e-mail is "Get your Free Watches Now!!!", you'd have a match. This is because the phrase in the CWL file is matched against part of the longer phrase found in the subject.
Now let's say you have the following URL in your Custom Weight entries:
<CustomWeightEntry Type="SUBJECT" Change="MIN" Text="microsoft.com" />
This token would match a subject that contains a longer URL, such as www.microsoft.com.
This is due to the way custom weighting tokenizes the phrases it sees. In essence, the phrases are split into words any time there's a break in the character type from alphanumeric to anything else: spaces, tabs, commas, periods, slashes, and so on. (Whitespace, though, is different. It is thrown away, while any of the punctuation characters become a word in the phrase.)
In this example, www.microsoft.com gets translated into a five-token phrase: www, period, microsoft, period, com. The Custom Weight entry gets translated into a three-token phrase: microsoft, period, com. The Custom Weight entry therefore matches part of the phrase in the subject.
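The tokenizing and matching behavior described above can be approximated with a short Python sketch. It is an illustration, not the IMF's implementation: `tokenize` and `phrase_matches` are hypothetical names, the regular expression is one plausible reading of "a break in the character type", and case-insensitive comparison is an assumption drawn from the earlier examples (where the entry "hello" matches "Hello").

```python
import re

# One token per run of letters/digits, or per single punctuation character.
# Whitespace is discarded. Treating "_" as punctuation is an assumption.
TOKEN_RE = re.compile(r"[^\W_]+|[^\w\s]|_")


def tokenize(text):
    """Split text into word and punctuation tokens, ignoring case."""
    return TOKEN_RE.findall(text.casefold())


def phrase_matches(entry_text, message_text):
    """True if the entry's tokens appear as a contiguous run in the message."""
    needle = tokenize(entry_text)
    haystack = tokenize(message_text)
    if not needle:
        return False
    width = len(needle)
    return any(haystack[i:i + width] == needle
               for i in range(len(haystack) - width + 1))


print(tokenize("www.microsoft.com"))
print(phrase_matches("microsoft.com", "www.microsoft.com"))
print(phrase_matches("Free Watches", "Get your Free Watches Now!!!"))
print(phrase_matches("watch", "Free Watches"))
```

Because matching works on whole tokens, "watch" never matches "Watches", while "microsoft.com" matches the last three tokens of "www.microsoft.com".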
Since the Custom Weight entries are stored in an XML file, there are certain characters that can't be used in a search entry without encoding. For example, if you wanted to match e-mails that contained the phrase "<Hello>", you would encode the angle brackets as &lt; and &gt;, and your Custom Weight XML would contain the following entry:
<CustomWeightEntry Type="SUBJECT" Change="MIN" Text="&lt;Hello&gt;" />
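If you generate entries with a script, the encoding can be left to a library. The following Python sketch uses the standard library's `quoteattr` to escape the Text value; `make_entry` is a hypothetical helper, not an Exchange tool.

```python
from xml.sax.saxutils import quoteattr


def make_entry(entry_type, change, text):
    """Build one CustomWeightEntry line with the Text value XML-escaped."""
    # quoteattr escapes &, <, and > and adds the surrounding quotes itself.
    return (f'<CustomWeightEntry Type="{entry_type}" '
            f'Change="{change}" Text={quoteattr(text)} />')


print(make_entry("SUBJECT", "MIN", "<Hello>"))
# <CustomWeightEntry Type="SUBJECT" Change="MIN" Text="&lt;Hello&gt;" />
```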
CWL functionality supports Unicode characters in the CWL file. The actual custom weighting process is language agnostic—as long as the word or phrase in the entries matches a word or phrase in the e-mail being processed, then the Custom Weight will be used. So the Custom Weight entries in Figure 2 are perfectly legit.
Figure 2 Custom Weight Entries with Unicode Text
<CustomWeightEntry Type="SUBJECT" Change="MIN" Text="こんにちは" />
<CustomWeightEntry Type="SUBJECT" Change="-7" Text="Bienvenue à Redmond" />
<CustomWeightEntry Type="SUBJECT" Change="-4" Text="Первый" />
<CustomWeightEntry Type="BODY" Change="9" Text="Verlängertes Angebot" />
<CustomWeightEntry Type="SUBJECT" Change="MIN" Text="特別提供" />
<CustomWeightEntry Type="BOTH" Change="MAX" Text="Offre spéciale" />
There are, however, a few caveats to using high-ASCII and double-byte characters. First, you need to make sure the encoding declared in your Custom Weight XML header is correct and that the file is actually saved in that encoding.
Second, the e-mail you are checking needs to be properly encoded. If the MIME type of the e-mail is incorrect or the wrong character-encoding is specified, the IMF might not be able to process the Custom Weight entries correctly.
Third, due to the way strings are tokenized in custom weighting, phrases with double-byte characters, especially in East Asian languages, are tough to create for use with the CWL file. This is because the punctuation and whitespace characters used to tokenize many European languages don't map well onto text in those languages. If you are setting up your CWL to be used with double-byte characters, make sure to give it a round of testing to find a correctly matching Custom Weight entry.
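For the first caveat, one way to keep the declared and actual encodings in step is to let an XML library write the file. The sketch below uses Python's standard `xml.etree.ElementTree` and writes to a temporary directory; `write_cwl` is a hypothetical helper, and whether the IMF accepts every detail of this output (such as the quote style in the XML declaration) is an assumption you should verify on a test server.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

NS = "http://schemas.microsoft.com/2005/CustomWeight"
ET.register_namespace("", NS)  # emit the namespace as the default xmlns


def write_cwl(path, entries):
    """Write (type, change, text) tuples as UTF-8 with a matching XML header."""
    root = ET.Element(f"{{{NS}}}CustomWeightEntries")
    for entry_type, change, text in entries:
        ET.SubElement(root, f"{{{NS}}}CustomWeightEntry",
                      {"Type": entry_type, "Change": str(change), "Text": text})
    # The declaration and the bytes on disk are produced from the same setting.
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


entries = [("SUBJECT", "MIN", "こんにちは"), ("BOTH", "MAX", "Offre spéciale")]
path = os.path.join(tempfile.mkdtemp(), "MSExchange.UceContentFilter.xml")
write_cwl(path, entries)

# Round-trip check: the parser decodes the file using the declared encoding.
texts = [node.get("Text") for node in ET.parse(path).getroot()]
print(texts == [text for _, _, text in entries])
```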
When you're ready to use the CWL file, remember that, by default, the Exchange Server 2003 SP2 installer does not create it for you. You'll need to do this manually (see the Intelligent Message Filter Operations Guide mentioned earlier for details). And don't forget that the CWL file, MSExchange.UceContentFilter.xml, should always be in the working path of IMF, as shown in Figure 3.
Figure 3 The Working Path of IMF
If the CWL file is not located in the IMF's working directory, custom weighting will not work. If you have custom weighting deployed and are using IMF Updates, make sure that every time new spam definitions are installed, the CWL file is moved to the new IMF location. You'll have to do this manually. The Intelligent Message Filter Operations Guide provides detailed information on how to enable IMF Updates and what you need to do to ensure that the Custom Weight List continues to operate successfully.
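If you want to script that manual step, a small helper along these lines could carry the file over. This is a sketch under stated assumptions: `carry_over_cwl` is a hypothetical name, the directories here are temporary stand-ins for the old and new IMF working paths, and the helper copies the file rather than moving it so the original stays in place. Per the earlier note about loading entries at startup, the relevant services still need a restart afterward.

```python
import shutil
import tempfile
from pathlib import Path

CWL_NAME = "MSExchange.UceContentFilter.xml"  # file name given in the article


def carry_over_cwl(old_dir, new_dir):
    """Copy the CWL file from the previous IMF directory into the new one.

    Returns the destination path, or None if the old directory has no CWL file.
    """
    source = Path(old_dir) / CWL_NAME
    if not source.is_file():
        return None
    return Path(shutil.copy2(source, Path(new_dir) / CWL_NAME))


# Demonstration with temporary directories standing in for the IMF paths.
old_dir = Path(tempfile.mkdtemp())
new_dir = Path(tempfile.mkdtemp())
(old_dir / CWL_NAME).write_text("<CustomWeightEntries />", encoding="utf-8")

copied = carry_over_cwl(old_dir, new_dir)
print(copied is not None and copied.is_file())
```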
Cam Frenette is a Software Design Engineer in Test at Microsoft who focuses mainly on anti-spam technologies for the Technology Care & Safety team. He holds a B.Math from the University of Waterloo.
Alexander Nikolayev is the Program Manager at Microsoft in charge of server-side protocols, Transport Core, and anti-spam components for Exchange and Windows Servers. Alexander holds an MBA degree from the University of Mary. Read his posts on the Exchange team blog.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.