
8.18.1 The problem of spam

First, some background on spam.

If you have access to e-mail, you are familiar with spam (technically termed UCE, Unsolicited Commercial E-mail). Simply put, it exists because e-mail delivery is very cheap compared to paper mail, so only a very small percentage of people need to respond to a UCE to make it worthwhile to the advertiser. Ironically, one of the most common spams is the one offering a database of e-mail addresses for further spamming. Senders of spam are usually called spammers, but terms like vermin, scum, and morons are in common use as well.

Spam comes from a wide variety of sources, and it is simply impossible to dispose of all spam without discarding useful messages. A good example of this trade-off is the TMDA system, which requires senders unknown to you to confirm themselves as legitimate senders before their e-mail can reach you. Without getting into the technical side of TMDA, a clear downside is that e-mail from legitimate sources may be discarded if those sources can't or won't confirm themselves through the TMDA system. Another problem with TMDA is that it requires its users to have a basic understanding of e-mail delivery and processing.

The simplest approach to dealing with spam is direct filtering. If you get 200 spam messages per day from `random-address@vmadmin.com', you block `vmadmin.com'. If you get 200 messages about `VIAGRA', you discard all messages with `VIAGRA' in the message. This, unfortunately, is a great way to discard legitimate e-mail. For instance, the very informative and useful RISKS digest has been blocked by overzealous mail filters because it contained words that were common in spam messages. Nevertheless, in isolated cases, with great care, direct filtering of mail can be useful.
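
As a rough illustration of direct filtering and of how it misfires, here is a minimal Python sketch. The block lists come from the examples above; the function name `is_blocked' and the sample messages are hypothetical, and the sketch only handles simple single-part messages.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical block lists, built from the examples in the text.
BLOCKED_DOMAINS = {"vmadmin.com"}
BLOCKED_WORDS = {"viagra"}


def is_blocked(raw_message: str) -> bool:
    """Return True if the sender's domain or a blocked word matches."""
    msg = message_from_string(raw_message)

    # Block on the domain part of the From address.
    _, address = parseaddr(msg.get("From", ""))
    domain = address.rpartition("@")[2].lower()
    if domain in BLOCKED_DOMAINS:
        return True

    # Block on any listed word in the subject or body.
    # Multipart messages are not inspected in this sketch.
    body = "" if msg.is_multipart() else msg.get_payload()
    text = (msg.get("Subject", "") + " " + body).lower()
    return any(word in text for word in BLOCKED_WORDS)
```

A message from `random-address@vmadmin.com' is blocked, as intended. So is a digest from an unrelated sender that merely discusses `VIAGRA' spam, which is exactly the kind of false positive described above.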

Another approach to filtering e-mail is distributed spam processing; DCC, for instance, implements such a system. In essence, N systems around the world agree that a machine X in China, Ghana, or California is sending out spam e-mail, and these N systems enter X or the spam e-mail from X into a database. The criteria for spam detection vary--it may be the number of messages sent, the content of the messages, and so on. When a user of the distributed processing system wants to find out if a message is spam, he consults one of those N systems.
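
The idea of several systems agreeing on a message can be sketched as follows. This is a toy model, not DCC's actual protocol or checksum scheme: the class `SharedSpamDatabase', its methods, the SHA-256 checksum, and the agreement threshold are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


class SharedSpamDatabase:
    """Toy stand-in for a database shared by cooperating mail systems."""

    def __init__(self, threshold: int):
        # Number of distinct systems that must report a message
        # before it is considered spam (the "N" in the text).
        self.threshold = threshold
        self._reporters = defaultdict(set)  # checksum -> reporting systems

    @staticmethod
    def checksum(body: str) -> str:
        # Normalize case and whitespace so trivial variations match.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report(self, system: str, body: str) -> None:
        """Record that `system` saw this message and considers it spam."""
        self._reporters[self.checksum(body)].add(system)

    def is_spam(self, body: str) -> bool:
        """Return True once enough distinct systems have reported the message."""
        reporters = self._reporters.get(self.checksum(body), ())
        return len(reporters) >= self.threshold
```

With a threshold of 3, a message reported by three different systems is flagged for every later user who looks it up, while a message nobody has reported passes. A real system would also have to cope with spammers who vary each copy of a message, which an exact checksum like this one does not handle.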

Distributed spam processing works very well against spammers who send a large number of messages at once, but it requires the user to set up fairly complicated checks. There are commercial and free distributed spam processing systems. Distributed spam processing has its risks as well. For instance, legitimate e-mail senders have been accused of sending spam, and their web sites have been shut down for some time because of the incident.

The statistical approach to spam filtering is also popular. It is based on a statistical analysis of previous spam messages. Usually the analysis is a simple word frequency count, with perhaps pairs of words or 3-word combinations thrown into the mix. Statistical analysis of spam works very well in most cases, but it can still classify legitimate e-mail as spam. It takes time to run the analysis, the full message must be analyzed, and the user has to store the database of spam analyses.
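
A simple word frequency count of this kind might look like the sketch below. It assumes a naive Bayes style log-odds score with add-one smoothing over single words only; the class `WordFrequencyFilter' and its method names are hypothetical, and real filters differ in tokenization, smoothing, and how they combine word scores.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class WordFrequencyFilter:
    """Scores messages by comparing word frequencies in spam and non-spam."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label: str, text: str) -> None:
        """Add one message, labeled "spam" or "ham", to the stored counts."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_score(self, text: str) -> float:
        """Return log-odds that text is spam; positive means more spam-like."""
        vocabulary = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        totals = {label: sum(c.values()) for label, c in self.counts.items()}
        # Prior from the number of training messages, smoothed by one.
        score = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in tokenize(text):
            # Add-one smoothing; the extra 1 in the denominator reserves
            # mass for unseen words and avoids division by zero.
            p_spam = (self.counts["spam"][word] + 1) / (totals["spam"] + vocabulary + 1)
            p_ham = (self.counts["ham"][word] + 1) / (totals["ham"] + vocabulary + 1)
            score += math.log(p_spam / p_ham)
        return score
```

The two `Counter' objects are the database of spam analyses that the user has to store, and every word of the message is examined, which is why the full message must be available before a verdict is possible.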



This document was generated on October 20, 2003 using texi2html