Effectively fighting spam with DSPAM

Have you ever tasted spam? I mean the real spam, the kind that comes in vaguely rectangular boxes and is the color of ham. It was served to me once on Christmas Eve, and I can assure you that it's really not good! The spam we'll talk about here is different: more electronic, but just as disgusting.

Fighting spam is probably the most complex task of a postmaster. The techniques are numerous, and it is essential to combine several of them to get a relevant result. DSPAM is a statistical engine that analyzes a text and produces a probability based on its content. The internal mechanisms of DSPAM are a bit tricky to understand, which is why we recommend that you spend some time reading this documentation (especially chapter 3: configuration) while preparing your setup.

We will discuss the implementation of DSPAM with a Postfix SMTP server. We will rely on the source code for the installation, but you should probably check your distribution's repository for existing packages.

- Overview

DSPAM was originally written in 2003 by Jonathan Zdziarski, an American developer, following his research on the classification of spam. It was subsequently sold to Sensory Networks in 2006. DSPAM was initially released under the GPLv2 or later, and the license was changed to the AGPLv3 in 2009. Since 2009, DSPAM has been maintained by a small group of developers.

DSPAM is mainly written in C and requires a backend to store its data. Drivers for MySQL, PostgreSQL and SQLite are available; it is also possible to rely on a Hash Driver that creates data files on disk, for use without a relational database. This “Hash Driver” is the default option and used to be the fastest, but nowadays a PostgreSQL backend is preferred.

DSPAM produces statistical data for each user, because this approach has proved more efficient than a global ruleset shared by all users (see Technology below). Thus, each address of the domain (as <user>@<domain>, or as a plain flat name <user>) has tokenized data in one of the supported storage engines (MySQL, PostgreSQL, SQLite or Hash Driver), plus some additional data such as logs, statistics for the Web UI, quarantine, corpora, preferences, etc. in DSPAM's data directory. It is possible to share information between multiple users in the form of groups. There are several types of groups, which we will detail later.

- Technology

Back in 2002, Paul Graham, another U.S. developer, published “A Plan for Spam”, an article that changed the way people analyze spam. In the early days, antispam rules were largely based on criteria specific to spam, such as “under capitalized” or “contains 48 exclamation points”. Paul Graham worked on such rules, which performed pretty well, but a small residue of false positives, on the order of 1 to 2%, proved extremely difficult to eliminate.

His idea, which others had before him without the same influence, is to divide the content of an e-mail into tokens (a token being typically a word, a component of the headers, an HTML tag, etc.) and to calculate statistics on them using Bayes' theorem. Because the results were very satisfactory, Graham's technique has become the norm and is at the heart of DSPAM.

Graham has also shown that, to be truly effective, statistics should be produced for each user individually. Using a global basis for all users is less effective, since some words commonly used by one user may be considered spam by another (those of you who work for pharmaceutical companies certainly understand the principle).

- Installation

The DSPAM source code is available at http://dspam.sourceforge.net/, and the latest version available at this writing is dspam-3.9.1-RC1.

The archive contains the documentation, which is sorely lacking on the wiki. In fact, the quite detailed README files (and this document) form the core of what one has to know about DSPAM.

Before starting the compilation, let us define what we want to do:

DSPAM must interface with Postfix (as content-filter), and will therefore receive and reinsert the email via TCP sockets on localhost. DSPAM should run in daemon mode.
Each user is identified via their full email address.
Each user will have their own dictionary of tokens and associated statistics.

DSPAM does not really have external dependencies. A fresh install of Linux with a few tools installed (gcc, make, the backend libraries, …) is enough to build it. It is also important to run the daemon with limited privileges (e.g. as user 'dspam'). <note>The following configuration options fit the directory layout of a Debian system. You will have to adapt them to your own setup.</note>

$ su
# useradd -r -s /bin/false -U -d /var/spool/dspam dspam
# exit
$ ./configure --enable-daemon --enable-split-configuration --enable-syslog --enable-clamav --enable-preferences-extension --enable-domain-scale --with-dspam-home=/var/spool/dspam --with-dspam-home-owner=dspam --with-dspam-home-group=dspam --with-dspam-owner=dspam --with-dspam-group=dspam --with-storage-driver=hash_drv --prefix=/usr/local/dspam --sysconfdir=/etc/dspam --mandir=/usr/share/man --bindir=/usr/bin --sbindir=/usr/sbin --libdir=/usr/lib --includedir=/usr/include
$ make
$ su
# make install

The compilation options are detailed in './configure --help'. You can also enable debugging with the options '--enable-debug --enable-bnr-debug --enable-verbose-debug', but beware of the amount of logs produced (in /var/spool/dspam/log). If you want to compile DSPAM with support for PostgreSQL as a storage backend (instead of the hash driver), you can use the following configuration parameters:

$ ./configure --enable-daemon --enable-split-configuration --enable-syslog --enable-clamav --enable-preferences-extension --enable-domain-scale --with-dspam-home=/var/spool/dspam --with-dspam-home-owner=dspam --with-dspam-home-group=dspam --with-dspam-owner=dspam --with-dspam-group=dspam --with-storage-driver=pgsql_drv --with-pgsql-includes=/usr/include/postgresql/ --with-pgsql-libraries=/usr/lib/ --enable-virtual-users --prefix=/usr/local/dspam --sysconfdir=/etc/dspam --mandir=/usr/share/man --bindir=/usr/bin --sbindir=/usr/sbin --libdir=/usr/lib --includedir=/usr/include --enable-debug --enable-bnr-debug --enable-verbose-debug

<note>To build DSPAM with the PostgreSQL backend, you need the libpq libraries (packages libpq5 and libpq-dev in Debian Squeeze).</note> Of course, it is possible to compile DSPAM with support for more than one storage backend. To do so, you can use the following configuration parameters:

$ ./configure --enable-daemon --enable-split-configuration --enable-syslog --enable-clamav --enable-preferences-extension --enable-domain-scale --with-dspam-home=/var/spool/dspam --with-dspam-home-owner=dspam --with-dspam-home-group=dspam --with-dspam-owner=dspam --with-dspam-group=dspam --with-storage-driver=hash_drv,pgsql_drv --with-pgsql-includes=/usr/include/postgresql/ --with-pgsql-libraries=/usr/lib/ --enable-virtual-users --prefix=/usr/local/dspam --sysconfdir=/etc/dspam --mandir=/usr/share/man --bindir=/usr/bin --sbindir=/usr/sbin --libdir=/usr/lib --includedir=/usr/include --enable-debug --enable-bnr-debug --enable-verbose-debug

As you see, we have enabled the Hash driver and support for PostgreSQL. As soon as you use more than one storage backend, DSPAM will compile them in separate shared library files (libpgsql_drv.so for PostgreSQL and libhash_drv.so for the Hash driver) and allow you to choose inside dspam.conf which storage engine you would like to use.

- Configuring DSPAM

Before feeding DSPAM with the flow of emails from Postfix, we will configure and test it. The default dspam.conf configuration file comes with a large number of comments, but is not all that easy to interpret without a careful reading of the README and this documentation. dspam.conf is pre-filled with the parameters from the ./configure command. It contains the configuration options related to the chosen backend database, the home folder, and so on.

[…]
# 
# DSPAM Home: Specifies the base directory to be used for DSPAM storage 
# 
Home /var/spool/dspam 
[…]
# 
#StorageDriver /usr/lib/dspam/libhash_drv.so 
StorageDriver /usr/lib/dspam/libpgsql_drv.so 
[…]

If you selected only one storage driver at configure time, you don't need to specify it in dspam.conf: DSPAM knows which storage driver it was built with and will use it automatically.

More on the storage backends in the Storage driver section.

- Communication with the SMTP server

As said earlier, we want Postfix to communicate with DSPAM using TCP sockets. This setup requires two separate connections:

submit to dspam [SMTP server to DSPAM]:
- DSPAM will listen on the chosen TCP port and wait for connections coming from the SMTP server
response from DSPAM [DSPAM to SMTP server]:
- After analyzing the message, DSPAM sends it back to the SMTP server.

The submission socket will receive messages from Postfix. It listens on port TCP/10033 (arbitrary choice) and will speak LMTP (LMTP is a lightweight version of SMTP for intra-infrastructure mail transport).

ServerPort 10033
ServerQueueSize 32
ServerPID /var/run/dspam/dspam.pid
ServerMode auto
ServerParameters "--deliver=innocent,spam -d %u"
ServerIdent "localhost.localdomain"

The ServerParameters directive tells DSPAM to reinject both innocent emails and spam, as opposed to keeping spam in quarantine. While testing your setup, it is better to forward suspected spam to the user's mailbox and filter it using a mark on the Subject and/or in the headers, rather than quarantining it directly (note that it is possible to send a list of quarantined messages to your users daily).

DSPAM will then connect to Postfix and reinject the email after analysis. The following parameters connect back to Postfix on port TCP/10034 (Postfix needs to be configured as well; we'll discuss that later).

DeliveryHost 127.0.0.1
DeliveryPort 10034
DeliveryIdent localhost
DeliveryProto SMTP

Note that we speak SMTP here and not LMTP anymore.
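Once the daemon is running, a quick way to check that the submission socket answers is to read its greeting. The snippet below is a minimal sketch, not part of DSPAM: the helper names read_banner and looks_ready are hypothetical, the host and port are the ones chosen above, and the only protocol fact relied upon is that an LMTP or SMTP server greets a new connection with a line starting with the reply code 220.

```python
import socket


def read_banner(host="127.0.0.1", port=10033, timeout=5.0):
    """Connect to the given TCP port and return the first line sent by the server."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        data = conn.recv(1024)
    return data.decode("ascii", errors="replace").strip()


def looks_ready(banner):
    """An LMTP/SMTP server announces readiness with a 220 reply code."""
    return banner.startswith("220")


# Example use on the mail server itself (requires the DSPAM daemon to be running):
#   banner = read_banner()          # port 10033 is the ServerPort chosen above
#   print(banner, looks_ready(banner))
```

The same helper can be pointed at port 10034 to check the Postfix reinjection listener once it is configured.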

- Mode of learning

DSPAM starts its operations with empty dictionaries. This means that during the first weeks, DSPAM will learn a lot and filter little; that balance then progressively reverses.

It also means that it is the responsibility of the users to mark emails as spam (or as ham if DSPAM mistakenly marks a message as spam). It is left to the postmaster to give users a simple way to mark emails.

Several learning methods exist and are described in the man page of DSPAM. The one that interests us here is called 'teft' and forces DSPAM to learn about each email it processes, innocent and spam.

This mode is particularly intensive because it goes through every email and creates or updates all the tokens created from the message in the user dictionary. It's perfect for a new user who needs to quickly build up a dictionary, but may consume too much CPU in a busy environment. To use teft mode, set the following directive in dspam.conf:

TrainingMode teft

To overcome the performance problem, other learning modes exist. The 'tum' mode, for example, also learns from every message, but only for a limited period of time (called training), and afterward only updates the dictionary upon user interaction.

This parameter can be set for each user separately, as we will see in the preferences. The default mode is the one set in dspam.conf.

- Method of Detection

We are now at the core of DSPAM: the mode of detection. DSPAM is essentially a statistical analysis composed of 3 sub-parts:

Content tokenizing
Statistical algorithm
Calculation of probability

- Content Tokenizing

This is the module that breaks up the content, turns each piece into a token, and stores the token's unique hash in the user dictionary. Tokens can take several forms depending on the mode chosen. The most basic is to take the words one by one, so that every word is a new token.

But there are also more advanced modules, capable of taking into account different parts of each sentence. For those who like the Germanic prose, here's how a sentence will be cut by the different modules:

“Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht”

The character '+' means a combination of words, the character '#' denotes a word not taken into account.

WORD module: each word represents a token, so the sentence yields 13 tokens.

TOKEN: ‘Heute’ CRC: 6716984897371635712
TOKEN: ‘Abend’ CRC: 6670531613365895168
TOKEN: ‘war’ CRC: 4772677679197454336
TOKEN: ‘ich’ CRC: 6329956816985784320
[...]

CHAIN module: each word is combined with the word that follows it, so we have one token less, or 12 tokens.

TOKEN: ‘Heute+Abend’ CRC: 9299536586222406967
TOKEN: ‘Abend+war’ CRC: 5205867775940263209
TOKEN: ‘war+ich’ CRC: 6329956649787979024
TOKEN: ‘ich+mit’ CRC: 5158416839735805488
[...]

OSB module (Orthogonal Sparse Bigram): for each word, it creates a sliding window of 5 words and pairs that word with each of the other words in the window, one pair at a time. Each word is therefore associated with neighbors up to 4 positions away, and the skipped words in between are replaced by '#'.

TOKEN: ‘Heute+#+#+#+mit’ CRC: 2006452661602586241
TOKEN: ‘Abend+#+#+mit’ CRC: 5482652074219693289
TOKEN: ‘war+#+mit’ CRC: 15707817493435847227
TOKEN: ‘ich+mit’ CRC: 5158416839735805488
TOKEN: ‘Abend+#+#+#+meiner’ CRC: 8544044731047037263
TOKEN: ‘war+#+#+meiner’ CRC: 14722667808637756004
[...]

SBPH module (Sparse Binary Polynomial Hashing): similar to OSB, but more flexible. It also uses a sliding window of 5 words, but it considers every combination of the intermediate words in the window instead of always skipping them (as OSB does with '#').

TOKEN: ‘mit’ CRC: 5158417007107899392
TOKEN: ‘ich+mit’ CRC: 5158416839735805488
TOKEN: ‘war+#+mit’ CRC: 15707817493435847227
TOKEN: ‘war+ich+mit’ CRC: 6905336139605378569
TOKEN: ‘Abend+#+#+mit’ CRC: 5482652074219693289
TOKEN: ‘Abend+#+ich+mit’ CRC: 2006454003823721484

Obviously, the SBPH dictionary is a lot larger than the OSB one, which is in turn larger than those of CHAIN and WORD.
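The WORD, CHAIN and OSB splits described above can be sketched in a few lines of Python. This is only an illustration of the token shapes shown in the listings: it splits on whitespace, does not compute DSPAM's CRC values, and ignores headers and everything else the real tokenizers handle. The function names are hypothetical.

```python
def word_tokens(text):
    """WORD: every word is a token."""
    return text.split()


def chain_tokens(text):
    """CHAIN: every word is joined to the word that follows it."""
    words = text.split()
    return [f"{a}+{b}" for a, b in zip(words, words[1:])]


def osb_tokens(text, window=5):
    """OSB: pair each word with each of the (window - 1) words before it.

    Words skipped between the two members of a pair are written as '#'.
    """
    words = text.split()
    tokens = []
    for end in range(1, len(words)):
        for start in range(max(0, end - window + 1), end):
            skipped = end - start - 1
            tokens.append("+".join([words[start]] + ["#"] * skipped + [words[end]]))
    return tokens


sentence = "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht"
print(len(word_tokens(sentence)), "WORD tokens")
print(len(chain_tokens(sentence)), "CHAIN tokens")
print(len(osb_tokens(sentence)), "OSB tokens")
```

Run on the sample sentence, the OSB sketch produces the pairs ending in 'mit' and 'meiner' from the listing above, along with the pairs for the words before and after them.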

The advantage of tokenizers such as OSB and SBPH is that they can identify phrases that have never been seen before, by using combination (the '+' character) and skipping (the '#' character).

For example, take the token “Buy Viagra + # +”. On its own, this token can identify phrases such as:

Buy Viagra cheap
Buy Viagra good 
Buy Viagra Herbal
Buy Viagra exclusive 
Buy Viagra boosting 
Buy Viagra fresh
Buy Viagra qualitative

In the same situation, the WORD tokenizer can only identify individual words, not their combination. The CHAIN tokenizer would do almost nothing, unless you have all the combinations in the dictionary.

The SBPH mechanism also associates a weight with each token. A token of 5 words has a much greater weight than a token of only one word, according to the formula: weight = 2 ^ (2 * n), where n represents the number of neighboring words taken into account.

Still juggling with our Germanic prose, the weight table for the sentence “Heute Abend war ich mit” is as follows:

Token	Weight
Heute	1
Heute+Abend	4
Heute+#+war	4
Heute+Abend+war	16
Heute+#+#+ich	4
Heute+Abend+#+ich	16
Heute+#+war+ich	16
Heute+Abend+war+ich	64
Heute+#+#+#+mit	4
Heute+Abend+#+#+mit	16
Heute+#+war+#+mit	16
Heute+Abend+war+#+mit	64
Heute+#+#+ich+mit	16
Heute+Abend+#+ich+mit	64
Heute+#+war+ich+mit	64
Heute+Abend+war+ich+mit	256

The weight is then used to multiply the impact of the token when calculating probability.
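The weight table above can be reproduced with a short enumeration. The sketch below assumes, as the table suggests, that n is the number of real (non-'#') words in the token besides the first one, and that trailing skipped words are simply dropped. The function name sbph_weights is hypothetical and this is not DSPAM's own implementation.

```python
def sbph_weights(words):
    """List every SBPH token anchored on the first word, with weight 2^(2n)."""
    first, rest = words[0], words[1:]
    table = [(first, 1)]  # single word: n = 0, weight 1
    for last in range(len(rest)):
        inner = rest[:last]  # words between the first and the last one kept
        for mask in range(2 ** len(inner)):
            parts = [first]
            for i, word in enumerate(inner):
                # Bit i of the mask decides whether an inner word is kept or skipped.
                parts.append(word if (mask >> i) & 1 else "#")
            parts.append(rest[last])
            n = sum(1 for p in parts if p != "#") - 1
            table.append(("+".join(parts), 2 ** (2 * n)))
    return table


for token, weight in sbph_weights("Heute Abend war ich mit".split()):
    print(f"{token}\t{weight}")
```

For a window of 5 words this yields 16 tokens, with weights ranging from 1 for the single word to 256 for the full five-word token.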

For our configuration, we are going to use the OSB tokenizer. But this should not prevent you from experimenting with SBPH. However, unless you have very specific needs, CHAIN and WORD are probably too primitive for you.

The directive to be placed in dspam.conf is therefore:

Tokenizer osb

- Statistical Algorithm

With all these tokens, the challenge is to determine which ones influence the decision, and in what proportions. As we have said, DSPAM does not come with a pre-filled dictionary. It cannot tell immediately whether the token 'Abend+#+#+mit' is relevant to determining if the message is spam. But it learns: it adjusts the probabilities associated with each token and applies them to new emails.

Beyond the mere calculation of probability, the statistical algorithm defines which tokens are taken into account when calculating the spam probability, and how. DSPAM gives us the choice between several statistical algorithms:

naive: Naive-Bayesian (all tokens)
graham: Graham-Bayesian (“A Plan for Spam”)
burton: Burton-Bayesian (SpamProbe)
chi-square: Fisher-Robinson's Chi-Square algorithm

It is also possible to combine these algorithms. The graham + burton combination generally gives a good balance between false positives and false negatives. So that's what we will use, via the following directive:

Algorithm graham burton

But here is a little background to understand the mechanics behind this. The naive approach (of the algorithm of the same name) considers all the tokens composing a message. Each token is initialized with a statistically neutral value of 0.5, which means neither spam nor ham (a value closer to 1 means spam). The problem with this simple operation is that a spammer could include a long text containing common words (“and”, “hello”, …) along with one or two sentences containing the spam message. The algorithm processes all tokens at the same level, which allows the “support” text to reduce the likelihood of the final message being classified as spam.

This approach was discussed by Paul Graham (him again), who demonstrated that a better solution is possible. The Graham algorithm uses the following criteria:

Analyze the message and select the 15 most relevant tokens. The tokens selected are those with the highest deviation from the neutral probability 0.5.
Ignore tokens that have been seen fewer than 5 times in the past.
Use tokens only once. If a token is present twice in the message, the second occurrence will not be taken into account in the calculation.
When adding new tokens, define an initial probability of 0.4 instead of 0.5. This allows DSPAM to bias tokens toward innocence, until proven guilty.
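The four criteria above can be sketched as a small selection function. This is a reading of the list, not DSPAM's code: the dictionary is assumed to map each token to a (spam count, ham count) pair, a token absent from the dictionary is treated as new and given 0.4, and a known token with fewer than 5 hits is skipped. The names select_tokens, max_tokens, min_hits and new_token_p are hypothetical.

```python
def select_tokens(message_tokens, dictionary, max_tokens=15, min_hits=5, new_token_p=0.4):
    """Return the (token, probability) pairs that deviate most from 0.5."""
    seen = set()
    candidates = []
    for token in message_tokens:
        if token in seen:
            continue  # use each token only once per message
        seen.add(token)
        if token not in dictionary:
            candidates.append((token, new_token_p))  # new token, biased toward innocence
            continue
        spam, ham = dictionary[token]
        if spam + ham < min_hits:
            continue  # not seen often enough to be trusted
        candidates.append((token, spam / (spam + ham)))
    # Most "interesting" tokens first: largest distance from the neutral 0.5.
    candidates.sort(key=lambda pair: abs(pair[1] - 0.5), reverse=True)
    return candidates[:max_tokens]


counts = {"viagra": (231, 11), "buy": (157, 87), "hi": (25, 62), "rare": (1, 2)}
print(select_tokens(["hi", "buy", "viagra", "viagra", "rare", "unseen"], counts))
```

Raising max_tokens to 27 and dropping the duplicate check would move this sketch toward the Burton variant.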

Brian Burton uses a modified version of Graham's algorithm. The number of tokens considered is increased from 15 to 27, and if a relevant token appears several times, it is taken into account several times. This algorithm was first incorporated into the antispam software SpamProbe.

Still in the movement of the early 2000s, Gary Robinson published in a 2003 issue of the Linux Journal an improved version of the algorithm of Graham. His own version is at the heart of the SpamBayes project, another classification engine, and presents a more efficient process for tokens that appear infrequently. His approach is based on the statistical test Chi-Square, hence the name of the directive in DSPAM.

It's hard to say which of these algorithms is most appropriate. All achieve excellent results; feel free to experiment with all of them.

- Calculation of probabilities

So we have tokens, whose initial value is 0.4 using the Graham algorithm, and calculation parameters.

The last step is to calculate the probability that a message is spam or not. DSPAM uses what is called the pValue, and provides three algorithms to perform this calculation.

These statistical algorithms are:

markov: from the Russian mathematician Andrey Markov
robinson: from Gary Robinson
bcr: Bayesian Chain Rule, by Paul Graham

The standard algorithm, the one we will use in our example, is 'bcr' Bayesian Chain Rule, which is also the algorithm described by Paul Graham in his article “A Plan for Spam”. To use it, set the following parameter in dspam.conf:

Pvalue bcr

- Bayesian filtering

When talking about anti-spam technologies, the work of Thomas Bayes is consistently cited. Bayes' theorem makes it possible to calculate the probability of an event from its recorded occurrences in the past. In other words, it's what we call “experience”.

In our case, the formula to calculate the probability that a message is spam or not is:

P = S / (S + H)

With:

P: probability of the message being spam
S: product of the probabilities associated with each token composing the message: S = P(token-1) * P(token-2) * … * P(token-n)
H: product of the inverse probabilities of the tokens: H = (1 - P(token-1)) * (1 - P(token-2)) * … * (1 - P(token-n))

As we saw, when a token is added to the dictionary, it takes a default value of 0.4. Whenever DSPAM learns from a message containing the same token, it changes the value. Therefore, if the word “Viagra” is present in 10 messages, and 9 of them are spam, the probability associated with this token will be: P(Viagra) = 9 / (9 + 1) = 0.9

Consider the message “Hi! Buy Viagra.” We will apply a WORD tokenizer to this message (WORD is easier to handle for the example).

The first thing the tokenizer does is to remove the characters not taken into account, such as the exclamation point. The message is then “Hi Buy Viagra”

Each word is a token of its own. Let us imagine that the user's dictionary is in the following state:

Token	Spam count (s)	Ham count (h)	Probability p=s/(s+h)
Hi	25	62	0.29
Buy	157	87	0.64
Viagra	231	11	0.95

We can calculate the final pvalue of the message with the Bayes formula:

S = 0.29 * 0.64 * 0.95 = 0.176
H = (1-0.29) * (1-0.64) * (1-0.95) = 0.71 * 0.36 * 0.05 = 0.0128
Pvalue = S / (S + H) = 0.176 / (0.176 + 0.0128) = 0.93

So the final probability of the message being spam is 93%.
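The worked example above can be checked with a few lines of Python. This is a sketch of the formula P = S / (S + H) as given in this section, not DSPAM's implementation. The per-token probabilities are rounded to two decimals so that they match the table, and the function names are hypothetical. It needs Python 3.8 or later for math.prod.

```python
from math import prod


def token_probability(spam, ham):
    """p = s / (s + h), rounded to two decimals as in the table above."""
    return round(spam / (spam + ham), 2)


def bayes_pvalue(probabilities):
    """Combine per-token probabilities with P = S / (S + H)."""
    s = prod(probabilities)
    h = prod(1.0 - p for p in probabilities)
    return s / (s + h)


# Dictionary state from the example: token -> (spam count, ham count)
counts = {"Hi": (25, 62), "Buy": (157, 87), "Viagra": (231, 11)}
probabilities = [token_probability(s, h) for s, h in counts.values()]
print(probabilities)
print(round(bayes_pvalue(probabilities), 2))
```

Without the rounding step, the unrounded token probabilities give a slightly higher result, which shows how sensitive the product is to small changes in the individual values.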

- Markov

However, in the particular case where the tokenizer is SBPH, it is possible to use the weight of the tokens in the statistical calculation. That is what the 'markov' method does, but it only works if SBPH is enabled (only this tokenizer keeps a weight associated with the tokens). Markov is an improved version of bcr that can tell if a token is very specific (e.g. 5 of 5 words) and multiply its impact on the pValue (256 times greater than a token with a single word). In short, the weight of the token is used to multiply the impact of the probability associated with the token in the overall calculation.

- Confidence

DSPAM exports a confidence value of the result produced. The confidence is calculated based on the likelihood that the message is spam or not.

When the message is innocent, the closer the value is to zero, the more confident DSPAM is in its result: confidence is high. Thus, if the message is innocent, confidence equals (1 - probability). (Example: probability = 0.0184, confidence = 1 - 0.0184 = 0.9816).

When the message is spam, the closer the value is to 1, the more confident DSPAM is in its outcome. Thus, if the message is spam, confidence equals probability.
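The two rules above fit in a one-line function. This is a sketch of the description given here, with a hypothetical function name; how DSPAM decides that a message is spam in the first place is outside its scope and is passed in as a flag.

```python
def confidence(probability, is_spam):
    """Confidence equals the probability for spam, and (1 - probability) for ham."""
    return probability if is_spam else 1.0 - probability


print(round(confidence(0.0184, is_spam=False), 4))  # the innocent example above
print(round(confidence(0.93, is_spam=True), 4))     # a message classified as spam
```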

That's all for the mathematics. If you like this topic, do not hesitate to continue the discussion on the mailing list.

- Storage driver

- Using the Hash Driver

The default and most straightforward backend to configure is the Hash Driver. It maintains per-user dictionaries of tokens in the user's folder. To use this driver, set the following parameter at the beginning of dspam.conf.

StorageDriver /usr/lib/dspam/libhash_drv.so

<note>The examples in this document are based on the Hash Driver, but can easily be transferred to any other backend.</note> Tokens that DSPAM generates take up space, lots of space. When using the Hash driver, DSPAM can set the maximum size of the hash file that each user will use (its dictionary), and with a tokenizer such as OSB, you must ensure it will be large enough.

For example, a rather active account, receiving between 200 and 300 messages a day will generate roughly 2.5 million tokens in the space of two weeks. Obviously, this value will vary greatly depending on whether the messages contain the same tokens or not.

Setting 'HashRecMax' to over 6 million entries gives DSPAM some leeway, and we additionally allow it to grow up to 16 million entries (in increments of about 50,000), just in case.

HashRecMax 6291469 
HashAutoExtend on
HashMaxExtents 10000000
HashExtentSize 49157

It also means that the hash file of each user will be initialized with a size close to 100MB! This can be a problem on a system managing a large number of users.

- Using the Postgresql Driver

While the examples in this documentation are mostly based on the Hash Driver, you will probably choose to use another type of backend. DSPAM works extremely well with a PostgreSQL backend, and this is the recommended setup.

To use the Postgresql driver, set the following parameter at the beginning of dspam.conf:

StorageDriver /usr/lib/dspam/libpgsql_drv.so

Let's take a closer look at the configuration procedure.

- Granting access to the database

Assuming Postgresql (v8.4 in this example) is installed, the first step is to create a database for dspam, and grant access to the user 'dspam'.

On the command line, create an empty database named dspam:

# su postgres
postgres@server:/$ psql 
psql (8.4.7)
Type "help" for help.
postgres=# create role dspam login;
CREATE ROLE
postgres=# alter role dspam password '309dj20ejd903j';
ALTER ROLE
postgres=# create database dspam owner dspam;
CREATE DATABASE

Then edit /etc/postgresql/8.4/main/pg_hba.conf to grant access to user dspam:

ramiel:/home/julien/dspam# cd /etc/postgresql/8.4/main/
ramiel:/etc/postgresql/8.4/main# vim pg_hba.conf

[...]

# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
local   dspam       dspam                             password

You can then connect to PostgreSQL as user dspam (make sure user dspam has a login shell such as /bin/bash in /etc/passwd, otherwise 'su' won't work).

server:/# su dspam
dspam@server:/$ psql -d dspam -U dspam -h localhost
psql (8.4.7)
Type "help" for help.

dspam=> \du
            List of roles
 Role name | Attributes  | Member of 
-----------+-------------+-----------
 dspam     |             | {}
 postgres  | Superuser   | {}
           : Create role   
           : Create DB     

dspam=> \q
dspam@server:/$

You can try to create a test table to check that dspam user has the appropriate permissions:

dspam=> create table test (test int);
CREATE TABLE
dspam=> \d
       List of relations
 Schema | Name | Type  | Owner 
--------+------+-------+-------
 public | test | table | dspam
(1 row)

dspam=> drop table test;
DROP TABLE

- Create the database schema

The database schemas are located in the source code of DSPAM, in the folder src/tools.pgsql_drv.

However, before importing the schemas, we are going to create a procedural language in the DSPAM database. This is done using the command below: <note>The createlang command is a shell command; you need to execute it on the command line of your server, not at the PostgreSQL prompt.</note>

dspam@server:/$ createlang plpgsql dspam

Now go back to the Postgresql prompt and import the schemas pgsql_objects.sql and virtual_users.sql:

dspam=> \i /home/julien/dspam-3.9.1-RC1/src/tools.pgsql_drv/pgsql_objects.sql

[... tables and sequences creation output ...]

dspam=> \i /home/julien/dspam-3.9.1-RC1/src/tools.pgsql_drv/virtual_users.sql

<note>You might receive some warnings when the import scripts try to perform an 'analyze' and do not have the permissions to do so. You can safely ignore them.</note> The dspam database should then be in the following state (tables and indexes):

dspam=> \d
                 List of relations
 Schema |          Name          |   Type   | Owner 
--------+------------------------+----------+-------
 public | dspam_preferences      | table    | dspam
 public | dspam_signature_data   | table    | dspam
 public | dspam_stats            | table    | dspam
 public | dspam_token_data       | table    | dspam
 public | dspam_virtual_uids     | table    | dspam
 public | dspam_virtual_uids_seq | sequence | dspam
(6 rows)

dspam=> \di
                              List of relations
 Schema |             Name             | Type  | Owner |        Table         
--------+------------------------------+-------+-------+----------------------
 public | dspam_preferences_uid_key    | index | dspam | dspam_preferences
 public | dspam_signature_data_uid_key | index | dspam | dspam_signature_data
 public | dspam_stats_pkey             | index | dspam | dspam_stats
 public | dspam_token_data_uid_key     | index | dspam | dspam_token_data
 public | dspam_virtual_uids_pkey      | index | dspam | dspam_virtual_uids
 public | id_virtual_uids_01           | index | dspam | dspam_virtual_uids
 public | id_virtual_uids_02           | index | dspam | dspam_virtual_uids
(7 rows)

- Configure DSPAM to connect to Postgresql

The last step is simply to feed dspam.conf with the parameters to connect to the database. The configuration file comes with a Postgresql section where you can uncomment the configuration parameters and set the proper values:

# --- PostgreSQL ---

# For PgSQLServer you can use a TCP/IP address or a socket. If your socket is
# in /var/run/postgresql/.s.PGSQL.5432 specify just the path where the socket
# resides (without .s.PGSQL.5432).

PgSQLServer    127.0.0.1
PgSQLPort      5432
PgSQLUser      dspam
PgSQLPass      309dj20ejd903j
PgSQLDb        dspam

# If you're running DSPAM in client/server (daemon) mode, uncomment the
# setting below to override the default connection cache size (the number
# of connections the server pools between all clients).
#
PgSQLConnectionCache	3

Upon restart, DSPAM will create 3 connections to the Postgresql database.

dspam    19333     1  0 Apr01 ?        00:16:21 /usr/bin/dspam --daemon
postgres 19334  9851  0 Apr01 ?        00:03:12 postgres: dspam dspam 127.0.0.1(57278) idle                                                                                 
postgres 19337  9851  0 Apr01 ?        00:49:12 postgres: dspam dspam 127.0.0.1(57279) idle                                                                                 
postgres 19341  9851  0 Apr01 ?        00:06:11 postgres: dspam dspam 127.0.0.1(57280) idle

- Whitelist

DSPAM can observe the senders of messages for a given recipient, and create a whitelist of senders that have sent more than 20 emails, none of which has been flagged as spam. This quite handy feature does not need any configuration other than:

Feature whitelist

- The preferences

Each user can set their own preferences via the web interface (we will install it later). However, it is possible to set default values for those preferences.

For example, the default configuration does not deliver spam to users, but places it in quarantine. To change this behavior, we modify the following parameters in dspam.conf:

Preference "spamAction=tag"     # { quarantine | tag | deliver } -> default:quarantine
Preference "spamSubject=[SPAM]" # { string } -> default:[SPAM]
Preference "tagSpam=on"         # { on | off }
Preference "tagNonspam=off"     # { on | off }

There are many such preferences. You can let users modify them by setting:

AllowOverride spamAction
AllowOverride spamSubject
AllowOverride tagSpam
AllowOverride tagNonspam

It is also possible to move the DSPAM signature out of the message body and into the headers via this preference (by setting it to headers):

Preference "signatureLocation=message"  # { message | headers } -> default:message

However, this signature is quite handy for re-training messages, as we shall see later. So it's recommended to leave it until you have a better solution to retrain spam.

- Ignore some headers

Since DSPAM takes the entire email into account when calculating probabilities, it can be useful to ignore some specific headers. For example, the headers of another antispam tool, a DKIM signature, a date or a user agent might not be very useful in determining whether an email is spam.

The configuration example that follows includes an extensive list of headers that can be safely ignored. Feel free to expand or reduce this list.

- dspam.conf

Your final configuration file should look like the listing below. Many options are configurable, but for a quick overview, this configuration is functional. Note that we are using the Hash Driver. If you want to use another backend, you need to edit this configuration.

Home /var/spool/dspam/
StorageDriver /usr/lib/dspam/libhash_drv.so
TrustedDeliveryAgent "/usr/bin/procmail"
DeliveryHost            127.0.0.1
DeliveryPort            10034
DeliveryIdent           localhost
DeliveryProto           SMTP
OnFail error
Trust root
Trust dspam
TrainingMode teft
TestConditionalTraining on
Feature whitelist
Feature tb=5
Algorithm graham burton
Tokenizer osb
Pvalue bcr
WebStats on
Preference "trainingMode=TEFT"
Preference "spamAction=tag"
Preference "spamSubject=[SPAM]"
Preference "statisticalSedation=5"
Preference "enableBNR=on"
Preference "enableWhitelist=on"
Preference "signatureLocation=message"
Preference "tagSpam=on"
Preference "tagNonspam=off"
Preference "showFactors=on"
Preference "optIn=off"
Preference "optOut=off"
Preference "whitelistThreshold=20"
Preference "makeCorpus=off"
Preference "storeFragments=off"
Preference "localStore="
Preference "processorBias=on"
Preference "fallbackDomain=off"
Preference "trainPristine=off"
Preference "optOutClamAV=off"
Preference "ignoreRBLLookups=off"
Preference "RBLInoculate=off"
Preference "notifications=on"
AllowOverride enableBNR
AllowOverride enableWhitelist
AllowOverride fallbackDomain
AllowOverride ignoreGroups
AllowOverride ignoreRBLLookups
AllowOverride localStore
AllowOverride makeCorpus
AllowOverride optIn
AllowOverride optOut
AllowOverride optOutClamAV
AllowOverride processorBias
AllowOverride RBLInoculate
AllowOverride showFactors
AllowOverride signatureLocation
AllowOverride spamAction
AllowOverride spamSubject
AllowOverride statisticalSedation
AllowOverride storeFragments
AllowOverride tagNonspam
AllowOverride tagSpam
AllowOverride trainPristine
AllowOverride trainingMode
AllowOverride whitelistThreshold
AllowOverride dailyQuarantineSummary
AllowOverride notifications
HashRecMax              6291469
HashAutoExtend          on
HashMaxExtents          10000000
HashExtentSize          49157
HashPctIncrease         10
HashMaxSeek             10
HashConnectionCache     10
Notifications   on
IgnoreHeader Accept-Language
IgnoreHeader Approved
IgnoreHeader Archive
IgnoreHeader Authentication-Results
IgnoreHeader Cache-Post-Path
IgnoreHeader Cancel-Key
IgnoreHeader Cancel-Lock
IgnoreHeader Complaints-To
IgnoreHeader Content-Description
IgnoreHeader Content-Disposition
IgnoreHeader Content-ID
IgnoreHeader Content-Language
IgnoreHeader Content-Return
IgnoreHeader Content-Transfer-Encoding
IgnoreHeader Content-Type
IgnoreHeader DKIM-Signature
IgnoreHeader Date
IgnoreHeader Disposition-Notification-To
IgnoreHeader DomainKey-Signature
IgnoreHeader Importance
IgnoreHeader In-Reply-To
IgnoreHeader Injection-Info
IgnoreHeader Lines
IgnoreHeader List-Archive
IgnoreHeader List-Help
IgnoreHeader List-Id
IgnoreHeader List-Post
IgnoreHeader List-Subscribe
IgnoreHeader List-Unsubscribe
IgnoreHeader Message-ID
IgnoreHeader Message-Id
IgnoreHeader NNTP-Posting-Date
IgnoreHeader NNTP-Posting-Host
IgnoreHeader Newsgroups
IgnoreHeader OpenPGP
IgnoreHeader Organization
IgnoreHeader Originator
IgnoreHeader PGP-ID
IgnoreHeader Path
IgnoreHeader Received
IgnoreHeader Received-SPF
IgnoreHeader References
IgnoreHeader Reply-To
IgnoreHeader Resent-Date
IgnoreHeader Resent-From
IgnoreHeader Resent-Message-ID
IgnoreHeader Thread-Index
IgnoreHeader Thread-Topic
IgnoreHeader User-Agent
IgnoreHeader X--MailScanner-SpamCheck
IgnoreHeader X-AV-Scanned
IgnoreHeader X-AVAS-Spam-Level
IgnoreHeader X-AVAS-Spam-Score
IgnoreHeader X-AVAS-Spam-Status
IgnoreHeader X-AVAS-Spam-Symbols
IgnoreHeader X-AVAS-Virus-Status
IgnoreHeader X-AVK-Virus-Check
IgnoreHeader X-Abuse
IgnoreHeader X-Abuse-Contact
IgnoreHeader X-Abuse-Info
IgnoreHeader X-Abuse-Management
IgnoreHeader X-Abuse-To
IgnoreHeader X-Abuse-and-DMCA-Info
IgnoreHeader X-Accept-Language
IgnoreHeader X-Admission-MailScanner-SpamCheck
IgnoreHeader X-Admission-MailScanner-SpamScore
IgnoreHeader X-Amavis-Alert
IgnoreHeader X-Amavis-Hold
IgnoreHeader X-Amavis-Modified
IgnoreHeader X-Amavis-OS-Fingerprint
IgnoreHeader X-Amavis-PenPals
IgnoreHeader X-Amavis-PolicyBank
IgnoreHeader X-AntiVirus
IgnoreHeader X-Antispam
IgnoreHeader X-Antivirus
IgnoreHeader X-Antivirus-Scanner
IgnoreHeader X-Antivirus-Status
IgnoreHeader X-Archive
IgnoreHeader X-Assp-Spam-Prob
IgnoreHeader X-Attention
IgnoreHeader X-BTI-AntiSpam
IgnoreHeader X-Barracuda
IgnoreHeader X-Barracuda-Bayes
IgnoreHeader X-Barracuda-Spam-Flag
IgnoreHeader X-Barracuda-Spam-Report
IgnoreHeader X-Barracuda-Spam-Score
IgnoreHeader X-Barracuda-Spam-Status
IgnoreHeader X-Barracuda-Virus-Scanned
IgnoreHeader X-BeenThere
IgnoreHeader X-Bogosity
IgnoreHeader X-Brightmail-Tracker
IgnoreHeader X-CRM114-CacheID
IgnoreHeader X-CRM114-Status
IgnoreHeader X-CRM114-Version
IgnoreHeader X-CTASD-IP
IgnoreHeader X-CTASD-RefID
IgnoreHeader X-CTASD-Sender
IgnoreHeader X-Cache
IgnoreHeader X-ClamAntiVirus-Scanner
IgnoreHeader X-Comment-To
IgnoreHeader X-Comments
IgnoreHeader X-Complaints
IgnoreHeader X-Complaints-Info
IgnoreHeader X-Complaints-To
IgnoreHeader X-DKIM
IgnoreHeader X-DMCA-Complaints-To
IgnoreHeader X-DMCA-Notifications
IgnoreHeader X-Despammed-Tracer
IgnoreHeader X-ELTE-SpamCheck
IgnoreHeader X-ELTE-SpamCheck-Details
IgnoreHeader X-ELTE-SpamScore
IgnoreHeader X-ELTE-SpamVersion
IgnoreHeader X-ELTE-VirusStatus
IgnoreHeader X-Enigmail-Supports
IgnoreHeader X-Enigmail-Version
IgnoreHeader X-Evolution-Source
IgnoreHeader X-Extra-Info
IgnoreHeader X-FSFE-MailScanner
IgnoreHeader X-FSFE-MailScanner-From
IgnoreHeader X-Face
IgnoreHeader X-Fellowship-MailScanner
IgnoreHeader X-Fellowship-MailScanner-From
IgnoreHeader X-Forwarded
IgnoreHeader X-GMX-Antispam
IgnoreHeader X-GMX-Antivirus
IgnoreHeader X-GPG-Fingerprint
IgnoreHeader X-GPG-Key-ID
IgnoreHeader X-GPS-DegDec
IgnoreHeader X-GPS-MGRS
IgnoreHeader X-GWSPAM
IgnoreHeader X-Gateway
IgnoreHeader X-Greylist
IgnoreHeader X-HTMLM
IgnoreHeader X-HTMLM-Info
IgnoreHeader X-HTMLM-Score
IgnoreHeader X-HTTP-Posting-Host
IgnoreHeader X-HTTP-UserAgent
IgnoreHeader X-HTTP-Via
IgnoreHeader X-Headers-End
IgnoreHeader X-ID
IgnoreHeader X-IMAIL-SPAM-STATISTICS
IgnoreHeader X-IMAIL-SPAM-URL-DBL
IgnoreHeader X-IMAIL-SPAM-VALFROM
IgnoreHeader X-IMAIL-SPAM-VALHELO
IgnoreHeader X-IMAIL-SPAM-VALREVDNS
IgnoreHeader X-Info
IgnoreHeader X-IronPort-Anti-Spam-Filtered
IgnoreHeader X-IronPort-Anti-Spam-Result
IgnoreHeader X-KSV-Antispam
IgnoreHeader X-Kaspersky-Antivirus
IgnoreHeader X-MDAV-Processed
IgnoreHeader X-MDRemoteIP
IgnoreHeader X-MDaemon-Deliver-To
IgnoreHeader X-MIE-MailScanner-SpamCheck
IgnoreHeader X-MIMEOLE
IgnoreHeader X-MIMETrack
IgnoreHeader X-MMS-Spam-Filter-ID
IgnoreHeader X-MS-Exchange-Forest-RulesExecuted
IgnoreHeader X-MS-Exchange-Organization-Antispam-Report
IgnoreHeader X-MS-Exchange-Organization-AuthAs
IgnoreHeader X-MS-Exchange-Organization-AuthDomain
IgnoreHeader X-MS-Exchange-Organization-AuthMechanism
IgnoreHeader X-MS-Exchange-Organization-AuthSource
IgnoreHeader X-MS-Exchange-Organization-Journal-Report
IgnoreHeader X-MS-Exchange-Organization-Original-Scl
IgnoreHeader X-MS-Exchange-Organization-Original-Sender
IgnoreHeader X-MS-Exchange-Organization-OriginalArrivalTime
IgnoreHeader X-MS-Exchange-Organization-OriginalSize
IgnoreHeader X-MS-Exchange-Organization-PCL
IgnoreHeader X-MS-Exchange-Organization-Quarantine
IgnoreHeader X-MS-Exchange-Organization-SCL
IgnoreHeader X-MS-Exchange-Organization-SenderIdResult
IgnoreHeader X-MS-Has-Attach
IgnoreHeader X-MS-TNEF-Correlator
IgnoreHeader X-MSMail-Priority
IgnoreHeader X-MailScanner
IgnoreHeader X-MailScanner-Information
IgnoreHeader X-MailScanner-SpamCheck
IgnoreHeader X-Mailer
IgnoreHeader X-Mailman-Version
IgnoreHeader X-Mlf-Spam-Status
IgnoreHeader X-NAI-Spam-Checker-Version
IgnoreHeader X-NAI-Spam-Flag
IgnoreHeader X-NAI-Spam-Level
IgnoreHeader X-NAI-Spam-Report
IgnoreHeader X-NAI-Spam-Route
IgnoreHeader X-NAI-Spam-Rules
IgnoreHeader X-NAI-Spam-Score
IgnoreHeader X-NAI-Spam-Threshold
IgnoreHeader X-NEWT-spamscore
IgnoreHeader X-NNTP-Posting-Date
IgnoreHeader X-NNTP-Posting-Host
IgnoreHeader X-NetcoreISpam1-ECMScanner
IgnoreHeader X-NetcoreISpam1-ECMScanner-From
IgnoreHeader X-NetcoreISpam1-ECMScanner-Information
IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamCheck
IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamScore
IgnoreHeader X-Newsreader
IgnoreHeader X-Newsserver
IgnoreHeader X-No-Archive
IgnoreHeader X-No-Spam
IgnoreHeader X-OSBF-Lua-Score
IgnoreHeader X-OWM-SpamCheck
IgnoreHeader X-OWM-VirusCheck
IgnoreHeader X-Olypen-Virus
IgnoreHeader X-Orig-Path
IgnoreHeader X-OriginalArrivalTime
IgnoreHeader X-Originating-IP
IgnoreHeader X-PAA-AntiVirus
IgnoreHeader X-PAA-AntiVirus-Message
IgnoreHeader X-PGP-Fingerprint
IgnoreHeader X-PGP-Hash
IgnoreHeader X-PGP-ID
IgnoreHeader X-PGP-Key
IgnoreHeader X-PGP-Key-Fingerprint
IgnoreHeader X-PGP-KeyID
IgnoreHeader X-PGP-Sig
IgnoreHeader X-PIRONET-NDH-MailScanner-SpamCheck
IgnoreHeader X-PIRONET-NDH-MailScanner-SpamScore
IgnoreHeader X-PMX
IgnoreHeader X-PMX-Version
IgnoreHeader X-PN-SPAMFiltered
IgnoreHeader X-Posting-Agent
IgnoreHeader X-Posting-ID
IgnoreHeader X-Posting-IP
IgnoreHeader X-Priority
IgnoreHeader X-Proofpoint-Spam-Details
IgnoreHeader X-Qmail-Scanner-1.25st
IgnoreHeader X-Quarantine-ID
IgnoreHeader X-RAV-AntiVirus
IgnoreHeader X-RITmySpam
IgnoreHeader X-RITmySpam-IP
IgnoreHeader X-RITmySpam-Spam
IgnoreHeader X-Rc-Spam
IgnoreHeader X-Rc-Virus
IgnoreHeader X-Received-Date
IgnoreHeader X-RedHat-Spam-Score
IgnoreHeader X-RedHat-Spam-Warning
IgnoreHeader X-RegEx
IgnoreHeader X-RegEx-Score
IgnoreHeader X-Rocket-Spam
IgnoreHeader X-SA-GROUP
IgnoreHeader X-SA-RECEIPTSTATUS
IgnoreHeader X-STA-NotSpam
IgnoreHeader X-STA-Spam
IgnoreHeader X-Scam-grey
IgnoreHeader X-Scanned-By
IgnoreHeader X-Sender
IgnoreHeader X-SenderID
IgnoreHeader X-Sohu-Antivirus
IgnoreHeader X-Spam
IgnoreHeader X-Spam-ASN
IgnoreHeader X-Spam-Check
IgnoreHeader X-Spam-Checked-By
IgnoreHeader X-Spam-Checker
IgnoreHeader X-Spam-Checker-Version
IgnoreHeader X-Spam-Clean
IgnoreHeader X-Spam-DCC
IgnoreHeader X-Spam-Details
IgnoreHeader X-Spam-Filter
IgnoreHeader X-Spam-Filtered
IgnoreHeader X-Spam-Flag
IgnoreHeader X-Spam-Level
IgnoreHeader X-Spam-OrigSender
IgnoreHeader X-Spam-Pct
IgnoreHeader X-Spam-Prev-Subject
IgnoreHeader X-Spam-Processed
IgnoreHeader X-Spam-Pyzor
IgnoreHeader X-Spam-Rating
IgnoreHeader X-Spam-Report
IgnoreHeader X-Spam-Scanned
IgnoreHeader X-Spam-Score
IgnoreHeader X-Spam-Status
IgnoreHeader X-Spam-Tagged
IgnoreHeader X-Spam-Tests
IgnoreHeader X-Spam-Tests-Failed
IgnoreHeader X-Spam-Virus
IgnoreHeader X-Spam-Warning
IgnoreHeader X-Spam-detection-level
IgnoreHeader X-SpamAssassin-Clean
IgnoreHeader X-SpamAssassin-Warning
IgnoreHeader X-SpamBouncer
IgnoreHeader X-SpamCatcher-Score
IgnoreHeader X-SpamCop-Checked
IgnoreHeader X-SpamCop-Disposition
IgnoreHeader X-SpamCop-Whitelisted
IgnoreHeader X-SpamDetected
IgnoreHeader X-SpamInfo
IgnoreHeader X-SpamPal
IgnoreHeader X-SpamPal-Timeout
IgnoreHeader X-SpamReason
IgnoreHeader X-SpamScore
IgnoreHeader X-SpamTest-Categories
IgnoreHeader X-SpamTest-Info
IgnoreHeader X-SpamTest-Method
IgnoreHeader X-SpamTest-Status
IgnoreHeader X-SpamTest-Version
IgnoreHeader X-Spamadvice
IgnoreHeader X-Spamarrest-noauth
IgnoreHeader X-Spamarrest-speedcode
IgnoreHeader X-Spambayes-Classification
IgnoreHeader X-Spamcount
IgnoreHeader X-Spamsensitivity
IgnoreHeader X-TERRACE-SPAMMARK
IgnoreHeader X-TERRACE-SPAMRATE
IgnoreHeader X-TM-AS-Category-Info
IgnoreHeader X-TM-AS-MatchedID
IgnoreHeader X-TM-AS-Product-Ver
IgnoreHeader X-TM-AS-Result
IgnoreHeader X-TMWD-Spam-Summary
IgnoreHeader X-TNEFEvaluated
IgnoreHeader X-Text-Classification
IgnoreHeader X-Text-Classification-Data
IgnoreHeader X-Trace
IgnoreHeader X-UCD-Spam-Score
IgnoreHeader X-User-Agent
IgnoreHeader X-User-ID
IgnoreHeader X-User-System
IgnoreHeader X-Virus-Check
IgnoreHeader X-Virus-Checked
IgnoreHeader X-Virus-Checker-Version
IgnoreHeader X-Virus-Scan
IgnoreHeader X-Virus-Scanned
IgnoreHeader X-Virus-Scanner
IgnoreHeader X-Virus-Scanner-Result
IgnoreHeader X-Virus-Status
IgnoreHeader X-VirusChecked
IgnoreHeader X-Virusscan
IgnoreHeader X-WSS-ID
IgnoreHeader X-WinProxy-AntiVirus
IgnoreHeader X-WinProxy-AntiVirus-Message
IgnoreHeader X-Yandex-Forward
IgnoreHeader X-Yandex-Front
IgnoreHeader X-Yandex-Spam
IgnoreHeader X-Yandex-TimeMark
IgnoreHeader X-cid
IgnoreHeader X-iHateSpam-Checked
IgnoreHeader X-iHateSpam-Quarantined
IgnoreHeader X-policyd-weight
IgnoreHeader X-purgate
IgnoreHeader X-purgate-Ad
IgnoreHeader X-purgate-ID
IgnoreHeader X-sgxh1
IgnoreHeader X-to-viruscore
IgnoreHeader Xref
IgnoreHeader acceptlanguage
IgnoreHeader thread-index
IgnoreHeader x-uscspam
PurgeSignatures 14
PurgeNeutral    90
PurgeUnused     90
PurgeHapaxes    30
PurgeHits1S     15
PurgeHits1I     15
LocalMX 127.0.0.1
SystemLog       on
UserLog         on
Opt out
ServerHost              127.0.0.1
ServerPort              10033
ServerQueueSize 32
ServerPID               /var/run/dspam.pid
ServerMode auto
ServerParameters        "--deliver=innocent,spam -d %u"
ServerIdent             "localhost.localdomain"
ProcessorURLContext on
ProcessorBias on
StripRcptDomain off

- A quick test that will not work

To start the daemon as user 'dspam', the Debian standard method is to use start-stop-daemon, as follows:

# start-stop-daemon --start --chuid dspam --exec /usr/bin/dspam -- --daemon

<note>DSPAM automatically creates its PID file in /var/run. Make sure the user dspam can write to this directory.</note>

The process is now running and listening on a port:

UID        PID  PPID  C STIME TTY          TIME CMD
dspam    27473     1  0 03:26 pts/0    00:00:00 /usr/bin/dspam --daemon
Proto Recv-Q Send-Q Local Address    Foreign Address  State   User  Inode  PID/Program name
tcp        0      0 127.0.0.1:10033  0.0.0.0:*        LISTEN  999   18244  27473/dspam

The daemon responds on this port, so we can see what happens when we try to send an email:

$ nc localhost 10033
220 DSPAM LMTP 3.9.1 Ready
lhlo mail
250-localhost.localdomain
250-PIPELINING
250-ENHANCEDSTATUSCODES
250-8BITMIME
250 SIZE
mail from:<jp.troll@gmail.com>
250 2.1.0 OK
rcpt to:<jean-kevin@debian.lab>
250 2.1.5 OK
data
354 Enter mail, end with "." on a line by itself
From: Jean-Pierre Troll <jp.troll@gmail.com>
To: Jean-Kevin De La Motte <jean-kevin@debian.lab>
Subject: This is Not a Spam
might be a troll, but a spam... no!
.
421 4.3.0 <jean-kevin@debian.lab> Unable to connect to server
quit
221 2.0.0 OK

DSPAM accepts our message but seems to have trouble sending it back to the SMTP server, which is quite normal because we have not configured Postfix yet. However, let's take a look at the home directory of DSPAM. It has created a tree for the user in /var/spool/dspam/data/debian.lab/jean-kevin/:

# tree -s
.
|-- [         23]  data
|   `-- [         23]  debian.lab
|       `-- [        114]  jean-kevin
|           |-- [  100663544]  jean-kevin.css
|           |-- [          0]  jean-kevin.lock
|           |-- [         85]  jean-kevin.log
|           |-- [         40]  jean-kevin.sig
|           |   `-- [        384]  4c873bcd274731106759975.sig
|           `-- [         12]  jean-kevin.stats
|-- [          6]  log
`-- [        115]  system.log

Looking more closely at these files, we find 'jean-kevin.css', whose size of roughly 100 MB follows from the hash file size specified in dspam.conf.
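
The size reported by 'tree' can be related to 'HashRecMax' with a little arithmetic. The sketch below assumes 16 bytes per record and a 40-byte file header; these two values are assumptions chosen because they reproduce the observed size on this system, and they may differ on other architectures.

```python
HASH_REC_MAX = 6291469   # HashRecMax from dspam.conf
RECORD_SIZE = 16         # bytes per hash record (assumption)
HEADER_SIZE = 40         # bytes of file header (assumption)


def css_file_size(records, record_size=RECORD_SIZE, header_size=HEADER_SIZE):
    """Estimate the size in bytes of a freshly created .css hash file."""
    return header_size + records * record_size


size = css_file_size(HASH_REC_MAX)
print(size)                    # 100663544, the size shown by 'tree -s'
print(round(size / 2**20, 1))  # 96.0 (MiB)
```

Doubling 'HashRecMax' would therefore roughly double the disk space used per user, which matters when many users each get their own file.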

Then, the file 'jean-kevin.log' contains a log of processed messages. There, we find traces of our message:

# cat jean-kevin.log 
1283931466 I Jean-Pierre Troll <jp.troll@gmail.com> 4c873d4a274731062016872 This Is Not A Spam Delivered

Each row has six columns: a Unix timestamp, an inspection status (I for inspected, W for whitelisted, ...), the sender's name and email, an email identifier (the DSPAM signature), the message subject and finally the DSPAM status. In this example, the message is marked 'Delivered' because, despite DSPAM's inability to connect to Postfix, the message is considered valid.
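
For scripting purposes, such a log line can be split into its six columns. The sketch below assumes the columns are separated by tab characters in the actual file (the display above collapses them); the `LogEntry` type and the function name are hypothetical.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LogEntry(NamedTuple):
    when: datetime
    status: str
    sender: str
    signature: str
    subject: str
    result: str


def parse_log_line(line: str) -> LogEntry:
    """Split one tab-separated line of <user>.log into named fields."""
    fields = line.rstrip("\n").split("\t")
    ts, status, sender, signature, subject, result = fields[:6]
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return LogEntry(when, status, sender, signature, subject, result)


sample = "\t".join([
    "1283931466", "I", "Jean-Pierre Troll <jp.troll@gmail.com>",
    "4c873d4a274731062016872", "This Is Not A Spam", "Delivered",
])
entry = parse_log_line(sample)
```

Any extra trailing column is ignored by the slice, and a line with fewer than six columns raises a ValueError.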

When jean-kevin wants to re-train a message as spam or ham, DSPAM will take the signature, look for a file with this name in 'jean-kevin.sig', and update 'jean-kevin.css' with the tokens contained within the file.

This DSPAM configuration is functional. We now configure the communication with Postfix.

- Configure Postfix to connect with DSPAM

Postfix has a generic method for communicating with software such as DSPAM: treating it as a content filter. Postfix can very easily forward a received message to a content filter configured in the master.cf file.

On a blank Postfix configuration, you can add the content filter directly to the main smtp service (the one that listens on port TCP/25). To do so, modify /etc/postfix/master.cf like this:

# Postfix master process configuration file.  For details on the format
# of the file, see the master(5) manual page (command: "man 5 master").
#
# ==========================================================================
# service type  private unpriv  chroot  wakeup  maxproc command + args
#               (yes)   (yes)   (yes)   (never) (100)
# ==========================================================================
smtp      inet  n       -       -       -       -       smtpd
      -o content_filter=lmtp:127.0.0.1:10033

This suffices to have Postfix send incoming emails to DSPAM. However, to configure the way back, we have to open a new service in master.cf that listens on port TCP/10034. This time add the new lines at the end of master.cf.

127.0.0.1:10034 inet n  -       n       -        -      smtpd
      -o content_filter=
      -o receive_override_options=no_unknown_recipient_checks,no_header_body_checks
      -o smtpd_helo_restrictions=
      -o smtpd_client_restrictions=
      -o smtpd_sender_restrictions=
      -o smtpd_recipient_restrictions=permit_mynetworks,reject
      -o mynetworks=127.0.0.0/8
      -o smtpd_authorized_xforward_hosts=127.0.0.0/8

Reload postfix with 'postfix reload'. Receiving emails should now work. Repeat the previous test with netcat on localhost, and you should receive the message. To debug, check the following files (on Debian):

/var/log/mail.info contains all logs related to the processing of emails
/var/spool/dspam/system.log contains the overall activity of DSPAM (one line per message processed)
if you compiled with the debug mode, then set 'Debug *' in dspam.conf and you will get detailed logs in /var/spool/dspam/log/
and, in the worst case scenario, use 'tcpdump -s 16436 -SvnXi lo tcp and port 10033' (or 10034) to listen to communication between Postfix and DSPAM
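
The netcat test can also be scripted. The following sketch uses Python's standard smtplib to speak LMTP to the DSPAM daemon on 127.0.0.1:10033, as configured in this article. The function names are hypothetical, and the send function obviously needs the daemon to be running.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build the same small test message used in the netcat session."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "This is Not a Spam"
    msg.set_content("might be a troll, but a spam... no!")
    return msg


def send_via_lmtp(msg, host="127.0.0.1", port=10033):
    """Deliver the message to the DSPAM LMTP port (requires a live daemon)."""
    with smtplib.LMTP(host, port) as client:
        return client.send_message(msg)


message = build_test_message("jp.troll@gmail.com", "jean-kevin@debian.lab")
# send_via_lmtp(message)  # uncomment on a host where DSPAM is listening
```

If DSPAM cannot hand the message back to Postfix, the delivery attempt raises an smtplib exception carrying the same 4xx reply seen in the manual session.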

After the mail is passed from Postfix to DSPAM and back to Postfix, it should be received by the recipient as follows:

From jp.troll@gmail.com  Wed Sep  8 04:02:27 2010
Return-Path: <jp.troll@gmail.com>
X-Original-To: jean-kevin@debian.lab
Delivered-To: jean-kevin@debian.lab
From: Jean-Pierre Troll <jp.troll@gmail.com>
To: Jean-Kevin De La Motte <jean-kevin@debian.lab> 
Subject: This is Not a Spam
Date: Wed,  8 Sep 2010 03:56:49 -0400 (EDT)
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Wed Sep  8 04:02:27 2010
X-DSPAM-Confidence: 0.9899
X-DSPAM-Probability: 0.0000 
X-DSPAM-Signature: 4c874313289291828119542

might be a troll, but a spam... no!
!DSPAM:4c874313289291828119542!

The message is 'innocent', as described in 'X-DSPAM-Result'.

'X-DSPAM-Probability' tells us the probability that the message is spam (the closer the value is to 1, the higher the probability of the message being spam).

Finally, 'X-DSPAM-Confidence' indicates the confidence level of the filter.
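
These headers are easy to read programmatically, for instance in a delivery script. The sketch below uses Python's standard email module and assumes all four X-DSPAM headers are present; the function name is hypothetical.

```python
from email import message_from_string


def dspam_verdict(raw_message):
    """Extract the DSPAM result headers from a processed message."""
    msg = message_from_string(raw_message)
    return {
        "result": msg["X-DSPAM-Result"],
        "probability": float(msg["X-DSPAM-Probability"]),
        "confidence": float(msg["X-DSPAM-Confidence"]),
        "signature": msg["X-DSPAM-Signature"],
    }


raw = (
    "Subject: This is Not a Spam\n"
    "X-DSPAM-Result: Innocent\n"
    "X-DSPAM-Confidence: 0.9899\n"
    "X-DSPAM-Probability: 0.0000\n"
    "X-DSPAM-Signature: 4c874313289291828119542\n"
    "\n"
    "might be a troll, but a spam... no!\n"
)
verdict = dspam_verdict(raw)
```

A mail filter could, for example, file the message in a junk folder when `verdict["result"]` is not 'Innocent'.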

If you want more details on the tests performed and the tokens included, enable the preference 'showFactors = on'. It's wordy, but instructive. Each token is then listed with the associated statistical value.

X-DSPAM-Factors: 27,
To*La+#+#+kevin, 0.01000,
Subject*This+#+#+a, 0.01000,
To*La+#+<jean, 0.01000,
To*Kevin+#+La, 0.01000,
To*Motte+<jean, 0.01000
[...]

The message body also contains the signature, in the form "!DSPAM:<signature>!". As mentioned previously, it is preferable to keep the signature in the body of the message because it then survives when the message is forwarded for training. The other option would be to place the signature in the headers only, but these are usually removed by user agents when a message is forwarded.
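
Finding the signature in a message body is a simple pattern match. The sketch below assumes signatures consist of hexadecimal characters only, as in the examples of this article; other setups may produce a different format. The names are hypothetical.

```python
import re

# Assumption: the signature is made of hexadecimal characters only.
SIGNATURE_RE = re.compile(r"!DSPAM:([0-9a-f]+)!")


def extract_signature(body):
    """Return the DSPAM signature embedded in a message body, or None."""
    match = SIGNATURE_RE.search(body)
    return match.group(1) if match else None


body = "might be a troll, but a spam... no!\n!DSPAM:4c874313289291828119542!\n"
signature = extract_signature(body)
```

Note that the closing '!' is part of the pattern: a signature truncated by a mail client would not be recognized by this sketch.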

- Managing false positives and false negatives

Obviously, you shouldn't expect DSPAM to get everything perfect right away. It must be fed and learn.

First, it is possible to feed DSPAM via the command line using the signature of a message. We can report our previous email as spam via the command:

# dspam --source=error --class=spam --user jean-kevin@debian.lab --signature='4c874313289291828119542'

In the logs of the user, we will see that the message was 'retrained' based on the specified class: spam or innocent.

# tail -n 1 jean-kevin.log
1283934571 M <Not Specified> 4c874313289291828119542 <Not Specified> Retrained
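
When retraining has to be scripted, the command line shown above can be assembled programmatically. This sketch only builds the argument list, using the same options as the manual command; the function name is hypothetical, and running the result is left to the caller (for example with subprocess.run).

```python
import shlex


def retrain_command(user, signature, spam=True):
    """Build the dspam command line that retrains one message."""
    cls = "spam" if spam else "innocent"
    return [
        "dspam",
        "--source=error",
        f"--class={cls}",
        "--user", user,
        f"--signature={signature}",
    ]


argv = retrain_command("jean-kevin@debian.lab", "4c874313289291828119542")
print(shlex.join(argv))
```

Passing the arguments as a list avoids shell quoting problems with unusual user names.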

This is certainly not the best solution when you have 15,000 users. It is possible to do better by forwarding messages to {spam|notspam}-<user>@<domain> (e.g. spam-jean-kevin@debian.lab), or through the web interface. Both leave control in the user's hands.

- Learning in forward mode

Training in forward mode works as follows: when DSPAM inspects a message, it sets a signature in the message body. A user can then forward the same message to DSPAM indicating that it made the wrong decision.

For this to work, DSPAM needs two things: the message signature and the identity of the user.

The signature allows DSPAM to find the message in its history and record the change of state. Without it, DSPAM cannot identify the message. (Note: the history is kept for 14 days by default. This is set with 'PurgeSignatures'. More on that later.)

The identity of the user can be deduced automatically by DSPAM from the prefixed address {spam|notspam}-<email address>. Our user 'jean-kevin@debian.lab' will have two aliases, 'spam-jean-kevin@debian.lab' and 'notspam-jean-kevin@debian.lab', dedicated to retraining.

DSPAM can retrain automatically when an email is sent to those aliases. For each incoming message, it looks at the 'To:' header of the message, and if the address carries the {spam|notspam} prefix, it analyzes the content and triggers a 'retrain'. The configuration of this function is quite basic and goes through the following three directives in 'dspam.conf':

ParseToHeaders on
ChangeModeOnParse on
ChangeUserOnParse full

The directive 'ParseToHeaders' tells DSPAM to parse the 'To:' header of the received email to determine whether it contains the keywords {spam|notspam}. This 'To:' header is part of the message itself; do not confuse it with the SMTP command "rcpt to".

With parsing enabled, DSPAM can change the mode of learning according to the first part of the 'To:' field. This is controlled by 'ChangeModeOnParse', which will enable the class 'spam' if the address is 'spam-*' and class 'innocent' if the address is 'notspam-*'.

Finally, 'ChangeUserOnParse' tells DSPAM that the remaining portion of the email address contains the ID of the DSPAM user. Setting it to 'full' tells DSPAM to take the user and domain as the identifier, for example 'jean-kevin@debian.lab'.
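
Taken together, the three directives amount to the following logic, sketched here in Python. It mirrors the behavior described above (prefix selects the class, the remainder is the full user identifier) and is not DSPAM's actual code; the names are hypothetical.

```python
from typing import Optional, Tuple

# Prefix of the 'To:' address -> training class.
PREFIXES = {"spam-": "spam", "notspam-": "innocent"}


def parse_retrain_address(address: str) -> Optional[Tuple[str, str]]:
    """Return (class, user) for a retraining alias, or None otherwise."""
    address = address.strip().strip("<>")
    for prefix, cls in PREFIXES.items():
        if address.startswith(prefix):
            return cls, address[len(prefix):]
    return None


parsed = parse_retrain_address("<spam-jean-kevin@debian.lab>")
```

An ordinary address such as 'jean-kevin@debian.lab' yields None, meaning the message is processed normally rather than retrained.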

We must now tell Postfix that users 'spam-jean-kevin@debian.lab' and 'notspam-jean-kevin@debian.lab' exist. In a production environment, you'll certainly have a SQL database or LDAP directory to manage aliases, but in our case, we will simply create two entries in /etc/aliases. This will be sufficient for testing.

# vim /etc/aliases
[...]
spam-jean-kevin: jean-kevin
notspam-jean-kevin: jean-kevin
# postalias /etc/aliases

We can now reconnect to Postfix via netcat and inject the same email as above, but this time addressed to the spam alias. The other headers can be ignored; the important parts are the 'To:' header and the DSPAM signature at the end of the message body.

$ nc localhost 25
220 debian.lab ESMTP Postfix (Debian/GNU)
ehlo mail
250-debian.lab
250-PIPELINING
250-SIZE 10240000
250-VRFY
250-ETRN
250-STARTTLS
250-ENHANCEDSTATUSCODES
250-8BITMIME
250 DSN
mail from:<jean-kevin@debian.lab>
250 2.1.0 Ok
rcpt to:<spam-jean-kevin@debian.lab>
250 2.1.5 Ok
data
354 End data with <CR><LF>.<CR><LF>
From:  Jean-Kevin De La Motte <jean-kevin@debian.lab> 
To: <spam-jean-kevin@debian.lab>
Subject: This is Not a Spam
might be a troll, but a spam... no!

!DSPAM:4c874313289291828119542!
.
250 2.0.0 Ok: queued as 42509114E28
quit
221 2.0.0 Bye

Now looking at the DSPAM logs for jean-kevin, we see that the message was 'retrained'.

1283936972      M       Jean-Kevin De La Motte <jean-kevin@debian.lab> 4c874313289291828119542 This is Not a Spam      Retrained <20100908090905.42509114E28@debian.lab>

DSPAM will then forward the message back to Postfix, where it will be delivered to the user (the prefix is removed). Text is, however, added at the end of the message informing the user that the message has been retrained.

These information messages need to be created (they are not shipped with DSPAM): one for spam and one for ham. This can be done as follows:

# echo 'Scanned and tagged as SPAM by DSPAM on Debian.Lab' > /var/spool/dspam/txt/msgtag.spam

# echo 'Scanned and tagged as HAM by DSPAM on Debian.Lab' > /var/spool/dspam/txt/msgtag.nonspam

- Training from the web interface

Using the web interface is necessary if the messages detected as spam are not sent to users but quarantined (preference "spamAction=quarantine"). Users must regularly check the interface to verify that no false positive sits in quarantine. Users can also use the interface to mark emails as spam or ham.

DSPAM sources provide a directory named 'webui'. This is a set of CGI scripts to control DSPAM through a web interface. No surprise, it's written in Perl. To run it, you have to configure your web server (Apache, lighttpd, nginx, ...) to run Perl CGI scripts.

<note>Documentation already exists for Apache and lighttpd, so we chose to describe the configuration for Nginx.</note>

In fact, it's more complicated than that, because the CGI should be able to determine the identity of the user who connects. So, Nginx, in our case, will have to authenticate the user and forward their identity to DSPAM.

Nginx does not know how to run external scripts. The only thing it can do is send queries to a FastCGI socket. So we will need another program that stands between Nginx and the CGI scripts to execute them. This program is called 'fcgiwrap'.

We will also need some Perl packages required by the DSPAM CGI (for parsing HTML, displaying graphs with GD, etc.).

Install the following packages:

# aptitude install nginx fcgiwrap libcgi-pm-perl libhtml-parser-perl libgd-graph-perl libgd-graph3d-perl

The DSPAM interface needs permissions to access '/var/spool/dspam' for both reading and writing, since it will change preferences and state of the dictionaries. Since fcgiwrap will be the process executing the Perl scripts, we will launch it as user/group 'dspam'.

We will also give world write access to the fcgiwrap socket so nginx can write to it.

<note>This is a test configuration, as the proverb says “Do not do this at home.”</note>

# vim /etc/init.d/fcgiwrap
[...]
FCGI_USER="dspam"
FCGI_GROUP="dspam"
[...]
# /etc/init.d/fcgiwrap restart
# chmod o+w /var/run/fcgiwrap.socket

Nginx configuration is then easy: it just forwards CGI requests to fcgiwrap. It must also authenticate users so that DSPAM can determine the identity of the visitor. This identity is stored in the variable REMOTE_USER, set by nginx and provided to fcgiwrap.

# vim /etc/nginx/sites-available/default
[...]
	location /dspam/cgi-bin {
		auth_basic "DSPAM";
		auth_basic_user_file /var/www/dspam/passwords;
		include /etc/nginx/fastcgi_params;
		index dspam.cgi;
		fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
		fastcgi_param REMOTE_USER $remote_user;
		if ($uri ~ "\.cgi$") {
			fastcgi_pass unix:/var/run/fcgiwrap.socket;
		}
	}
# /etc/init.d/nginx restart

You must then create the file '/var/www/dspam/passwords' with the htpasswd tool. This file should contain one line per user, where the username is the user's complete email address.

# htpasswd -c /var/www/dspam/passwords jean-kevin@debian.lab 
New password:
Re-type new password:
Adding password for user jean-kevin@debian.lab
# cat /var/www/dspam/passwords
jean-kevin@debian.lab:H2CigqsDz1U4E
# chown dspam:www-data /var/www/dspam/passwords 
# chmod o-rwx /var/www/dspam/passwords

The infrastructure is ready. Copy the files from the 'webui' directory of the DSPAM sources directly into the 'document root' of nginx.

# cp -r ~/dspam-3.9.1-RC1/webui/* /var/www/dspam/
# chown dspam:www-data /var/www/dspam -R

At this stage, we still have some configuration to do. The script '/var/www/dspam/cgi-bin/configure.pl' contains the configuration the web interface uses to locate the DSPAM directories. Check the values of $CONFIG{'DSPAM_HOME'}, $CONFIG{'DSPAM_BIN'}, etc., so that they correspond to our environment.

$CONFIG{'DSPAM_HOME'}   = "/var/spool/dspam";
$CONFIG{'DSPAM_BIN'}    = "/usr/bin";
[...]
$CONFIG{'WEB_ROOT'}     = "/dspam/htdocs/";
[...]
$CONFIG{'LOCAL_DOMAIN'} = "debian.lab";

With all this, we should be able to open the page http://myserver/dspam/cgi-bin/. Log in with user jean-kevin@debian.lab, and access the DSPAM interface. It allows, among other things, retraining of messages already processed from the 'History' tab. You can also change the preferences, etc.

The interface provides an administration section. To have access to it, you need to declare an admin in the file '/var/www/dspam/cgi-bin/admins'.

# echo 'jean-kevin@debian.lab' >> /var/www/dspam/cgi-bin/admins

We can then access the URL http://myserver/dspam/cgi-bin/admin.cgi and admire the beautiful graphics work, or change the default options.

Below are a few screenshots from the web interface:

The home page of the interface displays performance statistics for the current user.

The message history lists all inspected messages. It can be used to retrain a message.

This is a 14-day graph for a user receiving a lot of spam.

- User management

When using virtual UIDs, which is the most common method to manage large groups of users, the users are stored in the database. With PostgreSQL, you can browse the "dspam" database (as created previously) using standard SQL commands:

postgres@server:/$ psql
psql (8.4.8)
Type "help" for help.

postgres=# \c dspam
psql (8.4.8)
You are now connected to database "dspam".

dspam=# \d
                  List of relations
 Schema |          Name          |   Type   | Owner 
--------+------------------------+----------+-------
 public | dspam_preferences      | table    | dspam
 public | dspam_signature_data   | table    | dspam
 public | dspam_stats            | table    | dspam
 public | dspam_token_data       | table    | dspam
 public | dspam_virtual_uids     | table    | dspam
 public | dspam_virtual_uids_seq | sequence | dspam
(6 rows)


dspam=# \d dspam_virtual_uids
                               Table "public.dspam_virtual_uids"
  Column  |          Type          |                          Modifiers                           
----------+------------------------+--------------------------------------------------------------
 uid      | integer                | not null default nextval('dspam_virtual_uids_seq'::regclass)
 username | character varying(128) | 
Indexes:
    "dspam_virtual_uids_pkey" PRIMARY KEY, btree (uid)
    "id_virtual_uids_01" UNIQUE, btree (username)
    "id_virtual_uids_02" UNIQUE, btree (uid)

As you can see in the description of the database above, there is a table called dspam_virtual_uids that will contain a simple mapping of the username with a generated id.

Here is how you can obtain the UID of a specific user.

dspam=# select * from dspam_virtual_uids where username = 'jean-kevin@debian.lab';
 uid |       username        
-----+-----------------------
   1 | jean-kevin@debian.lab
(1 row)

- Deleting a user

If, for any reason, you would like to remove a user from the database, you need to obtain their UID and then remove all rows referencing this UID from the other tables.

If you look at the description of the table dspam_token_data, for example, you will see that each token is attached to the UID of the user it belongs to, which makes it easy to identify and delete. The other tables, such as dspam_preferences described here, carry the same uid column.

dspam=# \d dspam_preferences
        Table "public.dspam_preferences"
   Column   |          Type          | Modifiers 
------------+------------------------+-----------
 uid        | integer                | 
 preference | character varying(128) | 
 value      | character varying(128) | 
Indexes:
    "dspam_preferences_uid_key" UNIQUE, btree (uid, preference)


dspam=# delete from dspam_token_data where uid in (select uid from dspam_virtual_uids where username = 'jean-kevin@debian.lab');
DELETE 2187

Repeat this step for all tables and delete the user from dspam_virtual_uids at the end.

Also, make sure to delete the user's data folder from /var/spool/dspam/data/<domain>/<user> if you really want to remove all traces of the user.
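The "repeat this step for all tables" instruction can be sketched as a small generator that prints the statements instead of running them. This is only an illustration: the helper name dspam_delete_sql is hypothetical, it assumes that every data table listed by \d carries a uid column (as dspam_preferences and dspam_token_data do), and wrapping the statements in a transaction is a design choice of this sketch, not something DSPAM requires.

```shell
# Print the SQL needed to remove one DSPAM user from every table.
# Assumption: each data table has a "uid" column referencing
# dspam_virtual_uids, as in the table descriptions shown earlier.
dspam_delete_sql() {
    # Double any single quote so the username is a safe SQL string literal.
    user=$(printf '%s' "$1" | sed "s/'/''/g")
    echo "BEGIN;"
    for table in dspam_token_data dspam_signature_data dspam_stats dspam_preferences; do
        echo "DELETE FROM $table WHERE uid IN (SELECT uid FROM dspam_virtual_uids WHERE username = '$user');"
    done
    # The mapping row goes last: the subselects above depend on it.
    echo "DELETE FROM dspam_virtual_uids WHERE username = '$user';"
    echo "COMMIT;"
}

dspam_delete_sql 'jean-kevin@debian.lab'
```

Once the printed statements have been reviewed, they can be piped to the database, for example with `dspam_delete_sql 'jean-kevin@debian.lab' | psql dspam` run as the postgres user.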

- Group management and inoculation

While DSPAM analysis focuses on the user, it also enables groups of users to share data. By properly defining these groups, for example by activity, we can expect the content of their messages to be similar and therefore their token statistics to be similar. Sharing this information helps to accelerate DSPAM's training.

We saw that each user has his own dictionary of tokens in the file '<user>.css' (if using the Hash driver). This dictionary contains the tokens and associated statistics produced by the user.

DSPAM can share these tokens and statistics in different ways (or types of groups).

Shared: group members share the same dictionary, but each member retains his own quarantine directory. Problem: If a user's behavior is different from the rest of the group, it will disrupt the whole group, starting with himself.

Shared,Managed: same as shared group, but with a single quarantine mailbox.

Classification: share the individual dictionaries. If the user's dictionary does not allow a user to determine if a message is spam or innocent (confidence <0.65 or dictionary containing less than 1000 innocent messages and 250 spam), the other group member dictionaries are used. The analysis stops when a class dictionary classifies the message. In practice, this group is a chain containing all users in the group, which is traversed linearly until a decision is reached. Each user should be listed as a member of the group for querying dictionaries of other members.

Global: an alternative to classification groups. This group type is used to define a Global classification group in which all members of the system can query the dictionaries of the members listed. If a user's dictionary is not sufficient to classify a message, DSPAM asks the opinion of the members of Global by traversing the chain of members until a formal decision is reached. In short, Global is a sort of “council of wise men” that each user can query.

Merged: Merged assembles the user dictionary and the dictionary referenced to form one new dictionary and use it for analysis. New user specific tokens are always written back to the user dictionary. Training the Merged group alone (without the members) will influence the accuracy for each Merged group member.

Inoculation: This last group is somewhat unusual. It follows the principle of vaccination: a user who has received an undetected spam can inform all other users that this message is spam. Each user keeps his own dictionary, which he uses exclusively for analysis, but users can exchange tokens with each other. The first user is infected, the others are vaccinated. This principle of inoculation also allows users to define a bin, a honeypot for spam, which receives only spam and thus accelerates learning for everyone. This second mode is called 'external inoculation'.

- Setting up a group

Setting up a group is rather simple. The hardest part is to determine the correct group type for your environment, and then to monitor its behavior over several weeks.

In our example we will implement a group of type “classification”. Since this type allows each user to retain his personal dictionary, it has little impact on the infrastructure (in case you want to delete the group).

DSPAM reads the group configuration from a text file located in its 'Home Directory'. For us, that would be '/var/spool/dspam/group'. The file contains one line per group in the form <groupName>:<type>:<user 1>,…,<user n>

We will create a group of type 'classification' including users jean-kevin, julien and root, and we will call this group 'class-debian-lab'.

# echo "class-debian-lab:classification:jean-kevin@debian.lab,julien@debian.lab,root@debian.lab" > /var/spool/dspam/group
# chown dspam:dspam /var/spool/dspam/group
# kill `pidof dspam`
# start-stop-daemon --start --chuid dspam --exec /usr/bin/dspam -- --daemon
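To avoid typing mistakes in the <groupName>:<type>:<user 1>,…,<user n> format, the line can be assembled by a small helper. This is a minimal sketch: group_line is a hypothetical function written for this example, not a tool shipped with DSPAM.

```shell
# Build one line of the DSPAM group file:
#   <groupName>:<type>:<user 1>,...,<user n>
# group_line is a hypothetical helper, not part of DSPAM.
group_line() {
    gname=$1
    gtype=$2
    shift 2
    members=""
    for m in "$@"; do
        # Join members with commas, without surrounding spaces.
        members="${members:+$members,}$m"
    done
    printf '%s:%s:%s\n' "$gname" "$gtype" "$members"
}

group_line class-debian-lab classification \
    jean-kevin@debian.lab julien@debian.lab root@debian.lab
```

Redirecting the output with '>' replaces the whole group file, as in the echo command above. Since the file holds one line per group, '>>' would be the way to append a further group to an existing file.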

By enabling the debug trace in dspam.conf (Directive 'Debug *' when dspam is compiled with debug mode), we can see the group being used in the file '/var/spool/dspam/log/dspam.debug'.

10150: [09/08/2010 14:43:19] user jean-kevin@debian.lab is member of classification group class-debian-lab
10150: [09/08/2010 14:43:19] adding user julien@debian.lab to classification network group
10150: [09/08/2010 14:43:19] adding user root@debian.lab to classification network group
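When the debug file grows, the members DSPAM actually added to the classification network can be pulled out with a short filter. This sketch assumes the message wording shown in the trace above. The function name is made up for the example.

```shell
# Print the users that DSPAM reports adding to a classification group.
# Reads debug log lines on stdin. The pattern is taken from the trace
# shown above and may differ between DSPAM versions.
group_members_from_log() {
    sed -n 's/.*adding user \(.*\) to classification network group.*/\1/p'
}

group_members_from_log <<'EOF'
10150: [09/08/2010 14:43:19] adding user julien@debian.lab to classification network group
10150: [09/08/2010 14:43:19] adding user root@debian.lab to classification network group
EOF
```

In practice the input would be the debug file itself, for example `group_members_from_log < /var/spool/dspam/log/dspam.debug`.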

- Maintenance

- dspam_logrotate

This program provides log rotation for both system and DSPAM user logs (those stored in /var/spool/dspam).

The command can be run for a specific user or for all users in the dspam directory. In our case, we want to achieve rotation for everyone when logs exceed 60 days. We will therefore put the following in crontab:

30 5    * * *   dspam   /usr/bin/dspam_logrotate -a 60 -d /var/spool/dspam/data/

- Hash Driver cleanup

DSPAM's hash driver stores a large amount of information, be it for tokens or history. It therefore provides a tool to do some cleaning. 'dspam_clean' will clean up the dictionaries using the parameters defined in dspam.conf.

- dspam_clean

The default configuration for 'dspam_clean' is to retain all signatures for 14 days and to clean rarely used tokens after 15, 30 or 90 days depending on their type. Again, the configuration file provided with the sources is rather well commented.

#
# Purge configuration: Set dspam_clean purge default options, if not otherwise 
# specified on the commandline
#
PurgeSignatures 14      # Stale signatures
PurgeNeutral    90      # Tokens with neutralish probabilities
PurgeUnused     90      # Unused tokens
PurgeHapaxes    30      # Tokens with less than 5 hits (hapaxes)
PurgeHits1S     15      # Tokens with only 1 spam hit
PurgeHits1I     15      # Tokens with only 1 innocent hit
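To check which purge values are really in effect, the directives can be extracted from the configuration file. The filter below is a sketch that assumes the "Directive value # comment" layout shown above. purge_settings is a hypothetical name, and this is not how dspam_clean itself reads its configuration.

```shell
# Print "Directive=days" for every Purge* directive read on stdin.
# Commented-out lines are ignored because their first field is "#".
purge_settings() {
    awk '$1 ~ /^Purge/ && $2 ~ /^[0-9]+$/ { print $1 "=" $2 }'
}

purge_settings <<'EOF'
# Purge configuration: Set dspam_clean purge default options
PurgeSignatures 14      # Stale signatures
PurgeNeutral    90      # Tokens with neutralish probabilities
PurgeHapaxes    30      # Tokens with less than 5 hits (hapaxes)
EOF
```

Against a real installation the input would be dspam.conf, for instance `purge_settings < /etc/dspam/dspam.conf` if that is where your configuration lives.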

To achieve a periodic purge, schedule dspam_clean to run as the dspam user. For example, with an entry in /etc/crontab that starts every day at 5:00:

0  5    * * *   dspam   /usr/bin/dspam_clean -s -p -u

This command purges three types of information: stale signatures (-s), tokens with neutral probabilities (-p) and unused tokens (-u).

- Databases cleanup

If you are using a database backend, and not the Hash driver, you need an external script to connect to the database and clean the tokens.

The script contrib/dspam_maintenance/dspam_maintenance.sh is written to connect to any of the 3 types of database backend DSPAM supports, and perform that cleanup for you.

'dspam_maintenance.sh' will read the Purge Configuration (as described above) from dspam.conf, connect to the backend and perform the cleanup. It requires an external set of queries for your database. These queries are database specific and can be found in src/tools.<backend>

tools.mysql_drv
├── purge-4.1.sql
└── purge.sql

tools.pgsql_drv
├── purge-pe.sql
└── purge.sql

tools.sqlite_drv
├── purge-2.sql
└── purge-3.sql

Copy the proper set of queries to /var/spool/dspam and give ownership to the user 'dspam'.

# cp -r tools.pgsql_drv/ /var/spool/dspam/
# chown -R dspam:dspam /var/spool/dspam/tools.pgsql_drv/

Now, copy the 'dspam_maintenance.sh' script to /etc/cron.daily/ (or cron.weekly if you prefer), and configure it as follows:

# cp contrib/dspam_maintenance/dspam_maintenance.sh /etc/cron.daily/dspam_maintenance
# chmod +x /etc/cron.daily/dspam_maintenance 
# vim /etc/cron.daily/dspam_maintenance 

[...]

DSPAM_CONFIGDIR="/etc/dspam"
DSPAM_HOMEDIR="/var/spool/dspam/"
DSPAM_PURGE_SCRIPT_DIR="/var/spool/dspam/tools.pgsql_drv/"
DSPAM_BIN_DIR="/usr/bin"
MYSQL_BIN_DIR="/usr/bin"
PGSQL_BIN_DIR="/usr/bin"
SQLITE_BIN_DIR="/usr/bin"
SQLITE3_BIN_DIR="/usr/bin"


[...]

<note>Remember that scripts in /etc/cron.* must not contain dots in their names (e.g. no dspam_maintenance.sh, use dspam_maintenance).</note>
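The naming rule can be checked before dropping a script into /etc/cron.daily. The sketch below assumes a Debian-style cron that runs these directories through run-parts in its default mode, which only accepts names made of letters, digits, underscores and hyphens. cron_name_ok is a hypothetical helper.

```shell
# Succeed when a file name would be accepted by run-parts in its default
# mode: ASCII letters, digits, underscores and hyphens only.
cron_name_ok() {
    case $1 in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}

for name in dspam_maintenance dspam_maintenance.sh; do
    if cron_name_ok "$name"; then
        echo "$name: ok"
    else
        echo "$name: would be skipped"
    fi
done
```

On such systems, `run-parts --test /etc/cron.daily` lists the scripts that would actually be executed, which is a quick way to confirm that dspam_maintenance is picked up.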

- Test Procedure

To conclude this section, we will demonstrate the test procedure specified in the README file of DSPAM. This procedure allows us to not only verify that the configuration is operational, but also to familiarize ourselves with the internal controls of DSPAM.

Step 1: Create a blank user

# useradd -d /home/michel-rene -U -m michel-rene
# passwd michel-rene

Step 2: Send an email to our new user

# nc localhost 25 << EOF
ehlo mail
mail from:<jp.troll@gmail.com>
rcpt to:<michel-rene@debian.lab>
data
From: <jp.troll@gmail.com>
To: <michel-rene@debian.lab>
Subject: Cours message de test

10 mots c'est pas assez long pour un troll.
.
quit
EOF

Step 3: Check the statistics of the user account with the command dspam_stats

# dspam_stats michel-rene@debian.lab
michel-rene@debian.lab  TP:     0 TN:     1 FP:     0 FN:     0 SC:     0 NC:     0
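For scripting or monitoring, the counters of a dspam_stats line can be split into one value per line. The filter below is a sketch based only on the output format shown above. As far as we know, the abbreviations stand for true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), and spam and nonspam messages fed from a corpus (SC, NC). stats_fields is a hypothetical name.

```shell
# Print one "KEY=value" pair per counter from dspam_stats lines on stdin.
# Field 1 is the user name; the rest alternate between "KEY:" and a number.
stats_fields() {
    awk '{
        for (i = 2; i < NF; i += 2) {
            key = $i
            sub(/:$/, "", key)      # "TP:" becomes "TP"
            print key "=" $(i + 1)
        }
    }'
}

echo 'michel-rene@debian.lab  TP:     0 TN:     1 FP:     0 FN:     0 SC:     0 NC:     0' | stats_fields
```

With a live installation, the same filter can be fed directly: `dspam_stats michel-rene@debian.lab | stats_fields`.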

Step 4: Check the list of tokens and the associated probabilities via dspam_dump

# dspam_dump michel-rene@debian.lab
4311867737599848632  S: 00000  I: 00001  P: 0.4000 LH: Wed Sep  8 21:20:22 2010
9486336444479993084  S: 00000  I: 00001  P: 0.4000 LH: Wed Sep  8 21:20:22 2010
18360635214432484661 S: 00000  I: 00001  P: 0.4000 LH: Wed Sep  8 21:20:22 2010
[…]

These tokens are associated with an innocent message, that is why the value S (spam) is zero and the value I (for Innocent) is one. Also, take note that the tokenizer 'OSB' creates 114 tokens for this small message (a few headers have been added, however, by Postfix). You can see the statistics associated with a particular token in the dictionary by entering its text at the command line. Obviously, with OSB as the tokenizer, the difficulty is knowing the original text of the token.

# dspam_dump michel-rene@debian.lab un+troll
1157728372545618534  S: 00000  I: 00001  P: 0.4000

# dspam_dump michel-rene@debian.lab assez+#+#+#+troll
695260355258399736   S: 00000  I: 00001  P: 0.4000

Step 5: Mark the message as Spam, for example in the web interface.

Step 6: Check the statistics of DSPAM user again:

# dspam_stats michel-rene@debian.lab
michel-rene@debian.lab  TP:     0 TN:     0 FP:     0 FN:     1 SC:     0 NC:     0

Step 7: Check the status of tokens again:

# dspam_dump michel-rene@debian.lab
4311867737599848632  S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
9486336444479993084  S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
18360635214432484661 S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
[…]

The update completed correctly: these tokens are now associated with spam (S is one, I is zero).
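Rather than reading the dump line by line, the before and after states can be compared with a short summary. This sketch relies on the column layout of the dspam_dump output shown above. dump_summary is a hypothetical helper.

```shell
# Summarise dspam_dump output read on stdin: count the tokens that have at
# least one spam hit and those that have at least one innocent hit.
# Lines that do not look like "<token> S: n I: n ..." are ignored.
dump_summary() {
    awk '$2 == "S:" && $4 == "I:" {
        if ($3 + 0 > 0) spam++
        if ($5 + 0 > 0) innocent++
    }
    END { printf "spam=%d innocent=%d\n", spam, innocent }'
}

dump_summary <<'EOF'
4311867737599848632  S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
9486336444479993084  S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
18360635214432484661 S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
EOF
```

Running `dspam_dump michel-rene@debian.lab | dump_summary` before and after the retraining should show the tokens moving from the innocent count to the spam count.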

These few commands let us not only check that our anti-spam setup is functional, but also follow the lifecycle of tokens over time.

- Conclusion

Our tour of DSPAM is complete. I have not really talked about success rates and other criteria generally used to rank anti-spam solutions, for two reasons. First, these figures are often misleading: the results depend heavily on user behavior, so it is difficult to get reproducible figures. Second, there is no real point, today, in building an infrastructure on a single anti-spam product. Integrating greylisting with Postfix is trivial, and it is even possible to chain SpamAssassin and DSPAM one behind the other (just call a SpamAssassin content filter after returning to Postfix from DSPAM).

So in the end, the best way to fight spam is to use multiple techniques, but what we have seen in these pages is that DSPAM is a great tool for this work. It can be a bit difficult to pick up initially, but the results and the flexibility of the product are well worth the initial investment.

Julien Vehent, and the DSPAM team - 2011
