en:ressources:dossiers:dspam [2011/10/19 12:46] (current)
 +======Effectively fighting spam with DSPAM ======
  
 +Have you ever tasted spam? I mean the real spam, the kind that is served in vaguely rectangular boxes and is the color of ham. It was served to me once on Christmas Eve, and I can assure you that it's really not good! The spam we'll talk about here is different: more electronic, but just as disgusting.
 +
 +Fighting spam is probably the most complex task of a postmaster. The techniques are numerous, and it is essential to combine them to get a relevant result. DSPAM is a statistical engine that analyzes a text and produces a spam probability based on its content. The internal mechanisms of DSPAM are a bit tricky to understand, which is why we recommend you spend some time reading this documentation (especially chapter 3, on configuration) while preparing your setup.
 +
 +We will discuss the implementation of DSPAM with a Postfix SMTP server. We will rely on the source code for the installation, but you should probably check your distribution's repository for existing packages.
 +
 +===== - Overview =====
 +DSPAM was originally written in 2003 by Jonathan Zdziarski, an American developer, following his research on the classification of spam. DSPAM was subsequently sold to Sensory Networks in 2006. From the beginning, DSPAM was released under GPLv2 or later; this was changed to GPLv2 in 2009. Since 2009, DSPAM has been maintained by a small group of developers.
 +
 +DSPAM is mainly written in C and requires a backend to store its data. Drivers for MySQL, PostgreSQL and SQLite are available; it is also possible to rely on a Hash Driver that creates data files on disk, for use without a relational database. This "Hash Driver" is the default option and used to be the fastest, but nowadays a PostgreSQL backend is preferred.
 +
 +DSPAM produces statistical data for each user, because this approach has proved more efficient than a global ruleset shared by all users (see //Technology// below). Thus, each address of the domain (written as <user>@<domain>, or as a flat name <user>) has tokenized data in one of the supported storage engines (MySQL, PostgreSQL, SQLite or Hash Driver), plus some additional data such as logs, statistics for the Web UI, quarantine, corpora, preferences, etc. in DSPAM's data directory. It is possible to share information between multiple users in the form of groups. There are several types of groups, which we will detail later.
 +==== - Technology ====
 +Back in 2002, Paul Graham, another U.S. developer, published "A Plan for Spam", an article that changed the way people analyze spam. In the early days, antispam rules were largely based on criteria specific to spam, such as "under capitalized" or "contains 48 exclamation points". Paul Graham had worked on such rules, which worked pretty well, but the problem was that the remaining false positives, on the order of 1 to 2%, were extremely difficult to eliminate.
 +
 +His idea, which others had had before him without the same influence, is to divide the content of an e-mail into tokens (a token typically being a word, a header component, an HTML tag, etc.) and to calculate statistics on them using Bayes' theorem. The results being very satisfactory, Graham's technique has become the norm and is at the heart of DSPAM.
 +
 +Graham also showed that, to be truly effective, statistics should be produced for each user individually. Using a global base for all users is less effective, since some words commonly used by one user may be considered spam by another (those of you who work for pharmaceutical companies certainly understand the principle).
 +===== - Installation =====
 +DSPAM source code is available at http://dspam.sourceforge.net/, and the latest version available at the time of writing is dspam-3.9.1-RC1.
 +
 +The archive contains the documentation, which is sorely lacking on the wiki. In fact, the quite detailed README files (and this document) form the core of what one has to know about DSPAM.
 +
 +Before starting the compilation,​ let us define what we want to do: 
 +  * DSPAM must interface with Postfix (as a content filter), and will therefore receive and reinject the email via TCP sockets on localhost. DSPAM should run in daemon mode.
 +  * Each user is identified by their full email address.
 +  * Each user will have their own dictionary of tokens and associated statistics.
 +
 +DSPAM does not really have external dependencies. A fresh install of Linux with a few tools installed (gcc, make, the backend libraries, …) is enough to build it. It is also important to run the daemon with limited privileges (e.g. as user 'dspam').
 +<note> The following configuration options fit the directory layout of a Debian system. You will have to adapt them to your own setup.</note>
 +<​code>​
 +$ su
 +# useradd -r -s /bin/false -U -d /var/spool/dspam dspam
 +# exit
 +$ ./configure --enable-daemon --enable-split-configuration --enable-syslog \
 +    --enable-clamav --enable-preferences-extension --enable-domain-scale \
 +    --with-dspam-home=/var/spool/dspam --with-dspam-home-owner=dspam \
 +    --with-dspam-home-group=dspam --with-dspam-owner=dspam --with-dspam-group=dspam \
 +    --with-storage-driver=hash_drv \
 +    --prefix=/usr/local/dspam --sysconfdir=/etc/dspam --mandir=/usr/share/man \
 +    --bindir=/usr/bin --sbindir=/usr/sbin --libdir=/usr/lib --includedir=/usr/include
 +$ make
 +$ su
 +# make install
 +</​code>​
 +The compilation options are detailed in './configure --help'. You can also enable debugging with the options '--enable-debug --enable-bnr-debug --enable-verbose-debug', but beware of the amount of logs produced (in /var/spool/dspam/log).
 +If you want to compile DSPAM with support for PostgreSQL as a storage backend (instead of the hash driver), you can use the following configuration parameters:​ 
 +<​code>​
 +$ ./configure --enable-daemon --enable-split-configuration --enable-syslog \
 +    --enable-clamav --enable-preferences-extension --enable-domain-scale \
 +    --with-dspam-home=/var/spool/dspam --with-dspam-home-owner=dspam \
 +    --with-dspam-home-group=dspam --with-dspam-owner=dspam --with-dspam-group=dspam \
 +    --with-storage-driver=pgsql_drv --with-pgsql-includes=/usr/include/postgresql/ \
 +    --with-pgsql-libraries=/usr/lib/ --enable-virtual-users \
 +    --prefix=/usr/local/dspam --sysconfdir=/etc/dspam --mandir=/usr/share/man \
 +    --bindir=/usr/bin --sbindir=/usr/sbin --libdir=/usr/lib --includedir=/usr/include \
 +    --enable-debug --enable-bnr-debug --enable-verbose-debug
 +</​code>​
 +<note>To build DSPAM with the PostgreSQL backend, you need the PostgreSQL client library and its development files (packages libpq5 and libpq-dev in Debian Squeeze).</note>
 +Of course, it is possible to compile DSPAM with support for more than one storage backend. To do so, you can use the following configuration parameters:
 +<​code>​
 +$ ./configure --enable-daemon --enable-split-configuration --enable-syslog \
 +    --enable-clamav --enable-preferences-extension --enable-domain-scale \
 +    --with-dspam-home=/var/spool/dspam --with-dspam-home-owner=dspam \
 +    --with-dspam-home-group=dspam --with-dspam-owner=dspam --with-dspam-group=dspam \
 +    --with-storage-driver=hash_drv,pgsql_drv --with-pgsql-includes=/usr/include/postgresql/ \
 +    --with-pgsql-libraries=/usr/lib/ --enable-virtual-users \
 +    --prefix=/usr/local/dspam --sysconfdir=/etc/dspam --mandir=/usr/share/man \
 +    --bindir=/usr/bin --sbindir=/usr/sbin --libdir=/usr/lib --includedir=/usr/include \
 +    --enable-debug --enable-bnr-debug --enable-verbose-debug
 +</​code>​
 +
 +As you can see, we have enabled both the Hash driver and PostgreSQL support. As soon as you use more than one storage backend, DSPAM compiles them as separate shared library files (libpgsql_drv.so for PostgreSQL and libhash_drv.so for the Hash driver) and lets you choose in dspam.conf which storage engine you would like to use.
 +===== - Configuring DSPAM =====
 +Before feeding DSPAM with the flow of emails from Postfix, we will configure and test it. 
 +The default **dspam.conf** configuration file comes with a large number of comments, but is not all that easy to interpret without a careful reading of the README and this documentation. 
 +**dspam.conf** is pre-filled with the parameters from the **./​configure** command. It contains the configuration options related to the chosen backend database, the home folder, and so on.
 +<​file>​
 +[…]
 +
 +# DSPAM Home: Specifies the base directory to be used for DSPAM storage ​
 +
 +Home /​var/​spool/​dspam ​
 +[…]
 +
 +#​StorageDriver /​usr/​lib/​dspam/​libhash_drv.so ​
 +StorageDriver /​usr/​lib/​dspam/​libpgsql_drv.so ​
 +[…]
 +</​file>​
 +
 +If you selected just one storage driver at configure time, you don't need to specify it in dspam.conf: DSPAM already knows which storage driver was compiled in and will use it.
 +
 +More on the Storage backends in the [[en:​ressources:​dossiers:​dspam#​storage_driver%C2%A0]] section.
 +
 +==== - Communication with the SMTP server ====
 +As said earlier, we want Postfix to communicate with DSPAM using TCP sockets. This setup requires two separate connections:
 +  * submit to dspam [SMTP server to DSPAM]:
 +    * DSPAM will listen on the chosen TCP port and wait for connections coming from the SMTP server
 +  * response from DSPAM [DSPAM to SMTP server]:
 +    * After analyzing the message, DSPAM sends it back to the SMTP server.
 +
 +The submission socket will receive messages from Postfix. It listens on port TCP/10033 (an arbitrary choice) and speaks LMTP (the Local Mail Transfer Protocol, a variant of SMTP designed for handing mail over within an infrastructure).
 +<​file>​
 +ServerPort 10033
 +ServerQueueSize 32
 +ServerPID /​var/​run/​dspam/​dspam.pid
 +ServerMode auto
 +ServerParameters "--deliver=innocent,spam -d %u"
 +ServerIdent "​localhost.localdomain"​
 +</​file>​
 +The directive **ServerParameters** tells DSPAM to reinject both innocent emails and spam, as opposed to keeping spam in quarantine. While testing your setup, it is better to forward suspected spam to the user's mailbox, and filter it using a mark in the Subject and/or in the headers, rather than quarantining it directly (note that it is possible to send a list of quarantined messages to your users daily).
 +
 +DSPAM will then connect to Postfix and reinject the email after analysis. The following parameters connect back to Postfix on port TCP/10034 (Postfix needs to be configured as well; we'll discuss that later).
 +<​file>​
 +DeliveryHost 127.0.0.1
 +DeliveryPort 10034
 +DeliveryIdent localhost
 +DeliveryProto SMTP
 +</​file>​
 +Note that we speak SMTP here and not LMTP anymore.
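Once the daemon is running, the submission side can be exercised by hand. The following is a minimal sketch, assuming DSPAM listens on 127.0.0.1:10033 as configured above; the function names and the test addresses are hypothetical and only serve the illustration. It uses the LMTP client from Python's standard library.

```python
import smtplib
from email.message import EmailMessage

DSPAM_HOST = "127.0.0.1"  # assumption: DSPAM runs on the same host as Postfix
DSPAM_PORT = 10033        # ServerPort from dspam.conf


def build_test_message(sender, recipient):
    """Build a small plain-text message to feed to the filter."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "DSPAM test message"
    msg.set_content("Hello, this is a harmless test message.")
    return msg


def submit_to_dspam(msg, sender, recipient):
    """Hand the message to the DSPAM daemon over LMTP.

    Requires a running daemon. Returns the dictionary of refused
    recipients reported by sendmail() (empty when all were accepted).
    """
    with smtplib.LMTP(DSPAM_HOST, DSPAM_PORT) as lmtp:
        return lmtp.sendmail(sender, [recipient], msg.as_string())


# Typical use, once DSPAM and the Postfix reinjection port are up:
#   msg = build_test_message("test@example.org", "user@example.org")
#   submit_to_dspam(msg, "test@example.org", "user@example.org")
```

If the message is accepted, DSPAM should reinject it to Postfix on the delivery port, so a listener on TCP/10034 has to exist for the test to complete.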
 +==== - Mode of learning ====
 +DSPAM starts its operations with empty dictionaries. This means that during the first weeks, DSPAM will learn a lot and filter little (a balance that progressively reverses).
 +
 +It also means that it is the responsibility of the users to mark emails as spam (or as ham, if DSPAM mistakenly marks a message as spam). It is up to the postmaster to give users a simple way to mark emails.
 +
 +Several learning methods exist and are described in the man page of DSPAM. The one that interests us here is called '​teft'​ and forces DSPAM to learn about each email it processes, innocent and spam. 
 +
 +This mode is particularly intensive because it goes through every email and creates or updates all the tokens created from the message in the user dictionary. It'​s perfect for a new user who needs to quickly build up a dictionary, but may consume too much CPU in a busy environment. 
 +To use teft mode, set the following directive in **dspam.conf**:​ 
 +<​file>​
 +TrainingMode teft 
 +</​file>​
 +To overcome this performance problem, other learning modes exist. Mode 'tum', for example, also learns from every message, but only for a limited period of time (the training period); afterward, it only updates the dictionary upon user interaction.
 +
 +This parameter can be set for each user separately, as we will see in the preferences. The default mode is the one set in dspam.conf. 
 +
 +==== - Method of Detection ====
 +We are now at the core of DSPAM: the method of detection. DSPAM essentially performs a statistical analysis composed of 3 sub-parts:
 +  - Content tokenizing ​
 +  - Statistical algorithm 
 +  - Calculation of probability 
 +
 +
 +=== - Content Tokenizing ===
 +
 +This is the module that breaks up the content, turns each piece into a token, and stores the token's unique hash in the user dictionary. Tokens can take several forms depending on the mode chosen, the most basic being to take the words one by one, so that every word is a token.
 +
 +But there are also more advanced modules, capable of taking into account different parts of each sentence. For those who like German prose, here is how a sentence is cut by the different modules:
 +
 +“Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht”
 +
 +The character '​+'​ means a combination of words, the character '#'​ denotes a word not taken into account. 
 +
 +**WORD module**: each word is a token, which gives 13 tokens.
 +<​file>​
 +TOKEN: ‘Heute’ CRC: 6716984897371635712
 +TOKEN: ‘Abend’ CRC: 6670531613365895168
 +TOKEN: ‘war’ CRC: 4772677679197454336
 +TOKEN: ‘ich’ CRC: 6329956816985784320
 +[...]
 +</​file>​
 +**CHAIN module**: each word is joined to the word that follows it, so we have one token less, or 12 tokens.
 +<​file>​
 +TOKEN: ‘Heute+Abend’ CRC: 9299536586222406967
 +TOKEN: ‘Abend+war’ CRC: 5205867775940263209
 +TOKEN: ‘war+ich’ CRC: 6329956649787979024
 +TOKEN: ‘ich+mit’ CRC: 5158416839735805488
 +[...]
 +</​file>​
 +
 +**OSB module** (Orthogonal Sparse Bigram): for each word, it creates a sliding window of 5 words. The word is paired with each neighbor up to 4 positions away, and the words in between are skipped.
 +<​file>​
 +TOKEN: ‘Heute+#​+#​+#​+mit’ CRC: 2006452661602586241
 +TOKEN: ‘Abend+#​+#​+mit’ CRC: 5482652074219693289
 +TOKEN: ‘war+#​+mit’ CRC: 15707817493435847227
 +TOKEN: ‘ich+mit’ CRC: 5158416839735805488
 +TOKEN: ‘Abend+#​+#​+#​+meiner’ CRC: 8544044731047037263
 +TOKEN: ‘war+#​+#​+meiner’ CRC: 14722667808637756004
 +[...]
 +</​file>​
 +**SBPH module** (Sparse Binary Polynomial Hashing): similar to OSB, but more flexible. It also uses a sliding window of 5 words, but it additionally considers the intermediate words in the window instead of always skipping them (skipped words are represented by a '#' in OSB).
 +<​file>​
 +TOKEN: ‘mit’ CRC: 5158417007107899392
 +TOKEN: ‘ich+mit’ CRC: 5158416839735805488
 +TOKEN: ‘war+#​+mit’ CRC: 15707817493435847227
 +TOKEN: ‘war+ich+mit’ CRC: 6905336139605378569
 +TOKEN: ‘Abend+#​+#​+mit’ CRC: 5482652074219693289
 +TOKEN: ‘Abend+#​+ich+mit’ CRC: 2006454003823721484
 +</​file>​
 +Obviously, the dictionary of SBPH is a lot larger than that of OSB, which is in turn larger than those of CHAIN and WORD.
 +
 +The advantage of tokenizers such as OSB and SBPH is that they can identify phrases that have never been seen before, by using combination (the '+' character) and skipping (the '#' character).
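To make the difference in dictionary size concrete, here is a minimal sketch of the WORD, CHAIN and OSB cuts applied to the example sentence. It only reproduces the token shapes listed above; it is not DSPAM's implementation, it does not compute the CRC values, and the function names are hypothetical.

```python
SENTENCE = "Heute Abend war ich mit meiner Freundin im Kino und habe viel gelacht"
WINDOW = 5  # size of the sliding window used by OSB


def word_tokens(words):
    """WORD: every word is a token."""
    return list(words)


def chain_tokens(words):
    """CHAIN: every word is joined to the word that follows it."""
    return [f"{first}+{second}" for first, second in zip(words, words[1:])]


def osb_tokens(words, window=WINDOW):
    """OSB: pair each word with every earlier word of the window,
    replacing the words in between with '#'."""
    tokens = []
    for end in range(len(words)):
        for distance in range(window - 1, 0, -1):
            start = end - distance
            if start < 0:
                continue  # the window reaches before the first word
            skipped = "+#" * (distance - 1)
            tokens.append(f"{words[start]}{skipped}+{words[end]}")
    return tokens


words = SENTENCE.split()
for name, tokens in (("WORD", word_tokens(words)),
                     ("CHAIN", chain_tokens(words)),
                     ("OSB", osb_tokens(words))):
    print(f"{name}: {len(tokens)} tokens, e.g. {tokens[-1]}")
```

For a single 13-word sentence, OSB already produces about three times as many tokens as WORD or CHAIN, which is why the storage backend needs to be sized accordingly.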
 +
 +For example, take the token "**Buy Viagra + # +**". This single token is able to identify phrases such as:
 +<​file>​
 +Buy Viagra cheep 
 +Buy Viagra good 
 +Buy Viagra Herbal
 +Buy Viagra exclusive 
 +Buy Viagra boosting 
 +Buy Viagra fresh
 +Buy Viagra qualitative 
 +</​file>​
 +In the same situation, the WORD tokenizer is able to identify individual words only, and not their combination. The CHAIN tokenizer would do almost nothing, unless you have all the combinations in the dictionary.
 +
 +The SBPH mechanism also associates a weight with each token. Thus, a token of 5 words has a much greater weight than a token of only one word, according to the formula: weight = 2 ^ (2 * n), where n is the number of neighboring words taken into account (skipped words, marked '#', do not count).
 +
 +Still juggling with our German prose, the weight table for the sentence "Heute Abend war ich mit" is as follows:
 +^Token^Weight^
 +|Heute|1|
 +|Heute+Abend|4|
 +|Heute+#​+war|4|
 +|Heute+Abend+war|16|
 +|Heute+#​+#​+ich|4|
 +|Heute+Abend+#​+ich|16|
 +|Heute+#​+war+ich|16|
 +|Heute+Abend+war+ich|64|
 +|Heute+#​+#​+#​+mit|4|
 +|Heute+Abend+#​+#​+mit|16|
 +|Heute+#​+war+#​+mit|16|
 +|Heute+Abend+war+#​+mit|64|
 +|Heute+#​+#​+ich+mit|16|
 +|Heute+Abend+#​+ich+mit|64|
 +|Heute+#​+war+ich+mit|64|
 +|Heute+Abend+war+ich+mit|256|
 +The weight is then used to multiply the impact of the token when calculating probability. 
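The table above can be regenerated with a few lines of code. This is a sketch under the assumption that the first word of the window is always kept and that each of the four following words is either kept or replaced by '#', with trailing '#' dropped; the function name is hypothetical and this is not DSPAM's own code.

```python
def sbph_weighted_tokens(window):
    """Return (token, weight) pairs for one SBPH window.

    The weight follows the formula from the text:
    weight = 2 ^ (2 * n), n being the number of neighboring words kept.
    """
    first, rest = window[0], window[1:]
    result = []
    for mask in range(2 ** len(rest)):
        parts = [first]
        kept = 0
        for position, word in enumerate(rest):
            if (mask >> position) == 0:
                break  # only skipped words remain: drop the trailing '#'
            if (mask >> position) & 1:
                parts.append(word)
                kept += 1
            else:
                parts.append("#")
        result.append(("+".join(parts), 2 ** (2 * kept)))
    return result


for token, weight in sbph_weighted_tokens("Heute Abend war ich mit".split()):
    print(f"{token:28} {weight}")
```

The 16 rows come out in the same order as in the table, from ''Heute'' (weight 1) to ''Heute+Abend+war+ich+mit'' (weight 256).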
 +
 +For our configuration,​ we are going to use the OSB tokenizer. But this should not prevent you from experimenting with SBPH. However, unless you have very specific needs, CHAIN and WORD are probably too primitive for you.
 +
 +The directive to be placed in **dspam.conf** is therefore:
 +<​file>​
 +Tokenizer osb 
 +</​file>​
 +
 +=== - Statistical Algorithm ===
 +With all these tokens, the challenge is to determine which ones influence the decision, and in what proportions. As we have said, DSPAM does not come with a pre-filled dictionary. It cannot tell immediately whether the token 'Abend+#+#+mit' is relevant to determining whether the message is spam or not. But it learns, and it will adjust the probabilities associated with each token and apply them to new emails.
 +
 +Beyond the mere calculation of probability, the statistical algorithm defines which criteria are taken into account when calculating the spam probability. DSPAM gives us the choice between several statistical algorithms:
 +
 +  * naive: Naive-Bayesian (All Tokens)
 +  * graham: Graham-Bayesian ("A Plan for Spam")
 +  * burton: Burton-Bayesian (SpamProbe)
 +  * chi-square: Fisher-Robinson's Chi-Square Algorithm
 +
 +It is also possible to combine these algorithms. A graham+burton combination generally shows a good false positive / false negative balance, so that is what we will use, via the following directive:
 +<​file>​
 +Algorithm graham burton 
 +</​file>​
 +But here is a little background to understand the mechanics behind this. The naive approach (of the algorithm of the same name) considers all the tokens composing a message. Each token is initialized with a statistically neutral value of 0.5, which means neither spam nor ham (a value closer to 1 means spam). The problem with this simple approach is that a spammer could include a long text containing common words ("and", "hello", ...) alongside one or two sentences carrying the spam message. The algorithm treats all tokens at the same level, which allows the "support" text to reduce the likelihood of the final message being classified as spam.
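The dilution effect can be illustrated numerically. The sketch below combines token probabilities with the formula P = S / (S + H) described in the Bayesian filtering section; the token values are invented for the illustration, and it assumes the padding words have each settled slightly on the innocent side (0.4) in the user's dictionary.

```python
from math import prod


def naive_spam_probability(token_probabilities):
    """Combine ALL token probabilities, as the naive approach does."""
    s = prod(token_probabilities)
    h = prod(1.0 - p for p in token_probabilities)
    return s / (s + h)


spammy = [0.95, 0.90]     # two tokens carrying the actual spam message
padding = [0.40] * 20     # assumption: 20 common, mostly innocent words

plain = naive_spam_probability(spammy)
padded = naive_spam_probability(spammy + padding)

print(f"spam tokens alone:    {plain:.3f}")
print(f"with 'support' text:  {padded:.3f}")
```

With these made-up values, the same two spam tokens go from a score well above 0.9 to a score well below 0.5 once the padding is added, which is exactly the weakness described above.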
 +
 +This approach was discussed by Paul Graham (him again) to demonstrate that a better solution is possible. The Graham algorithm therefore uses the following criteria:
 +
 +  - Analyze the message and select the 15 most relevant tokens. The tokens selected are those with the highest deviation from the neutral probability 0.5.
 +  - Ignore tokens that have been seen fewer than 5 times in the past.
 +  - Use tokens only once. If a token is present twice in the message, the second occurrence will not be taken into account in the calculation.
 +  - When adding new tokens, define an initial probability of 0.4 instead of 0.5. This allows DSPAM to bias tokens toward innocence, until proven guilty.
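The four criteria above can be sketched as a small selection function. This is a simplified reading of the list, not DSPAM's code: the dictionary layout (token mapped to a pair of spam and ham counts) and all names are assumptions made for the example.

```python
NEUTRAL = 0.5
NEW_TOKEN_PROBABILITY = 0.4


def graham_select(message_tokens, dictionary, max_tokens=15, min_hits=5):
    """Pick the tokens that will take part in the final calculation.

    dictionary maps token -> (spam_hits, ham_hits)  [hypothetical layout].
    """
    candidates = {}
    for token in message_tokens:
        if token in candidates:
            continue                      # criterion 3: use each token once
        entry = dictionary.get(token)
        if entry is None:
            probability = NEW_TOKEN_PROBABILITY   # criterion 4: new token
        else:
            spam_hits, ham_hits = entry
            if spam_hits + ham_hits < min_hits:
                continue                  # criterion 2: seen too rarely
            probability = spam_hits / (spam_hits + ham_hits)
        candidates[token] = probability
    # criterion 1: keep the tokens that deviate most from 0.5
    ranked = sorted(candidates.items(),
                    key=lambda item: abs(item[1] - NEUTRAL),
                    reverse=True)
    return ranked[:max_tokens]


dictionary = {"viagra": (231, 11), "hi": (25, 62), "rare": (1, 1)}
message = ["hi", "viagra", "viagra", "rare", "new"]
for token, probability in graham_select(message, dictionary):
    print(f"{token:8} {probability:.2f}")
```

Raising ''max_tokens'' and removing the deduplication step would move this sketch toward the Burton variant.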
 +
 +Brian Burton uses a modified version of Graham's algorithm. The number of tokens considered is increased from 15 to 27, and if a relevant token appears several times, it is taken into account several times. This algorithm was first incorporated into the antispam software SpamProbe.
 +
 +Still in the movement of the early 2000s, Gary Robinson published an improved version of Graham's algorithm in a 2003 issue of the Linux Journal. His version is at the heart of the SpamBayes project, another classification engine, and handles tokens that appear infrequently more effectively. His approach is based on the Chi-Square statistical test, hence the name of the directive in DSPAM.
 +
 +It's hard to say which of these algorithms is most appropriate. All achieve excellent results; feel free to experiment with all of them.
 +
 +=== - Calculation of probabilities ===
 +So we have tokens, whose initial value is 0.4 using the Graham algorithm, and calculation parameters. 
 +
 +The last step is to calculate the probability that a message is spam or not. DSPAM uses what is called the pValue, and provides three algorithms to perform this calculation. 
 +
 +These statistical algorithms are:
 +  * markov: from the Russian mathematician Andrey Markov
 +  * robinson: from Gary Robinson
 +  * bcr: Bayesian Chain Rule, by Paul Graham
 +
 +The standard algorithm, the one we will use in our example, is '​bcr'​ Bayesian Chain Rule, which is also the algorithm described by Paul Graham in his article "A Plan for Spam". To use it, set the following parameter in **dspam.conf**:​
 +<​file>​
 +Pvalue bcr 
 +</​file>​
 +
 +== - Bayesian filtering ==
 +
 +When talking about anti-spam technologies, the work of Thomas Bayes is consistently cited. Bayes' theorem makes it possible to calculate the probability of an event from its recorded occurrences in the past. In other words, it's what we call "experience".
 +
 +In our case, the formula to calculate the probability that a message is spam or not is: 
 +
 +**P = S / (S + H) **
 +
 +With: 
 +  * P: probability of the message being spam
 +  * S: product of the probabilities associated with each token composing the message: **S = P(token-1) * P(token-2) * ... * P(token-n)**
 +  * H: product of the inverse probabilities of the tokens: **H = (1 - P(token-1)) * (1 - P(token-2)) * ... * (1 - P(token-n))**
 +
 +As we saw, when a token is added to the dictionary, it takes a default value of 0.4. Whenever DSPAM learns from a message containing the same token, it changes the value. Therefore, if the word "Viagra" is present in 10 messages, and 9 of them are spam, the probability associated with this token will be: **P(Viagra) = 9 / (9 + 1) = 0.9**
 +
 +Consider the message "Hi! Buy Viagra." We will apply the WORD tokenizer to this message (WORD is easier to handle for the example).
 +
 +The first thing the tokenizer does is remove the characters that are not taken into account, such as the exclamation point. The message is then "Hi Buy Viagra".
 +
 +Each word is a token of its own. Let us imagine that the user's dictionary is in the following state:
 +^Token^Spam count (s)^Ham count (h)^Probability p=s/(s+h)^
 +|Hi|25|62|0.29|
 +|Buy|157|87|0.64|
 +|Viagra|231|11|0.95|
 +
 +We can calculate the final pvalue of the message with the Bayes formula: 
 +  * S = 0.29 * 0.64 * 0.95 = 0.176
 +  * H = (1 - 0.29) * (1 - 0.64) * (1 - 0.95) = 0.71 * 0.36 * 0.05 = 0.0128
 +  * Pvalue = S / (S + H) = 0.176 / (0.176 + 0.0128) = 0.93
 +So the final probability of the message being spam is 93%. 
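The same calculation can be checked with a short script. This is a sketch of the worked example only: the regular expression stands in for the WORD tokenizer, the dictionary holds the counts from the table, and token probabilities are rounded to two decimals as in the table (without rounding, the result comes out slightly higher). All names are hypothetical.

```python
import re
from math import prod

# Spam and ham counts from the table above: token -> (s, h)
DICTIONARY = {"Hi": (25, 62), "Buy": (157, 87), "Viagra": (231, 11)}


def tokenize_words(text):
    """Crude stand-in for the WORD tokenizer: keep alphabetic words only."""
    return re.findall(r"[A-Za-z]+", text)


def token_probability(spam_count, ham_count):
    """p = s / (s + h), rounded to two decimals like the table."""
    return round(spam_count / (spam_count + ham_count), 2)


def spam_probability(text):
    """Bayesian combination: P = S / (S + H)."""
    probabilities = [token_probability(*DICTIONARY[token])
                     for token in tokenize_words(text)]
    s = prod(probabilities)
    h = prod(1.0 - p for p in probabilities)
    return s / (s + h)


print(f"{spam_probability('Hi! Buy Viagra.'):.2f}")
```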
 +
 +== - Markov ==
 +
 +However, in the particular case where the tokenizer is SBPH, it is possible to use the concept of weight of the tokens in the statistical calculation. That is what the '​markov'​ method does, but it only works if SBPH is enabled (only this tokenizer keeps weight associated to the tokens). 
 +Markov is an improved version of **bcr** that can tell if a token is very specific (eg, 5 of 5 words) and multiply its impact on the pValue (256 times greater than a token with a single word). In short, the weight value of the token is used to multiply the impact of the probability associated with the token in the overall calculation. 
 +
 +== - Confidence ==
 +
 +DSPAM exports a confidence value of the result produced. The confidence is calculated based on the likelihood that the message is spam or not.
 +
 +When the message is innocent, the closer the value is to zero, the more confident DSPAM is in its result: confidence is high. Thus,​ if the message is innocent, confidence equals (1 - probability). (Example:​ probability = 0.0184, confidence = 1 - 0.0184 = 0.9816).
 +
 +When the message is spam, the closer the value is to 1, the more confident DSPAM is in its outcome. Thus, if the message is spam, confidence equals probability.
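The two cases can be summed up in one small function. This is only a sketch of the rule as described here; the function name is hypothetical, and how DSPAM decides between spam and innocent is deliberately left out (the verdict is passed in as a parameter).

```python
def confidence(probability, is_spam):
    """Confidence in the verdict, given the spam probability of the message."""
    return probability if is_spam else 1.0 - probability


print(round(confidence(0.0184, is_spam=False), 4))  # innocent message from the example
print(round(confidence(0.93, is_spam=True), 4))     # message classified as spam
```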
 +
 +That's all for the mathematics. If you like this topic, do not hesitate to continue the discussion on the mailing list.
 +==== - Storage driver ====
 +=== - Using the Hash Driver ===
 +The default and most straightforward backend to configure is the Hash Driver. It maintains per-user dictionaries of tokens in the user's folder.
 +To use this driver, set the following parameter at the beginning of dspam.conf.
 +<​code>​
 +StorageDriver /​usr/​lib/​dspam/​libhash_drv.so  ​
 +</​code>​
 +<​note>​The examples in this document are based on the Hash Driver, but can easily be transferred to any other backend.</​note>​
 +Tokens that DSPAM generates take up space, lots of space. When using the Hash driver, DSPAM can set the maximum size of the hash file that each user will use (its dictionary),​ and with a tokenizer such as OSB, you must ensure it will be large enough.
 +
 +For example, a rather active account, receiving between 200 and 300 messages a day will generate roughly 2.5 million tokens in the space of two weeks. Obviously,​ this value will vary greatly depending on whether the messages contain the same tokens or not.
 +
 +Setting 'HashRecMax' to a little over 6 million entries gives DSPAM some leeway, and we also let it grow the file up to 16 million entries (in increments of about 50000), just in case.
 +<​file>​
 +HashRecMax 6291469 
 +HashAutoExtend on
 +HashMaxExtents 10000000
 +HashExtentSize 49157
 +</​file>​
 +It also means that the hash file of each user will be initialized with a size close to 100MB! This can be a problem on a system managing a large number of users.
 +
 +=== - Using the Postgresql Driver ===
 +
 +While the examples in this documentation are mostly based on the Hash Driver, you will probably choose to use another type of backend. DSPAM works extremely well with a PostgreSQL backend, and this is the recommended setup.
 +
 +To use the PostgreSQL driver, set the following parameter at the beginning of dspam.conf:
 +<​code>​
 +StorageDriver /​usr/​lib/​dspam/​libpgsql_drv.so
 +</​code>​
 +
 +Let's take a closer look at the configuration procedure.
 +
 +== - Granting access to the database ==
 +Assuming PostgreSQL (v8.4 in this example) is installed, the first step is to create a database for dspam, and grant access to the user 'dspam'.
 +
 +On the command line, create an empty database named **dspam**:
 +<​code>​
 +# su postgres
 +postgres@server:/​$ psql 
 +psql (8.4.7)
 +Type "​help"​ for help.
 +postgres=# create role dspam login;
 +CREATE ROLE
 +postgres=# alter role dspam password '​309dj20ejd903j';​
 +ALTER ROLE
 +postgres=# create database dspam owner dspam;
 +CREATE DATABASE
 +</​code>​
 +
 +Then edit **/​etc/​postgresql/​8.4/​main/​pg_hba.conf** to grant access to user dspam:
 +<​code>​
 +ramiel:/​home/​julien/​dspam#​ cd /​etc/​postgresql/​8.4/​main/​
 +ramiel:/​etc/​postgresql/​8.4/​main#​ vim pg_hba.conf
 +
 +[...]
 +
 +# TYPE  DATABASE ​   USER        CIDR-ADDRESS ​         METHOD
 +local   ​dspam ​      ​dspam ​                            ​password
 +</​code>​
 +
 +You can then connect to PostgreSQL as user dspam (make sure user dspam has a login shell such as /bin/bash in /etc/passwd, otherwise 'su' won't work).
 +
 +<​code>​
 +server:/# su dspam
 +dspam@server:/​$ psql -d dspam -U dspam -h localhost
 +psql (8.4.7)
 +Type "​help"​ for help.
 +
 +dspam=> \du
 +            List of roles
 + Role name | Attributes ​ | Member of 
 +-----------+-------------+-----------
 + ​dspam ​    ​| ​            | {}
 + ​postgres ​ | Superuser ​  | {}
 +           : Create role   
 +           : Create DB     
 +
 +dspam=> \q
 +dspam@server:/​$ ​
 +</​code>​
 +
 +You can try to create a test table to check that **dspam** user has the appropriate permissions:​
 +<​code>​
 +dspam=> create table test (test int);
 +CREATE TABLE
 +dspam=> \d
 +       List of relations
 + ​Schema | Name | Type  | Owner 
 +--------+------+-------+-------
 + ​public | test | table | dspam
 +(1 row)
 +
 +dspam=> drop table test;
 +DROP TABLE
 +</​code>​
 +
 +== - Create the database schema ==
 +The database schemas are located in the source code of DSPAM, in the folder **src/​tools.pgsql_drv**.
 +
 +However, before importing the schemas, we are going to create a procedural language in the DSPAM database. This is done using the command below:
 +<​note>​The createlang command is a shell command, you need to execute this on the command line of your server, not in the postgresql prompt.</​note>​
 +<​code>​
 +dspam@server:/​$ createlang plpgsql dspam
 +</​code>​
 +
 +Now go back to the Postgresql prompt and import the schemas **pgsql_objects.sql** and **virtual_users.sql**:​
 +<​code>​
 +dspam=> \i /​home/​julien/​dspam-3.9.1-RC1/​src/​tools.pgsql_drv/​pgsql_objects.sql
 +
 +[... tables and sequences creation output ...]
 +
 +dspam=> \i /​home/​julien/​dspam-3.9.1-RC1/​src/​tools.pgsql_drv/​virtual_users.sql
 +</​code>​
 +<note>You might receive some warnings when the import scripts try to perform an 'analyze' and don't have the permissions to do so. You can safely ignore them.</note>
 +The dspam database should then be in the following state (tables and indexes):
 +<​code>​
 +dspam=> \d
 +                 List of relations
 + ​Schema |          Name          |   ​Type ​  | Owner 
 +--------+------------------------+----------+-------
 + ​public | dspam_preferences ​     | table    | dspam
 + ​public | dspam_signature_data ​  | table    | dspam
 + ​public | dspam_stats ​           | table    | dspam
 + ​public | dspam_token_data ​      | table    | dspam
 + ​public | dspam_virtual_uids ​    | table    | dspam
 + ​public | dspam_virtual_uids_seq | sequence | dspam
 +(6 rows)
 +
 +dspam=> \di
 +                              List of relations
 + ​Schema |             ​Name ​            | Type  | Owner |        Table         
 +--------+------------------------------+-------+-------+----------------------
 + ​public | dspam_preferences_uid_key ​   | index | dspam | dspam_preferences
 + ​public | dspam_signature_data_uid_key | index | dspam | dspam_signature_data
 + ​public | dspam_stats_pkey ​            | index | dspam | dspam_stats
 + ​public | dspam_token_data_uid_key ​    | index | dspam | dspam_token_data
 + ​public | dspam_virtual_uids_pkey ​     | index | dspam | dspam_virtual_uids
 + ​public | id_virtual_uids_01 ​          | index | dspam | dspam_virtual_uids
 + ​public | id_virtual_uids_02 ​          | index | dspam | dspam_virtual_uids
 +(7 rows)
 +</​code>​
 +
 +== - Configure DSPAM to connect to Postgresql ==
 +
 +The last step is simply to feed **dspam.conf** with the parameters to connect to the database. The configuration file comes with a Postgresql section where you can uncomment the configuration parameters and set the proper values:
 +
 +<​file>​
 +# --- PostgreSQL ---
 +
 +# For PgSQLServer you can Use a TCP/IP address or a socket. If your socket is
 +# in /​var/​run/​postgresql/​.s.PGSQL.5432 specify just the path where the socket
 +# resits (without .s.PGSQL.5432).
 +
 +PgSQLServer ​   127.0.0.1
 +PgSQLPort ​     5432
 +PgSQLUser ​     dspam
 +PgSQLPass ​     309dj20ejd903j
 +PgSQLDb ​       dspam
 +
 +# If you're running DSPAM in client/​server (daemon) mode, uncomment the
 +# setting below to override the default connection cache size (the number
 +# of connections the server pools between all clients).
 +#
 +PgSQLConnectionCache 3
 +
 +</​file>​
 +
 +Upon restart, DSPAM will create 3 connections to the Postgresql database.
 +<​code>​
 +dspam    19333     ​1 ​ 0 Apr01 ?        00:16:21 /​usr/​bin/​dspam --daemon
 +postgres 19334  9851  0 Apr01 ?        00:03:12 postgres: dspam dspam 127.0.0.1(57278) idle                                                                                 
 +postgres 19337  9851  0 Apr01 ?        00:49:12 postgres: dspam dspam 127.0.0.1(57279) idle                                                                                 
 +postgres 19341  9851  0 Apr01 ?        00:06:11 postgres: dspam dspam 127.0.0.1(57280) idle   
 +</​code>​
 +==== - Whitelist====
 +DSPAM can observe the senders of messages for a given recipient and automatically whitelist any sender who has sent more than 20 emails, none of which was flagged as spam. This handy feature needs no configuration other than:
 +<​file>​
 +Feature whitelist
 +</​file>​
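The whitelisting rule described above can be sketched in a few lines. This is only an illustration of the rule as stated in this article, not DSPAM's actual implementation; the class `SenderHistory` and its methods are hypothetical names.

```python
from collections import defaultdict

# Mirrors the whitelistThreshold value used in this article (an assumption
# of this sketch, not a value read from a running DSPAM).
WHITELIST_THRESHOLD = 20


class SenderHistory:
    """Hypothetical per-recipient tally of messages seen from each sender."""

    def __init__(self):
        self.innocent = defaultdict(int)
        self.spam = defaultdict(int)

    def record(self, sender, is_spam):
        """Count one message from `sender`, classified as spam or not."""
        if is_spam:
            self.spam[sender] += 1
        else:
            self.innocent[sender] += 1

    def is_whitelisted(self, sender):
        """True once the sender has sent more than the threshold of
        messages and none of them was flagged as spam."""
        return (self.innocent[sender] > WHITELIST_THRESHOLD
                and self.spam[sender] == 0)
```

A single message flagged as spam is enough to take the sender out of the whitelist in this sketch.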
 +==== - The preferences====
 +Each user can set their own preferences via the web interface (we will install it later). However, it is possible to set default values for those preferences.
 +
 +For example, the default configuration does not deliver spam to users, but places it in quarantine. To change this behavior, we modify the following parameters in **dspam.conf**:
 +<​file>​
 +Preference "​spamAction=tag" ​    # { quarantine | tag | deliver } -> default:​quarantine
 +Preference "​spamSubject=[SPAM]"​ # { string } -> default:​[SPAM]
 +Preference "​tagSpam=on" ​        # { on | off }
 +Preference "​tagNonspam=off" ​    # { on | off }
 +</​file>​
 +
 +There are many such preferences. You can let users modify them by setting:
 +<​file>​
 +AllowOverride spamAction
 +AllowOverride spamSubject
 +AllowOverride tagSpam
 +AllowOverride tagNonspam
 +</​file>​
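The interplay between default preferences and 'AllowOverride' can be pictured with the following sketch. It is a simplified model of the behavior described above, not DSPAM code; the function `effective_preferences` and the sample dictionaries are hypothetical.

```python
# Defaults as set with the Preference lines above.
DEFAULTS = {
    "spamAction": "tag",
    "spamSubject": "[SPAM]",
    "tagSpam": "on",
    "tagNonspam": "off",
}

# Names listed with AllowOverride above.
ALLOW_OVERRIDE = {"spamAction", "spamSubject", "tagSpam", "tagNonspam"}


def effective_preferences(user_prefs, defaults=DEFAULTS, allowed=ALLOW_OVERRIDE):
    """Start from the defaults and apply only the user values
    whose name is allowed to be overridden."""
    prefs = dict(defaults)
    for key, value in user_prefs.items():
        if key in allowed:
            prefs[key] = value
    return prefs
```

In this model, a user who asks for 'spamAction=quarantine' gets it, while a value for a preference that is not listed in 'AllowOverride' is silently ignored.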
 +It is also possible to keep the DSPAM signature out of the message body by changing this preference from 'message' to 'headers':
 +<​file>​
 +Preference "signatureLocation=message"  # { message | headers } -> default:message
 +</​file>​
 +However, the signature in the body is quite handy for re-training messages, as we shall see later. So it's recommended to leave it there until you have a better solution to retrain spam.
 +
 +==== - Ignore some headers ====
 +
 +Since DSPAM takes the entire email into account when calculating probabilities, it can be useful to ignore some specific headers. For example, another antispam's headers, a DKIM signature, a date or a user agent might not be very useful to determine whether or not an email is spam.
 +
 +The configuration example that follows includes an extensive list of headers that can be safely ignored. Feel free to expand or reduce this list.
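
To picture the effect of an ignore list, here is a minimal sketch that drops the ignored headers before any analysis. It uses Python's standard `email` module and is not how DSPAM itself processes messages; the function name `headers_to_analyze`, the short `IGNORED` set and the case-insensitive comparison are assumptions made for this example.

```python
from email import message_from_string

# A few names taken from the IgnoreHeader list; lower-cased because this
# sketch compares header names case-insensitively (an assumption).
IGNORED = {"dkim-signature", "date", "user-agent", "x-spam-status"}


def headers_to_analyze(raw_message, ignored=IGNORED):
    """Return the (name, value) pairs that remain once ignored headers
    are removed, in their original order."""
    msg = message_from_string(raw_message)
    return [(name, value) for name, value in msg.items()
            if name.lower() not in ignored]
```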
 +
 +==== - dspam.conf ====
 +Your final configuration file should look like the listing below. Many options are configurable,​ but for a quick overview, this configuration is functional. 
 +Note that we are using the Hash Driver. If you want to use another backend, you need to edit this configuration.
 +<​file>​
 +Home /​var/​spool/​dspam/​
 +StorageDriver /​usr/​lib/​dspam/​libhash_drv.so
 +TrustedDeliveryAgent "/​usr/​bin/​procmail"​
 +DeliveryHost ​           127.0.0.1
 +DeliveryPort ​           10034
 +DeliveryIdent ​          ​localhost
 +DeliveryProto ​          SMTP
 +OnFail error
 +Trust root
 +Trust dspam
 +TrainingMode teft
 +TestConditionalTraining on
 +Feature whitelist
 +Feature tb=5
 +Algorithm graham burton
 +Tokenizer osb
 +Pvalue bcr
 +WebStats on
 +Preference "​trainingMode=TEFT"​
 +Preference "​spamAction=tag"​
 +Preference "​spamSubject=[SPAM]"​
 +Preference "​statisticalSedation=5"​
 +Preference "​enableBNR=on"​
 +Preference "​enableWhitelist=on"​
 +Preference "​signatureLocation=message"​
 +Preference "​tagSpam=on"​
 +Preference "​tagNonspam=off"​
 +Preference "​showFactors=on"​
 +Preference "​optIn=off"​
 +Preference "​optOut=off"​
 +Preference "​whitelistThreshold=20"​
 +Preference "​makeCorpus=off"​
 +Preference "​storeFragments=off"​
 +Preference "​localStore="​
 +Preference "​processorBias=on"​
 +Preference "​fallbackDomain=off"​
 +Preference "​trainPristine=off"​
 +Preference "​optOutClamAV=off"​
 +Preference "​ignoreRBLLookups=off"​
 +Preference "​RBLInoculate=off"​
 +Preference "​notifications=on"​
 +AllowOverride enableBNR
 +AllowOverride enableWhitelist
 +AllowOverride fallbackDomain
 +AllowOverride ignoreGroups
 +AllowOverride ignoreRBLLookups
 +AllowOverride localStore
 +AllowOverride makeCorpus
 +AllowOverride optIn
 +AllowOverride optOut
 +AllowOverride optOutClamAV
 +AllowOverride processorBias
 +AllowOverride RBLInoculate
 +AllowOverride showFactors
 +AllowOverride signatureLocation
 +AllowOverride spamAction
 +AllowOverride spamSubject
 +AllowOverride statisticalSedation
 +AllowOverride storeFragments
 +AllowOverride tagNonspam
 +AllowOverride tagSpam
 +AllowOverride trainPristine
 +AllowOverride trainingMode
 +AllowOverride whitelistThreshold
 +AllowOverride dailyQuarantineSummary
 +AllowOverride notifications
 +HashRecMax ​             6291469
 +HashAutoExtend ​         on
 +HashMaxExtents ​         10000000
 +HashExtentSize ​         49157
 +HashPctIncrease ​        10
 +HashMaxSeek ​            10
 +HashConnectionCache ​    10
 +Notifications ​  on
 +IgnoreHeader Accept-Language
 +IgnoreHeader Approved
 +IgnoreHeader Archive
 +IgnoreHeader Authentication-Results
 +IgnoreHeader Cache-Post-Path
 +IgnoreHeader Cancel-Key
 +IgnoreHeader Cancel-Lock
 +IgnoreHeader Complaints-To
 +IgnoreHeader Content-Description
 +IgnoreHeader Content-Disposition
 +IgnoreHeader Content-ID
 +IgnoreHeader Content-Language
 +IgnoreHeader Content-Return
 +IgnoreHeader Content-Transfer-Encoding
 +IgnoreHeader Content-Type
 +IgnoreHeader DKIM-Signature
 +IgnoreHeader Date
 +IgnoreHeader Disposition-Notification-To
 +IgnoreHeader DomainKey-Signature
 +IgnoreHeader Importance
 +IgnoreHeader In-Reply-To
 +IgnoreHeader Injection-Info
 +IgnoreHeader Lines
 +IgnoreHeader List-Archive
 +IgnoreHeader List-Help
 +IgnoreHeader List-Id
 +IgnoreHeader List-Post
 +IgnoreHeader List-Subscribe
 +IgnoreHeader List-Unsubscribe
 +IgnoreHeader Message-ID
 +IgnoreHeader Message-Id
 +IgnoreHeader NNTP-Posting-Date
 +IgnoreHeader NNTP-Posting-Host
 +IgnoreHeader Newsgroups
 +IgnoreHeader OpenPGP
 +IgnoreHeader Organization
 +IgnoreHeader Originator
 +IgnoreHeader PGP-ID
 +IgnoreHeader Path
 +IgnoreHeader Received
 +IgnoreHeader Received-SPF
 +IgnoreHeader References
 +IgnoreHeader Reply-To
 +IgnoreHeader Resent-Date
 +IgnoreHeader Resent-From
 +IgnoreHeader Resent-Message-ID
 +IgnoreHeader Thread-Index
 +IgnoreHeader Thread-Topic
 +IgnoreHeader User-Agent
 +IgnoreHeader X--MailScanner-SpamCheck
 +IgnoreHeader X-AV-Scanned
 +IgnoreHeader X-AVAS-Spam-Level
 +IgnoreHeader X-AVAS-Spam-Score
 +IgnoreHeader X-AVAS-Spam-Status
 +IgnoreHeader X-AVAS-Spam-Symbols
 +IgnoreHeader X-AVAS-Virus-Status
 +IgnoreHeader X-AVK-Virus-Check
 +IgnoreHeader X-Abuse
 +IgnoreHeader X-Abuse-Contact
 +IgnoreHeader X-Abuse-Info
 +IgnoreHeader X-Abuse-Management
 +IgnoreHeader X-Abuse-To
 +IgnoreHeader X-Abuse-and-DMCA-Info
 +IgnoreHeader X-Accept-Language
 +IgnoreHeader X-Admission-MailScanner-SpamCheck
 +IgnoreHeader X-Admission-MailScanner-SpamScore
 +IgnoreHeader X-Amavis-Alert
 +IgnoreHeader X-Amavis-Hold
 +IgnoreHeader X-Amavis-Modified
 +IgnoreHeader X-Amavis-OS-Fingerprint
 +IgnoreHeader X-Amavis-PenPals
 +IgnoreHeader X-Amavis-PolicyBank
 +IgnoreHeader X-AntiVirus
 +IgnoreHeader X-Antispam
 +IgnoreHeader X-Antivirus
 +IgnoreHeader X-Antivirus-Scanner
 +IgnoreHeader X-Antivirus-Status
 +IgnoreHeader X-Archive
 +IgnoreHeader X-Assp-Spam-Prob
 +IgnoreHeader X-Attention
 +IgnoreHeader X-BTI-AntiSpam
 +IgnoreHeader X-Barracuda
 +IgnoreHeader X-Barracuda-Bayes
 +IgnoreHeader X-Barracuda-Spam-Flag
 +IgnoreHeader X-Barracuda-Spam-Report
 +IgnoreHeader X-Barracuda-Spam-Score
 +IgnoreHeader X-Barracuda-Spam-Status
 +IgnoreHeader X-Barracuda-Virus-Scanned
 +IgnoreHeader X-BeenThere
 +IgnoreHeader X-Bogosity
 +IgnoreHeader X-Brightmail-Tracker
 +IgnoreHeader X-CRM114-CacheID
 +IgnoreHeader X-CRM114-Status
 +IgnoreHeader X-CRM114-Version
 +IgnoreHeader X-CTASD-IP
 +IgnoreHeader X-CTASD-RefID
 +IgnoreHeader X-CTASD-Sender
 +IgnoreHeader X-Cache
 +IgnoreHeader X-ClamAntiVirus-Scanner
 +IgnoreHeader X-Comment-To
 +IgnoreHeader X-Comments
 +IgnoreHeader X-Complaints
 +IgnoreHeader X-Complaints-Info
 +IgnoreHeader X-Complaints-To
 +IgnoreHeader X-DKIM
 +IgnoreHeader X-DMCA-Complaints-To
 +IgnoreHeader X-DMCA-Notifications
 +IgnoreHeader X-Despammed-Tracer
 +IgnoreHeader X-ELTE-SpamCheck
 +IgnoreHeader X-ELTE-SpamCheck-Details
 +IgnoreHeader X-ELTE-SpamScore
 +IgnoreHeader X-ELTE-SpamVersion
 +IgnoreHeader X-ELTE-VirusStatus
 +IgnoreHeader X-Enigmail-Supports
 +IgnoreHeader X-Enigmail-Version
 +IgnoreHeader X-Evolution-Source
 +IgnoreHeader X-Extra-Info
 +IgnoreHeader X-FSFE-MailScanner
 +IgnoreHeader X-FSFE-MailScanner-From
 +IgnoreHeader X-Face
 +IgnoreHeader X-Fellowship-MailScanner
 +IgnoreHeader X-Fellowship-MailScanner-From
 +IgnoreHeader X-Forwarded
 +IgnoreHeader X-GMX-Antispam
 +IgnoreHeader X-GMX-Antivirus
 +IgnoreHeader X-GPG-Fingerprint
 +IgnoreHeader X-GPG-Key-ID
 +IgnoreHeader X-GPS-DegDec
 +IgnoreHeader X-GPS-MGRS
 +IgnoreHeader X-GWSPAM
 +IgnoreHeader X-Gateway
 +IgnoreHeader X-Greylist
 +IgnoreHeader X-HTMLM
 +IgnoreHeader X-HTMLM-Info
 +IgnoreHeader X-HTMLM-Score
 +IgnoreHeader X-HTTP-Posting-Host
 +IgnoreHeader X-HTTP-UserAgent
 +IgnoreHeader X-HTTP-Via
 +IgnoreHeader X-Headers-End
 +IgnoreHeader X-ID
 +IgnoreHeader X-IMAIL-SPAM-STATISTICS
 +IgnoreHeader X-IMAIL-SPAM-URL-DBL
 +IgnoreHeader X-IMAIL-SPAM-VALFROM
 +IgnoreHeader X-IMAIL-SPAM-VALHELO
 +IgnoreHeader X-IMAIL-SPAM-VALREVDNS
 +IgnoreHeader X-Info
 +IgnoreHeader X-IronPort-Anti-Spam-Filtered
 +IgnoreHeader X-IronPort-Anti-Spam-Result
 +IgnoreHeader X-KSV-Antispam
 +IgnoreHeader X-Kaspersky-Antivirus
 +IgnoreHeader X-MDAV-Processed
 +IgnoreHeader X-MDRemoteIP
 +IgnoreHeader X-MDaemon-Deliver-To
 +IgnoreHeader X-MIE-MailScanner-SpamCheck
 +IgnoreHeader X-MIMEOLE
 +IgnoreHeader X-MIMETrack
 +IgnoreHeader X-MMS-Spam-Filter-ID
 +IgnoreHeader X-MS-Exchange-Forest-RulesExecuted
 +IgnoreHeader X-MS-Exchange-Organization-Antispam-Report
 +IgnoreHeader X-MS-Exchange-Organization-AuthAs
 +IgnoreHeader X-MS-Exchange-Organization-AuthDomain
 +IgnoreHeader X-MS-Exchange-Organization-AuthMechanism
 +IgnoreHeader X-MS-Exchange-Organization-AuthSource
 +IgnoreHeader X-MS-Exchange-Organization-Journal-Report
 +IgnoreHeader X-MS-Exchange-Organization-Original-Scl
 +IgnoreHeader X-MS-Exchange-Organization-Original-Sender
 +IgnoreHeader X-MS-Exchange-Organization-OriginalArrivalTime
 +IgnoreHeader X-MS-Exchange-Organization-OriginalSize
 +IgnoreHeader X-MS-Exchange-Organization-PCL
 +IgnoreHeader X-MS-Exchange-Organization-Quarantine
 +IgnoreHeader X-MS-Exchange-Organization-SCL
 +IgnoreHeader X-MS-Exchange-Organization-SenderIdResult
 +IgnoreHeader X-MS-Has-Attach
 +IgnoreHeader X-MS-TNEF-Correlator
 +IgnoreHeader X-MSMail-Priority
 +IgnoreHeader X-MailScanner
 +IgnoreHeader X-MailScanner-Information
 +IgnoreHeader X-MailScanner-SpamCheck
 +IgnoreHeader X-Mailer
 +IgnoreHeader X-Mailman-Version
 +IgnoreHeader X-Mlf-Spam-Status
 +IgnoreHeader X-NAI-Spam-Checker-Version
 +IgnoreHeader X-NAI-Spam-Flag
 +IgnoreHeader X-NAI-Spam-Level
 +IgnoreHeader X-NAI-Spam-Report
 +IgnoreHeader X-NAI-Spam-Route
 +IgnoreHeader X-NAI-Spam-Rules
 +IgnoreHeader X-NAI-Spam-Score
 +IgnoreHeader X-NAI-Spam-Threshold
 +IgnoreHeader X-NEWT-spamscore
 +IgnoreHeader X-NNTP-Posting-Date
 +IgnoreHeader X-NNTP-Posting-Host
 +IgnoreHeader X-NetcoreISpam1-ECMScanner
 +IgnoreHeader X-NetcoreISpam1-ECMScanner-From
 +IgnoreHeader X-NetcoreISpam1-ECMScanner-Information
 +IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamCheck
 +IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamScore
 +IgnoreHeader X-Newsreader
 +IgnoreHeader X-Newsserver
 +IgnoreHeader X-No-Archive
 +IgnoreHeader X-No-Spam
 +IgnoreHeader X-OSBF-Lua-Score
 +IgnoreHeader X-OWM-SpamCheck
 +IgnoreHeader X-OWM-VirusCheck
 +IgnoreHeader X-Olypen-Virus
 +IgnoreHeader X-Orig-Path
 +IgnoreHeader X-OriginalArrivalTime
 +IgnoreHeader X-Originating-IP
 +IgnoreHeader X-PAA-AntiVirus
 +IgnoreHeader X-PAA-AntiVirus-Message
 +IgnoreHeader X-PGP-Fingerprint
 +IgnoreHeader X-PGP-Hash
 +IgnoreHeader X-PGP-ID
 +IgnoreHeader X-PGP-Key
 +IgnoreHeader X-PGP-Key-Fingerprint
 +IgnoreHeader X-PGP-KeyID
 +IgnoreHeader X-PGP-Sig
 +IgnoreHeader X-PIRONET-NDH-MailScanner-SpamCheck
 +IgnoreHeader X-PIRONET-NDH-MailScanner-SpamScore
 +IgnoreHeader X-PMX
 +IgnoreHeader X-PMX-Version
 +IgnoreHeader X-PN-SPAMFiltered
 +IgnoreHeader X-Posting-Agent
 +IgnoreHeader X-Posting-ID
 +IgnoreHeader X-Posting-IP
 +IgnoreHeader X-Priority
 +IgnoreHeader X-Proofpoint-Spam-Details
 +IgnoreHeader X-Qmail-Scanner-1.25st
 +IgnoreHeader X-Quarantine-ID
 +IgnoreHeader X-RAV-AntiVirus
 +IgnoreHeader X-RITmySpam
 +IgnoreHeader X-RITmySpam-IP
 +IgnoreHeader X-RITmySpam-Spam
 +IgnoreHeader X-Rc-Spam
 +IgnoreHeader X-Rc-Virus
 +IgnoreHeader X-Received-Date
 +IgnoreHeader X-RedHat-Spam-Score
 +IgnoreHeader X-RedHat-Spam-Warning
 +IgnoreHeader X-RegEx
 +IgnoreHeader X-RegEx-Score
 +IgnoreHeader X-Rocket-Spam
 +IgnoreHeader X-SA-GROUP
 +IgnoreHeader X-SA-RECEIPTSTATUS
 +IgnoreHeader X-STA-NotSpam
 +IgnoreHeader X-STA-Spam
 +IgnoreHeader X-Scam-grey
 +IgnoreHeader X-Scanned-By
 +IgnoreHeader X-Sender
 +IgnoreHeader X-SenderID
 +IgnoreHeader X-Sohu-Antivirus
 +IgnoreHeader X-Spam
 +IgnoreHeader X-Spam-ASN
 +IgnoreHeader X-Spam-Check
 +IgnoreHeader X-Spam-Checked-By
 +IgnoreHeader X-Spam-Checker
 +IgnoreHeader X-Spam-Checker-Version
 +IgnoreHeader X-Spam-Clean
 +IgnoreHeader X-Spam-DCC
 +IgnoreHeader X-Spam-Details
 +IgnoreHeader X-Spam-Filter
 +IgnoreHeader X-Spam-Filtered
 +IgnoreHeader X-Spam-Flag
 +IgnoreHeader X-Spam-Level
 +IgnoreHeader X-Spam-OrigSender
 +IgnoreHeader X-Spam-Pct
 +IgnoreHeader X-Spam-Prev-Subject
 +IgnoreHeader X-Spam-Processed
 +IgnoreHeader X-Spam-Pyzor
 +IgnoreHeader X-Spam-Rating
 +IgnoreHeader X-Spam-Report
 +IgnoreHeader X-Spam-Scanned
 +IgnoreHeader X-Spam-Score
 +IgnoreHeader X-Spam-Status
 +IgnoreHeader X-Spam-Tagged
 +IgnoreHeader X-Spam-Tests
 +IgnoreHeader X-Spam-Tests-Failed
 +IgnoreHeader X-Spam-Virus
 +IgnoreHeader X-Spam-Warning
 +IgnoreHeader X-Spam-detection-level
 +IgnoreHeader X-SpamAssassin-Clean
 +IgnoreHeader X-SpamAssassin-Warning
 +IgnoreHeader X-SpamBouncer
 +IgnoreHeader X-SpamCatcher-Score
 +IgnoreHeader X-SpamCop-Checked
 +IgnoreHeader X-SpamCop-Disposition
 +IgnoreHeader X-SpamCop-Whitelisted
 +IgnoreHeader X-SpamDetected
 +IgnoreHeader X-SpamInfo
 +IgnoreHeader X-SpamPal
 +IgnoreHeader X-SpamPal-Timeout
 +IgnoreHeader X-SpamReason
 +IgnoreHeader X-SpamScore
 +IgnoreHeader X-SpamTest-Categories
 +IgnoreHeader X-SpamTest-Info
 +IgnoreHeader X-SpamTest-Method
 +IgnoreHeader X-SpamTest-Status
 +IgnoreHeader X-SpamTest-Version
 +IgnoreHeader X-Spamadvice
 +IgnoreHeader X-Spamarrest-noauth
 +IgnoreHeader X-Spamarrest-speedcode
 +IgnoreHeader X-Spambayes-Classification
 +IgnoreHeader X-Spamcount
 +IgnoreHeader X-Spamsensitivity
 +IgnoreHeader X-TERRACE-SPAMMARK
 +IgnoreHeader X-TERRACE-SPAMRATE
 +IgnoreHeader X-TM-AS-Category-Info
 +IgnoreHeader X-TM-AS-MatchedID
 +IgnoreHeader X-TM-AS-Product-Ver
 +IgnoreHeader X-TM-AS-Result
 +IgnoreHeader X-TMWD-Spam-Summary
 +IgnoreHeader X-TNEFEvaluated
 +IgnoreHeader X-Text-Classification
 +IgnoreHeader X-Text-Classification-Data
 +IgnoreHeader X-Trace
 +IgnoreHeader X-UCD-Spam-Score
 +IgnoreHeader X-User-Agent
 +IgnoreHeader X-User-ID
 +IgnoreHeader X-User-System
 +IgnoreHeader X-Virus-Check
 +IgnoreHeader X-Virus-Checked
 +IgnoreHeader X-Virus-Checker-Version
 +IgnoreHeader X-Virus-Scan
 +IgnoreHeader X-Virus-Scanned
 +IgnoreHeader X-Virus-Scanner
 +IgnoreHeader X-Virus-Scanner-Result
 +IgnoreHeader X-Virus-Status
 +IgnoreHeader X-VirusChecked
 +IgnoreHeader X-Virusscan
 +IgnoreHeader X-WSS-ID
 +IgnoreHeader X-WinProxy-AntiVirus
 +IgnoreHeader X-WinProxy-AntiVirus-Message
 +IgnoreHeader X-Yandex-Forward
 +IgnoreHeader X-Yandex-Front
 +IgnoreHeader X-Yandex-Spam
 +IgnoreHeader X-Yandex-TimeMark
 +IgnoreHeader X-cid
 +IgnoreHeader X-iHateSpam-Checked
 +IgnoreHeader X-iHateSpam-Quarantined
 +IgnoreHeader X-policyd-weight
 +IgnoreHeader X-purgate
 +IgnoreHeader X-purgate-Ad
 +IgnoreHeader X-purgate-ID
 +IgnoreHeader X-sgxh1
 +IgnoreHeader X-to-viruscore
 +IgnoreHeader Xref
 +IgnoreHeader acceptlanguage
 +IgnoreHeader thread-index
 +IgnoreHeader x-uscspam
 +PurgeSignatures 14
 +PurgeNeutral ​   90
 +PurgeUnused ​    90
 +PurgeHapaxes ​   30
 +PurgeHits1S ​    15
 +PurgeHits1I ​    15
 +LocalMX 127.0.0.1
 +SystemLog ​      on
 +UserLog ​        on
 +Opt out
 +ServerHost ​             127.0.0.1
 +ServerPort ​             10033
 +ServerQueueSize 32
 +ServerPID ​              /​var/​run/​dspam.pid
 +ServerMode auto
 +ServerParameters ​       "​--deliver=innocent,​spam -d %u"
 +ServerIdent ​            "​localhost.localdomain"​
 +ProcessorURLContext on
 +ProcessorBias on
 +StripRcptDomain off
 +</​file>​
 +===== - A quick test that will not work =====
 +To start the daemon as user '​dspam',​ the Debian standard method is to use start-stop-daemon,​ as follows: 
 +<​code>​
 +# start-stop-daemon --start --chuid dspam --exec /​usr/​bin/​dspam -- --daemon
 +</​code>​
 +<​note>​
 +DSPAM automatically creates its PID file in /var/run. Make sure the user dspam can write to this directory.
 +</​note>​
 +We get a process started and a listening port:
 +<​code>​
 +UID        PID  PPID  C STIME TTY          TIME CMD
 +dspam    27473     ​1 ​ 0 03:26 pts/0    00:00:00 /​usr/​bin/​dspam --daemon
 +Proto Recv-Q Send-Q Local Address    Foreign Address  State   User  Inode  PID/Program name
 +tcp        0      0 127.0.0.1:10033  0.0.0.0:*        LISTEN  999   18244  27473/dspam
 +</​code>​
 +The daemon responds on this port, so we can see what happens when we try to send an email:
 +<​code>​
 +$ nc localhost 10033
 +220 DSPAM LMTP 3.9.1 Ready
 +lhlo mail
 +250-localhost.localdomain
 +250-PIPELINING
 +250-ENHANCEDSTATUSCODES
 +250-8BITMIME
 +250 SIZE
 +mail from:<​jp.troll@gmail.com>​
 +250 2.1.0 OK
 +rcpt to:<​jean-kevin@debian.lab>​
 +250 2.1.5 OK
 +data
 +354 Enter mail, end with "." on a line by itself
 +From: Jean-Pierre Troll <​jp.troll@gmail.com>​
 +To: Jean-Kevin De La Motte <​jean-kevin@debian.lab>​
 +Subject: This is Not a Spam
 +might be a troll, but a spam... no!
 +.
 +421 4.3.0 <jean-kevin@debian.lab> Unable to connect to server
 +quit
 +221 2.0.0 OK
 +</​code>​
 +DSPAM accepts our message but seems to have trouble sending it back to the SMTP server, which is quite normal because we have not configured Postfix yet. 
 +However, let's take a look at the home directory of DSPAM. It has created a tree for the user in **/​var/​spool/​dspam/​data/​debian.lab/​jean-kevin/​**:​
 +<​code>​
 +# tree -s
 +.
 +|-- [         23]  data
 +|   `-- [         23]  debian.lab
 +|       `-- [        114]  jean-kevin
 +|           |-- [  100663544]  jean-kevin.css
 +|           |-- [          0]  jean-kevin.lock
 +|           |-- [         85]  jean-kevin.log
 +|           |-- [         40]  jean-kevin.sig
 +|           |   `-- [        384]  4c873bcd274731106759975.sig
 +|           `-- [         12]  jean-kevin.stats
 +|-- [          6]  log
 +`-- [        115]  system.log
 +</​code>​
 +Looking more closely at these files, 'jean-kevin.css' is the user's hash file. Its size, about 100MB, follows from the hash file settings in **dspam.conf**.
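
As a quick check of where this size comes from, the arithmetic below relates 'HashRecMax' to the size reported by tree. The 16 bytes per record and the 40-byte header are assumptions inferred from the two numbers shown in this article, not figures taken from the DSPAM documentation; they may differ on other platforms or versions.

```python
HASH_REC_MAX = 6291469      # HashRecMax from the dspam.conf listing
OBSERVED_SIZE = 100663544   # size of jean-kevin.css reported by 'tree -s'

RECORD_BYTES = 16           # assumption, inferred from the numbers above
HEADER_BYTES = 40           # assumption, inferred from the numbers above


def expected_css_size(records, record_bytes=RECORD_BYTES, header_bytes=HEADER_BYTES):
    """Size of a hash file holding `records` fixed-size records plus a header."""
    return records * record_bytes + header_bytes


print(expected_css_size(HASH_REC_MAX) == OBSERVED_SIZE)
```

Under these assumptions the computed size matches the observed one exactly, which suggests the file is fully allocated when it is created.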
 +
 +Then, the file '​jean-kevin.log'​ contains a log of processed messages. There,​ we find traces of our message:
 +<​code>​
 +# cat jean-kevin.log 
 +1283931466 I Jean-Pierre Troll <​jp.troll@gmail.com>​ 4c873d4a274731062016872 This Is Not A Spam Delivered 
 +</​code>​
 +Each row has six columns: a Unix timestamp, a classification code (I for innocent, W for whitelisted ...), the sender's name and email, an email identifier (the DSPAM signature), the message subject and finally the DSPAM status. In this example, the message is marked 'Delivered' because, although DSPAM could not connect to Postfix, the message was considered valid.
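
The column layout above can be turned into a small parser, for example to build reports from a user's log. This is a sketch under the assumption that the columns are separated by tab characters; the names `LogEntry` and `parse_log_line` are made up for this example. Any column beyond the sixth is kept in `extra`.

```python
from collections import namedtuple
from datetime import datetime, timezone

LogEntry = namedtuple("LogEntry", "when code sender signature subject status extra")


def parse_log_line(line):
    """Split one row of <user>.log into its columns (tab-separated, assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated fields")
    # The first column is a Unix timestamp; convert it to an aware datetime.
    when = datetime.fromtimestamp(int(fields[0]), tz=timezone.utc)
    return LogEntry(when, fields[1], fields[2], fields[3],
                    fields[4], fields[5], fields[6:])
```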
 +
 +When jean-kevin wants to re-train a message as spam or ham, DSPAM will take the signature, look for a file with this name in '​jean-kevin.sig',​ and update '​jean-kevin.css'​ with the tokens contained within the file. 
 +
 +This DSPAM configuration is functional; we can now configure the communication with Postfix.
 +
 +===== - Configure Postfix to connect with DSPAM=====
 +Postfix has a generic method for communicating with software such as DSPAM, which is to treat it as a content filter. Postfix can very easily forward a received message to a content filter configured in the master.cf file.
 +
 +On a blank configuration of Postfix, you can add the content filter directly to the main smtp service (the one that listens on port TCP/25). For this, we must modify /etc/postfix/master.cf like this:
 +<​file>​
 +# Postfix master process configuration file.  For details on the format
 +# of the file, see the master(5) manual page (command: "man 5 master").
 +#
 +# ==========================================================================
 +# service type  private unpriv  chroot  wakeup  maxproc command + args
 +#               (yes)   (yes)   (yes)   (never) (100)
 +# ==========================================================================
 +smtp      inet  n       ​- ​      ​- ​      ​- ​      ​- ​      smtpd
 +      -o content_filter=lmtp:​127.0.0.1:​10033
 +</​file>​
 +This suffices to have Postfix send incoming emails to DSPAM. However,​ to configure the way back, we have to open a new service in master.cf that listens on port TCP/​10034. This time add the new lines at the end of master.cf. 
 +<​file>​
 +127.0.0.1:​10034 inet n  -       ​n ​      ​- ​       -      smtpd
 +      -o content_filter=
 +      -o receive_override_options=no_unknown_recipient_checks,​no_header_body_checks
 +      -o smtpd_helo_restrictions=
 +      -o smtpd_client_restrictions=
 +      -o smtpd_sender_restrictions=
 +      -o smtpd_recipient_restrictions=permit_mynetworks,​reject
 +      -o mynetworks=127.0.0.0/​8
 +      -o smtpd_authorized_xforward_hosts=127.0.0.0/​8
 +</​file>​
 +Reload postfix with '​postfix reload'​. Receiving emails should now work. Repeat the previous test with netcat on localhost, and you should receive the message. To debug, check the following files (on Debian): 
 +  * /​var/​log/​mail.info contains all logs related to the processing of emails 
 +  * /​var/​spool/​dspam/​system.log contains the overall activity of DSPAM (one line per message processed) 
 +  * if you compiled with the debug mode, then set 'Debug *' in dspam.conf and you will get detailed logs in /​var/​spool/​dspam/​log/​
 +  * and, in the worst case scenario, use '​tcpdump -s 16436 -SvnXi lo tcp and port 10033' (or 10034) to listen to communication between Postfix and DSPAM
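
The netcat test can also be scripted, which is convenient when you repeat it while debugging. The sketch below uses Python's standard `smtplib.LMTP` client and assumes DSPAM listens on 127.0.0.1:10033 as configured in this article; the function names are chosen for this example only.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build the same small test message as in the netcat session."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "This is Not a Spam"
    msg.set_content("might be a troll, but a spam... no!")
    return msg


def send_via_lmtp(msg, host="127.0.0.1", port=10033):
    """Hand the message to the DSPAM daemon over LMTP.
    Returns the dict of refused recipients (empty on success)."""
    with smtplib.LMTP(host, port) as client:
        return client.send_message(msg)
```

Calling `send_via_lmtp(build_test_message("jp.troll@gmail.com", "jean-kevin@debian.lab"))` on the mail server should produce the same log entries as the manual session.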
 +
 +After the mail is passed from Postfix to DSPAM and back to Postfix, it should be received by the recipient as follows: 
 +<​file>​
 +From jp.troll@gmail.com  Wed Sep  8 04:02:27 2010
 +Return-Path:​ <​jp.troll@gmail.com>​
 +X-Original-To:​ jean-kevin@debian.lab
 +Delivered-To:​ jean-kevin@debian.lab
 +From: Jean-Pierre Troll <​jp.troll@gmail.com>​
 +To: Jean-Kevin De La Motte <​jean-kevin@debian.lab> ​
 +Subject: This is Not a Spam
 +Date: Wed,  8 Sep 2010 03:56:49 -0400 (EDT)
 +X-DSPAM-Result:​ Innocent
 +X-DSPAM-Processed:​ Wed Sep  8 04:02:27 2010
 +X-DSPAM-Confidence:​ 0.9899
 +X-DSPAM-Probability:​ 0.0000 ​
 +X-DSPAM-Signature:​ 4c874313289291828119542
 +
 +might be a troll, but a spam... no!
 +!DSPAM:​4c874313289291828119542!
 +</​file>​
 +The message is '​innocent',​ as described in '​X-DSPAM-Result'​.
 +
 +'​X-DSPAM-Probability'​ tells us the probability that the message is spam (the closer the value is to 1, the higher the probability of the message being spam).
 +
 +Finally, '​X-DSPAM-Confidence'​ indicates the confidence level of the filter.
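
These headers are easy to read from a script, for instance in a delivery filter or a reporting tool. The following sketch uses Python's standard `email` module; the function `dspam_verdict` and the shape of the returned dictionary are choices made for this example.

```python
from email import message_from_string


def _header(msg, name):
    """Return the stripped header value, or None if the header is absent."""
    value = msg[name]
    return value.strip() if value is not None else None


def dspam_verdict(raw_message):
    """Collect the X-DSPAM-* headers of a delivered message."""
    msg = message_from_string(raw_message)
    probability = _header(msg, "X-DSPAM-Probability")
    confidence = _header(msg, "X-DSPAM-Confidence")
    return {
        "result": _header(msg, "X-DSPAM-Result"),
        "probability": float(probability) if probability is not None else None,
        "confidence": float(confidence) if confidence is not None else None,
        "signature": _header(msg, "X-DSPAM-Signature"),
    }
```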
 +
 +If you want more details on the tests performed and the tokens included, enable the preference '​showFactors = on'. It's wordy, but instructive. Each token is then listed with the associated statistical value.
 +
 +<​file>​
 +X-DSPAM-Factors:​ 27,
 +To*La+#​+#​+kevin,​ 0.01000,
 +Subject*This+#​+#​+a,​ 0.01000,
 +To*La+#​+<​jean,​ 0.01000,
 +To*Kevin+#​+La,​ 0.01000,
 +To*Motte+<​jean,​ 0.01000
 +[...] 
 +</​file>​
 +The message body also contains the signature as "​!DSPAM:​ <​signature>​!"​. As mentioned previously, it is preferable to retain the signature in the body of the message because, in this way, it is not deleted when forwarding for training. The other option would be to place the signature in the headers only, but these are usually removed by user agents when a message is forwarded.
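
Finding the signature in a message body is a simple pattern match. The sketch below assumes the signature consists of hexadecimal digits only, as in the examples of this article; other storage backends may produce signatures with a different character set, so treat the pattern as an assumption.

```python
import re

# Pattern for "!DSPAM:<signature>!" with a hexadecimal signature (assumed).
SIGNATURE_RE = re.compile(r"!DSPAM:([0-9a-f]+)!")


def find_signature(body):
    """Return the DSPAM signature embedded in `body`, or None."""
    match = SIGNATURE_RE.search(body)
    return match.group(1) if match else None
```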
 +
 +===== - Managing false positives and false negatives =====
 +Obviously, you shouldn'​t expect DSPAM to get everything perfect right away. It must be fed and learn.
 +
 +First, it is possible to feed DSPAM via the command line using the signature of a message. We can report our previous email as spam via the command:
 +<​code>​
 +# dspam --source=error --class=spam --user jean-kevin@debian.lab --signature='4c874313289291828119542'
 +</​code>​
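If you need to retrain several messages from a script, the same command line can be assembled programmatically. This sketch only builds the argument list shown above and leaves the execution to you; the function name `retrain_command` is made up for this example.

```python
def retrain_command(user, signature, message_class):
    """Build the argument list for retraining one message by signature.
    `message_class` is 'spam' or 'innocent', as in the text."""
    if message_class not in ("spam", "innocent"):
        raise ValueError("message_class must be 'spam' or 'innocent'")
    return [
        "dspam",
        "--source=error",
        f"--class={message_class}",
        "--user", user,
        f"--signature={signature}",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host where dspam is installed, which avoids shell quoting issues.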
 +In the logs of the user, we will see that the message was '​retrained'​ based on the specified class: spam or innocent.
 +<​file>​
 +# tail -n 1 jean-kevin.log
 +1283934571 M <Not Specified>​ 4c874313289291828119542 <Not Specified>​ Retrained
 +</​file>​
 +This is certainly not the best solution when you have 15,000 users. It is possible to do better by forwarding spam to {spam|notspam}-<user>@<domain> (e.g. spam-jean-kevin@debian.lab), or through the web interface. Both leave control in the user's hands.
 +
 +==== - Learning in forward mode ====
 +Training in forward mode works as follows: when DSPAM inspects a message, it sets a signature in the message body. A user can then forward the same message to DSPAM indicating that it made the wrong decision. 
 +
 +For this to work, DSPAM needs two things: the message signature and the identity of the user.
 +
 +The signature allows DSPAM to find the message in its history and record the change of state. Without this signature, DSPAM cannot identify the message in its history. (Note: the history is kept for 14 days by default. This is set with 'PurgeSignatures'. More on that later.)
 +
 +The identity of the user can be deduced automatically by DSPAM, using the prefix and the user's email address from {spam|notspam}-<email address>. Our user 'jean-kevin@debian.lab' will have two aliases, 'spam-jean-kevin@debian.lab' and 'notspam-jean-kevin@debian.lab', dedicated to re-training.
 +
 +DSPAM can re-train automatically when an email is sent to those aliases. For each incoming message, it looks at the 'To:' header of the message, and if the address starts with spam- or notspam- it analyzes the content and triggers a 'retrain'. Configuring this function is simple and relies on the following three directives in 'dspam.conf':
 +<​file>​
 +ParseToHeaders on
 +ChangeModeOnParse on
 +ChangeUserOnParse full
 +</​file>​
 +The directive 'ParseToHeaders' tells DSPAM to parse the 'To:' header of the received email to determine whether the address contains the keywords {spam|notspam}. This 'To:' header belongs to the message content; do not confuse it with the SMTP command "rcpt to".
 +
 +With parsing enabled, DSPAM can change the mode of learning according to the first part of the '​To:'​ field. This is controlled by '​ChangeModeOnParse',​ which will enable the class '​spam'​ if the address is '​spam-*'​ and class '​innocent'​ if the address is '​notspam-*'​.
 +
 +Finally, 'ChangeUserOnParse' tells DSPAM that the remaining portion of the email address contains the ID of the DSPAM user. Setting it to 'full' tells DSPAM to take the user and domain as the identifier, for example 'jean-kevin@debian.lab'.
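
Taken together, the three directives amount to the following logic. This is a sketch of the behavior described above, written with Python's standard `email.utils`; it is not DSPAM's code, and the function name `retrain_target` is hypothetical.

```python
from email.utils import parseaddr

# Prefix of the alias mapped to the class used for retraining.
PREFIXES = {"spam-": "spam", "notspam-": "innocent"}


def retrain_target(to_header):
    """Return (class, user) deduced from a 'To:' header value,
    or None if the address is not a retraining alias."""
    _, address = parseaddr(to_header)
    for prefix, message_class in PREFIXES.items():
        if address.startswith(prefix):
            # Mode 'full': the rest of the address, domain included, is the user.
            return message_class, address[len(prefix):]
    return None
```

With this logic, a message addressed to 'spam-jean-kevin@debian.lab' is retrained as spam for the user 'jean-kevin@debian.lab', and an ordinary address is left alone.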
 +
 +We must now tell Postfix that users '​spam-jean-kevin@debian.lab'​ and '​notspam-jean-kevin@debian.lab'​ exist. In a production environment,​ you'll certainly have a SQL database or LDAP directory to manage aliases, but in our case, we will simply create two entries in /​etc/​aliases. This will be sufficient for testing.
 +<​code>​
 +# vim /​etc/​aliases
 +[...]
 +spam-jean-kevin:​ jean-kevin
 +notspam-jean-kevin:​ jean-kevin
 +# postalias /​etc/​aliases
 +</​code>​
 +We can now reconnect to Postfix via netcat and inject the same email as above, but this time addressed to the spam alias. Most headers can be left out; the important parts are the 'To:' header and the DSPAM signature at the end of the message body.
 +
 +<​code>​
 +$ nc localhost 25
 +220 debian.lab ESMTP Postfix (Debian/​GNU)
 +ehlo mail
 +250-debian.lab
 +250-PIPELINING
 +250-SIZE 10240000
 +250-VRFY
 +250-ETRN
 +250-STARTTLS
 +250-ENHANCEDSTATUSCODES
 +250-8BITMIME
 +250 DSN
 +mail from:<​jean-kevin@debian.lab>​
 +250 2.1.0 Ok
 +rcpt to:<​spam-jean-kevin@debian.lab>​
 +250 2.1.5 Ok
 +data
 +354 End data with <​CR><​LF>​.<​CR><​LF>​
 +From:  Jean-Kevin De La Motte <​jean-kevin@debian.lab> ​
 +To: <​spam-jean-kevin@debian.lab>​
 +Subject: This is Not a Spam
 +might be a troll, but a spam... no!
 +
 +!DSPAM:4c874313289291828119542!
 +.
 +250 2.0.0 Ok: queued as 42509114E28
 +quit
 +221 2.0.0 Bye
 +</​code>​
 +Now looking at the DSPAM logs for jean-kevin, we see that the message was '​retrained'​.
 +
 +<​file>​
 +1283936972      M       Jean-Kevin De La Motte <jean-kevin@debian.lab> 4c874313289291828119542 This is Not a Spam      Retrained <20100908090905.42509114E28@debian.lab>
 +</​file>​
 +DSPAM will then forward the message back to Postfix, where it will be delivered back to the user (the prefix is deleted). Text is, however, added at the end of the message informing the user that the message has been retrained.
 +
 +These information messages need to be created (they are not shipped with DSPAM), one for spam and one for ham. This can be done as follows:
 +<​code>​
 +# echo '​Scanned and tagged as SPAM by DSPAM on Debian.Lab'​ > /​var/​spool/​dspam/​txt/​msgtag.spam
 +
 +# echo '​Scanned and tagged as HAM by DSPAM on Debian.Lab'​ > /​var/​spool/​dspam/​txt/​msgtag.nonspam
 +</​code>​
 +
 +
 +==== - Training from the web interface ====
 +Using the web interface is necessary if the messages detected as spam are not sent to users but quarantined (Preferences "​spamAction = quarantine"​). Users must regularly check the interface to verify that no false positive is found in quarantine. Users can also use the interface to mark emails as spam or ham.
 +
 +DSPAM sources provide a directory named '​webui'​. This is a set of CGI scripts to control ​ DSPAM through a web interface. No surprise, it's written in Perl. To run it, you have to configure {apache,​lighttpd,​ nginx, ...} to run perl CGI scripts.
 +
 +<note>Documentation already exists for Apache and lighttpd, so we chose to describe the configuration for Nginx.</note>
 +
 +In fact, it's more complicated than that, because the CGI should be able to determine the identity of the user who connects. So,​ Nginx, in our case, will have to authenticate the user and forward their identity to DSPAM.
 +
 +Nginx does not know how to run external scripts. The only thing it can do is send queries to a FastCGI socket. So we need another program that stands between Nginx and the CGI scripts to execute them; this program is called 'fcgiwrap'.
 +
 +We will also need some Perl packages required by the DSPAM CGI scripts (for parsing HTML, displaying graphs with GD, etc.).
 +
 +Install the following packages:
 +<​code>​
 +# aptitude install nginx fcgiwrap libcgi-pm-perl libhtml-parser-perl libgd-graph-perl libgd-graph3d-perl
 +</​code>​
 +The DSPAM interface needs permissions to access '/​var/​spool/​dspam'​ for both reading and writing, since it will change preferences and state of the dictionaries. Since fcgiwrap will be the process executing the Perl scripts, we will launch it as user/group '​dspam'​.
 +
 +We will also give world write access to the fcgiwrap socket so nginx can write to it.
 +
 +<​note>​This is a test configuration,​ as the proverb says "Do not do this at home."</​note>​
 +
 +<​code>​
 +# vim /​etc/​init.d/​fcgiwrap
 +[...]
 +FCGI_USER="dspam"
 +FCGI_GROUP="dspam"
 +[...]
 +# /​etc/​init.d/​fcgiwrap restart
 +# chmod o+w /​var/​run/​fcgiwrap.socket
 +</​code>​
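Since a world-writable socket lets any local account talk to fcgiwrap, a tighter variant is to hand the socket to a group that Nginx belongs to. The following is only a minimal sketch: it assumes the Nginx worker runs as a member of 'www-data' (the usual Debian setup), and the helper names 'is_world_writable' and 'restrict_socket' are made up for illustration. As with the chmod above, the change is lost when fcgiwrap recreates its socket.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given path is writable by "others".
is_world_writable() {
    # find prints the path only if the "other write" bit (002) is set
    [ -n "$(find "$1" -prune -perm -002 -print 2>/dev/null)" ]
}

# Hypothetical helper: grant group access instead of world access.
# Assumption: the Nginx worker user is a member of the group given as $2.
restrict_socket() {
    chgrp "$2" "$1" && chmod 660 "$1"
}
```

For example, 'restrict_socket /var/run/fcgiwrap.socket www-data' followed by 'is_world_writable /var/run/fcgiwrap.socket || echo ok' would confirm that the socket is no longer open to everyone.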
 +The Nginx configuration is then easy: it just forwards CGI requests to fcgiwrap. It must also authenticate users so that DSPAM can determine the identity of the visitor. This identity is stored in the variable REMOTE_USER, set by Nginx and provided to fcgiwrap.
 +
 +<​code>​
 +
 +# vim /​etc/​nginx/​sites-available/​default
 +[...]
 +    location /dspam/cgi-bin {
 +        auth_basic            "DSPAM";
 +        auth_basic_user_file  /var/www/dspam/passwords;
 +        include /etc/nginx/fastcgi_params;
 +        index dspam.cgi;
 +        fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;
 +        fastcgi_param  REMOTE_USER     $remote_user;
 +        if ($uri ~ "\.cgi$") {
 +            fastcgi_pass  unix:/var/run/fcgiwrap.socket;
 +        }
 +    }
 +# /​etc/​init.d/​nginx restart
 +</​code>​
 +You must then create the file '/var/www/dspam/passwords' with the htpasswd tool. This file should contain one line per user, where the username is the user's complete email address.
 +
 +<​code>​
 +# htpasswd -c /​var/​www/​dspam/​passwords jean-kevin@debian.lab ​
 +New password:
 +Re-type new password:
 +Adding password for user jean-kevin@debian.lab
 +# cat /​var/​www/​dspam/​passwords
 +jean-kevin@debian.lab:​H2CigqsDz1U4E
 +# chown dspam:​www-data /​var/​www/​dspam/​passwords ​
 +# chmod o-rwx /var/www/dspam/passwords
 +</​code>​
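Because DSPAM identifies users by the name Nginx authenticated, an entry that is not a full email address will not map to the right account. The check below is a rough sketch of how the file could be verified; the helper name 'check_passwords_file' is hypothetical, and the pattern only tests for the 'user@domain:hash' shape described above, not for a valid hash.

```shell
#!/bin/sh
# Hypothetical helper: report lines of an htpasswd-style file whose user name
# is not a full e-mail address or whose hash field is empty.
check_passwords_file() {
    awk -F: '
        # expected shape: user@domain:hash
        $1 !~ /^[^@]+@[^@]+$/ || $2 == "" {
            printf "line %d: unexpected entry: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }   # non-zero exit status if any line was rejected
    ' "$1"
}
```

Running 'check_passwords_file /var/www/dspam/passwords' prints nothing and returns 0 when every entry looks right.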
 +The infrastructure is ready. Copy the files from the 'webui' directory of the DSPAM sources directly into the 'document root' of Nginx.
 +
 +<​code>​
 +# cp -r ~/​dspam-3.9.1-RC1/​webui/​* /​var/​www/​dspam/​
 +# chown dspam:​www-data /​var/​www/​dspam -R
 +</​code>​
 +At this stage, we still have some configuration to do. The script '/var/www/dspam/cgi-bin/configure.pl' tells the web interface where the DSPAM directories are. Check the values of $CONFIG{'DSPAM_HOME'},
 +$CONFIG{'DSPAM_BIN'}, etc., so that they correspond to our environment. 
 +<​file>​
 +$CONFIG{'DSPAM_HOME'}   = "/var/spool/dspam";
 +$CONFIG{'DSPAM_BIN'}    = "/usr/bin";
 +[...]
 +$CONFIG{'WEB_ROOT'}     = "/dspam/htdocs/";
 +[...]
 +$CONFIG{'LOCAL_DOMAIN'} = "debian.lab";
 +</​file>​
 +With all this, we should be able to open the page http://myserver/dspam/cgi-bin/, log in as jean-kevin@debian.lab, and access the DSPAM interface. 
 +It allows, among other things, re-training of messages already processed from the 'History' tab. You can also change the preferences, etc.
 +
 +
 +The interface provides an administration section. To have access to it, you need to declare an admin in the file '/var/www/dspam/cgi-bin/admins'.
 +<​code>​
 +# echo 'jean-kevin@debian.lab' >> /var/www/dspam/cgi-bin/admins
 +</​code>​
 +We can then access the URL http://​myserver/​dspam/​cgi-bin/​admin.cgi and admire the beautiful graphics work, or change the default options.
 +
 +Below are a few screenshots from the web interface:
 +
 +
 +{{ :​en:​ressources:​dossiers:​dspam_cgi_performance.png?​400 }}
 +//The home page of the interface displays performance statistics for the current user.//
 +
 +{{ :​en:​ressources:​dossiers:​dspam_cgi_history.png?​400 }}
 +//The message history lists all inspected messages. It can be used to retrain a message.//
 +
 +{{ :​en:​ressources:​dossiers:​dspam_cgi_14daysstats.png }}
 +//This is a 14-day graph for a user receiving a lot of spam.//
 +
 +===== - User management =====
 +
 +When using virtual UIDs, which is the most common method to manage large groups of users, the users are stored in the database. With PostgreSQL, you can browse the "dspam" database (as created previously) using standard SQL commands:
 +
 +<​code>​
 +postgres@server:/$ psql
 +psql (8.4.8)
 +Type "help" for help.
 +
 +postgres=# \c dspam
 +psql (8.4.8)
 +You are now connected to database "dspam".
 +
 +dspam=# \d
 +                   List of relations
 + Schema |          Name          |   Type   | Owner
 +--------+------------------------+----------+-------
 + public | dspam_preferences      | table    | dspam
 + public | dspam_signature_data   | table    | dspam
 + public | dspam_stats            | table    | dspam
 + public | dspam_token_data       | table    | dspam
 + public | dspam_virtual_uids     | table    | dspam
 + public | dspam_virtual_uids_seq | sequence | dspam
 +(6 rows)
 +
 +
 +dspam=# \d dspam_virtual_uids
 +                              Table "public.dspam_virtual_uids"
 +  Column  |          Type          |                          Modifiers
 +----------+------------------------+--------------------------------------------------------------
 + uid      | integer                | not null default nextval('dspam_virtual_uids_seq'::regclass)
 + username | character varying(128) |
 +Indexes:
 +    "dspam_virtual_uids_pkey" PRIMARY KEY, btree (uid)
 +    "id_virtual_uids_01" UNIQUE, btree (username)
 +    "id_virtual_uids_02" UNIQUE, btree (uid)
 +
 +</​code>​
 +
 +As you can see in the description of the database above, there is a table called **dspam_virtual_uids** that contains a simple mapping between each username and a generated id.
 +
 +Here is how you can obtain the UID of a specific user.
 +
 +<​code>​
 +dspam=# select * from dspam_virtual_uids where username = '​jean-kevin@debian.lab';​
 + uid |       ​username ​       ​
 +-----+-----------------------
 +   1 | jean-kevin@debian.lab
 +(1 row)
 +
 +</​code>​
 +
 +
 +==== - Deleting a user ====
 +
 +If, for any reason, you would like to remove a user from the database, you need to obtain their UID and then remove all rows referencing this UID from the other tables. 
 +
 +Every per-user table carries the UID of the user its rows belong to. If you look at the description of the table **dspam_preferences**, for example, you will see the uid column; tokens in **dspam_token_data** are attached to a UID in the same way, which makes a user's data extremely easy to identify and delete.
 +
 +<​code>​
 +dspam=# \d dspam_preferences
 +       Table "public.dspam_preferences"
 +   Column   |          Type          | Modifiers
 +------------+------------------------+-----------
 + uid        | integer                |
 + preference | character varying(128) |
 + value      | character varying(128) |
 +Indexes:
 +    "dspam_preferences_uid_key" UNIQUE, btree (uid, preference)
 +
 +
 +dspam=# delete from dspam_token_data where uid in (select uid from dspam_virtual_uids where username = 'jean-kevin@debian.lab');
 +DELETE 2187
 +
 +</​code>​
 +Repeat this step for all tables and delete the user from **dspam_virtual_uids** at the end.
 +
 +Also, make sure to delete the user's data folder from /​var/​spool/​dspam/​data/<​domain>/<​user>​ if you really want to remove all traces of the user.
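The repeated deletes can be generated rather than typed. This is a sketch under assumptions: the helper name 'delete_user_sql' is made up, and it assumes that each of the four tables listed by '\d' above has a uid column (the text shows this for dspam_preferences and dspam_token_data only). It only prints SQL and executes nothing, so the output can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL needed to remove one DSPAM user.
# Table names are the ones listed by \d above; nothing is executed here.
delete_user_sql() {
    # double any single quote so the name is a valid SQL string literal
    user=$(printf '%s' "$1" | sed "s/'/''/g")
    echo "BEGIN;"
    for table in dspam_token_data dspam_signature_data dspam_stats dspam_preferences; do
        echo "DELETE FROM $table WHERE uid IN (SELECT uid FROM dspam_virtual_uids WHERE username = '$user');"
    done
    # the mapping row goes last, since the subqueries above depend on it
    echo "DELETE FROM dspam_virtual_uids WHERE username = '$user';"
    echo "COMMIT;"
}
```

Once the statements look right, they could be piped to the database, for example with 'delete_user_sql jean-kevin@debian.lab | psql dspam'.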
 +
 +
 +===== - Group management and inoculation =====
 +While DSPAM analysis focuses on the user, it also enables groups of users to share data. By properly defining these groups, for example by activity, we can expect the content of their messages, and therefore their token statistics, to be similar. Sharing this information helps to accelerate DSPAM training.
 +
 +We saw that each user has their own dictionary of tokens in the file '<user>.css' (if using the Hash driver). This dictionary contains the tokens and associated statistics produced by the user.
 +
 +DSPAM can share these tokens and statistics in different ways (or types of groups).
 +
 +  * **Shared**: group members share the same dictionary, but each member retains their own quarantine directory. Problem: if one user's behavior differs from the rest of the group, it will disrupt the whole group, starting with that user.
 +
 +  * **Shared,Managed**: same as a shared group, but with a single quarantine mailbox.
 +
 +  * **Classification**: members share their individual dictionaries. If a user's own dictionary cannot determine whether a message is spam or innocent (confidence <0.65, or a dictionary containing less than 1000 innocent messages and 250 spam), the dictionaries of the other group members are used. The analysis stops as soon as one dictionary classifies the message. In practice, this group is a chain containing all users in the group, which is traversed linearly until a decision is reached. Each user must be listed as a member of the group in order to query the dictionaries of the other members.
 +
 +  * **Global**: an alternative to classification groups. This group type defines a global classification group whose listed members' dictionaries can be queried by every user of the system. If a user's dictionary is not sufficient to classify a message, DSPAM asks the opinion of the members of Global, traversing the chain of members until a formal decision is reached. In short, Global is a sort of "council of wise men" that each user can query.
 +
 +  * **Merged**: Merged assembles the user dictionary and the referenced dictionary into one new dictionary and uses it for analysis. New user-specific tokens are always written back to the user dictionary. Training the Merged group alone (without the members) will influence the accuracy for each Merged group member.
 +
 +  * **Inoculation**: This last group is somewhat unusual. It follows the principle of vaccination, and allows a user who has received an undetected spam to inform all other users that this message is spam. Thus, each user has their own dictionary, used exclusively for their own analysis, but users can exchange tokens with each other. The first user is infected, the others are vaccinated. This principle of inoculation also allows you to define a bin, a honeypot for spam, which receives only spam and will thus accelerate the learning for everyone. This second mode is called 'external inoculation'.
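The fallback rule of a classification group can be written down as a small test. This is an illustrative sketch, not DSPAM's own code: the helper name 'needs_group_opinion' is hypothetical, and it reads the thresholds quoted above as "the user's dictionary is insufficient when confidence is under 0.65, or when it has not yet reached 1000 innocent messages and 250 spam".

```shell
#!/bin/sh
# Hypothetical helper: succeed when a user's own dictionary is considered
# insufficient, using the thresholds quoted above for classification groups.
# Arguments: confidence, innocent messages learned, spam messages learned.
needs_group_opinion() {
    awk -v c="$1" -v i="$2" -v s="$3" 'BEGIN {
        # assumption: falling short on either message count is enough
        if (c < 0.65 || i < 1000 || s < 250) exit 0
        exit 1
    }'
}
```

For instance, 'needs_group_opinion 0.50 5000 5000' succeeds (low confidence), while 'needs_group_opinion 0.90 2000 500' fails, meaning the user's own dictionary is trusted.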
 +
 +==== - Setting up a group ====
 +Setting up a group is rather simple. The hardest part is to determine the correct group type for your environment, and then to monitor its behavior over several weeks.
 +
 +In our example we will implement a group of type "classification". Since this type allows each user to retain their personal dictionary, it has little impact on the infrastructure (in case you want to delete the group).
 +
 +DSPAM reads the group configuration from a text file located in its 'Home Directory'. For us, that would be '/var/spool/dspam/group'. The file contains one line per group in the form <groupName>:<type>:<user 1>,...,<user n>
 +
 +We will create a group of type '​classification'​ including users jean-kevin, julien and root, and we will call this group '​class-debian-lab'​.
 +
 +<​code>​
 +# echo "​class-debian-lab:​classification:​jean-kevin@debian.lab,​julien@debian.lab,​root@debian.lab"​ > /​var/​spool/​dspam/​group
 +# chown dspam:dspam /​var/​spool/​dspam/​group
 +# kill `pidof dspam`
 +# start-stop-daemon --start --chuid dspam --exec /​usr/​bin/​dspam -- --daemon
 +</​code>​
 +By enabling the debug trace in dspam.conf (Directive 'Debug *' when dspam is compiled with debug mode), we can see the group being used in the file '/​var/​spool/​dspam/​log/​dspam.debug'​.
 +<​file>​
 +10150: [09/08/2010 14:43:19] user jean-kevin@debian.lab is member of classification group class-debian-lab
 +10150: [09/08/2010 14:43:19] adding user julien@debian.lab to classification network group
 +10150: [09/08/2010 14:43:19] adding user root@debian.lab to classification network group
 +</​file>​
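Before restarting the daemon, the group file can be checked against the <groupName>:<type>:<members> layout described above. The sketch below uses a made-up helper name, 'check_group_file', and deliberately checks only the shape of each line; it does not validate the type keyword or the existence of the users.

```shell
#!/bin/sh
# Hypothetical helper: check that every line of a group file has the
# <groupName>:<type>:<user 1>,...,<user n> shape described above.
check_group_file() {
    awk -F: '
        /^[ \t]*$/ { next }                      # ignore blank lines
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: expected name:type:members\n", NR
            bad = 1
            next
        }
        {
            n = split($3, members, ",")          # members are comma separated
            printf "%s (%s): %d member(s)\n", $1, $2, n
        }
        END { exit bad + 0 }
    ' "$1"
}
```

It would be called as 'check_group_file /var/spool/dspam/group' and returns a non-zero status if any line does not fit.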
 +
 +===== - Maintenance =====
 +
 +==== - dspam_logrotate ====
 +This program provides log rotation for both system and DSPAM user logs (those stored in /​var/​spool/​dspam).
 +
 +The command can be run for a specific user or for all users in the dspam directory. In our case, we want to achieve rotation for everyone when logs exceed 60 days. We will therefore put the following in crontab:
 +<​file>​
 +30 5    * * *   ​dspam ​  /​usr/​bin/​dspam_logrotate -a 60 -d /​var/​spool/​dspam/​data/​
 +</​file>​
 +
 +==== - Hash Driver cleanup ====
 +DSPAM'​s hash driver stores a large amount of information,​ be it for tokens or history. It therefore provides a tool to do some cleaning. '​dspam_clean'​ will clean up the dictionaries using the parameters defined in dspam.conf. 
 +
 +=== - dspam_clean ===
 +The default configuration for 'dspam_clean' is to retain all signatures for 14 days and to clean the little-used tokens after 15, 30 or 90 days depending on their type. Again, the configuration file provided with the sources is rather well commented. 
 +<​file>​
 +#
 +# Purge configuration:​ Set dspam_clean purge default options, if not otherwise ​
 +# specified on the commandline
 +#
 +PurgeSignatures 14      # Stale signatures
 +PurgeNeutral ​   90      # Tokens with neutralish probabilities
 +PurgeUnused ​    ​90 ​     # Unused tokens
 +PurgeHapaxes ​   30      # Tokens with less than 5 hits (hapaxes)
 +PurgeHits1S ​    ​15 ​     # Tokens with only 1 spam hit
 +PurgeHits1I ​    ​15 ​     # Tokens with only 1 innocent hit
 +</​file>​
 +To achieve a periodic purge, run dspam_clean from cron as the dspam user. For example, with an entry in /etc/crontab that starts every day at 5:00 a.m.:
 +<​file>​
 +0  5    * * *   ​dspam ​  /​usr/​bin/​dspam_clean -s -p -u
 +</​file>​
 +This command purges the three types of information: stale signatures (-s), tokens with neutral probabilities (-p) and unused tokens (-u).
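To see which retention values a purge will actually use, the Purge directives can be read back from the configuration file. This is a small sketch with a hypothetical helper name, 'conf_value'; it assumes the simple "Directive value # comment" layout shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one directive from dspam.conf,
# ignoring comments. Usage: conf_value /etc/dspam/dspam.conf PurgeSignatures
conf_value() {
    awk -v key="$2" '
        { sub(/#.*/, "") }                  # strip comments, fields are re-split
        $1 == key { print $2; found = 1; exit }
        END { exit !found }                 # non-zero status if the key is absent
    ' "$1"
}
```

For example, 'conf_value /etc/dspam/dspam.conf PurgeHapaxes' would print the number of days configured for hapaxes.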
 +
 +==== - Databases cleanup ====
 +If you are using a database backend, and not the Hash driver, you need an external script to connect to the database and clean the tokens.
 +
 +The script **contrib/​dspam_maintenance/​dspam_maintenance.sh** is written to connect to any of the 3 types of database backend DSPAM supports, and perform that cleanup for you.
 +
 +'dspam_maintenance.sh' will read the Purge Configuration (as described above) from dspam.conf, connect to the backend and perform the cleanup. It requires an external set of queries for your database. These queries are database specific, and can be found in **src/tools.<backend>**
 +<​file>​
 +tools.mysql_drv
 +├── purge-4.1.sql
 +└── purge.sql
 +
 +tools.pgsql_drv
 +├── purge-pe.sql
 +└── purge.sql
 +
 +tools.sqlite_drv
 +├── purge-2.sql
 +└── purge-3.sql
 +</​file>​
 +
 +Copy the proper set of queries into **/var/spool/dspam** and give the permissions to user 'dspam'.
 +<​code>​
 +# cp -r tools.pgsql_drv/​ /​var/​spool/​dspam/​
 +# chown dspam:dspam /​var/​spool/​dspam/​tools.pgsql_drv/​ -R
 +</​code>​
 +
 +Now, copy the 'dspam_maintenance.sh' script to /etc/cron.daily/ (or cron.weekly if you prefer), and configure it as follows:
 +<​code>​
 +# cp contrib/​dspam_maintenance/​dspam_maintenance.sh /​etc/​cron.daily/​dspam_maintenance
 +# chmod +x /​etc/​cron.daily/​dspam_maintenance ​
 +# vim /​etc/​cron.daily/​dspam_maintenance ​
 +
 +[...]
 +
 +DSPAM_CONFIGDIR="/​etc/​dspam"​
 +DSPAM_HOMEDIR="/​var/​spool/​dspam/"​
 +DSPAM_PURGE_SCRIPT_DIR="/​var/​spool/​dspam/​tools.pgsql_drv/"​
 +DSPAM_BIN_DIR="/​usr/​bin"​
 +MYSQL_BIN_DIR="/​usr/​bin"​
 +PGSQL_BIN_DIR="/​usr/​bin"​
 +SQLITE_BIN_DIR="/​usr/​bin"​
 +SQLITE3_BIN_DIR="/​usr/​bin"​
 +
 +
 +[...]
 +</​code>​
 +<​note>​Remember that scripts in /etc/cron.* must not contain dots in their names (eg. no dspam_maintenance**.**sh,​ use dspam_maintenance)</​note>​
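The naming rule can be checked before relying on cron. The sketch below assumes Debian's run-parts default, which only runs files whose names consist of letters, digits, underscores and hyphens; the helper name 'cron_name_ok' is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a file name only uses the characters
# Debian's run-parts accepts by default (letters, digits, "_" and "-").
cron_name_ok() {
    case "$(basename "$1")" in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;   # empty, or contains another character
        *) return 0 ;;
    esac
}
```

On a Debian system, 'run-parts --test /etc/cron.daily' lists the scripts that would really be executed, which is a good cross-check.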
 +===== - Test Procedure =====
 +To conclude this section, we will demonstrate the test procedure specified in the README file of DSPAM. This procedure allows us to not only verify that the configuration is operational,​ but also to familiarize ourselves with the internal controls of DSPAM.
 +
 +**Step 1**: Create a blank user
 +<​code>​
 +# useradd -d /​home/​michel-rene -U -m michel-rene
 +# passwd michel-rene
 +</​code>​
 +**Step 2**: Send an email to our new user
 +<​code>​
 +# nc localhost 25 << EOF
 +ehlo mail
 +mail from:<​jp.troll@gmail.com>​
 +rcpt to:<​michel-rene@debian.lab>​
 +data
 +From: <​jp.troll@gmail.com>​
 +To: <​michel-rene@debian.lab>​
 +Subject: Cours message de test
 +
 +10 mots c'est pas assez long pour un troll.
 +.
 +quit
 +EOF
 +</​code>​
 +**Step 3**: Check the statistics of the user account with the command dspam_stats
 +<​code>​
 +# dspam_stats michel-rene@debian.lab
 +michel-rene@debian.lab ​ TP:     0 TN:     1 FP:     0 FN:     0 SC:     0 NC:     0
 +</​code>​
 +**Step 4**: Check the list of tokens and the associated probabilities via dspam_dump
 +<​code>​
 +# dspam_dump michel-rene@debian.lab
 +4311867737599848632 ​ S: 00000  I: 00001  P: 0.4000 LH: Wed Sep  8 21:20:22 2010
 +9486336444479993084 ​ S: 00000  I: 00001  P: 0.4000 LH: Wed Sep  8 21:20:22 2010
 +18360635214432484661 S: 00000  I: 00001  P: 0.4000 LH: Wed Sep  8 21:20:22 2010
 +[…]
 +</​code>​
 +These tokens are associated with an innocent message, which is why the value S (for Spam) is zero and the value I (for Innocent) is one. Also, take note that the tokenizer 'OSB' creates 114 tokens for this small message (a few headers have been added, however, by Postfix).
 +You can see the statistics associated with a particular token in the dictionary by entering its text at the command line. Obviously, with OSB as the tokenizer, the difficulty is knowing the original text of the token.
 +<​code>​
 +# dspam_dump michel-rene@debian.lab un+troll
 +1157728372545618534 ​ S: 00000  I: 00001  P: 0.4000
 +
 +# dspam_dump michel-rene@debian.lab assez+#​+#​+#​+troll
 +695260355258399736   S: 00000  I: 00001  P: 0.4000
 +</​code>​
 +**Step 5**: Mark the message as Spam, for example in the web interface.
 +
 +**Step 6**: Check the user's DSPAM statistics again:
 +<​code>​
 +# dspam_stats michel-rene@debian.lab
 +michel-rene@debian.lab ​ TP:     0 TN:     0 FP:     0 FN:     1 SC:     0 NC:     0
 +</​code>​
 +**Step 7**: Check the status of tokens again:
 +<​code>​
 +# dspam_dump michel-rene@debian.lab
 +4311867737599848632 ​ S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
 +9486336444479993084 ​ S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
 +18360635214432484661 S: 00001  I: 00000  P: 0.4000 LH: Wed Sep  8 21:28:31 2010
 +[…]
 +</​code>​
 +The update completed correctly: these tokens are now associated with spam (S is one, I is zero). 
 +
 +These few commands let us not only verify that our anti-spam setup is functional, but also follow the lifecycle of tokens over time.
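To follow tokens over time without reading the whole dump, the dspam_dump output can be condensed into a few counters. This is only a sketch: the helper name 'dump_summary' is hypothetical, and it assumes the line layout shown in steps 4 and 7 (token, then "S:", spam hits, "I:", innocent hits).

```shell
#!/bin/sh
# Hypothetical helper: summarise dspam_dump output read on stdin.
# Expects lines shaped like: <token> S: 00000 I: 00001 P: 0.4000 LH: ...
dump_summary() {
    awk '
        $2 == "S:" && $4 == "I:" {
            tokens++
            spam += $3              # zero-padded counters are read as decimal
            innocent += $5
        }
        END { printf "tokens=%d spam_hits=%d innocent_hits=%d\n", tokens, spam, innocent }
    '
}
```

A typical call would be 'dspam_dump michel-rene@debian.lab | dump_summary', run once before and once after retraining.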
 +
 +===== - Conclusion =====
 +
 +Our tour of DSPAM is complete. I have not really talked about success rates and the other criteria generally used to rank anti-spam solutions, for two reasons. First, such figures are often misleading: the results depend heavily on user behavior, so it is difficult to get reproducible numbers. Second, there is no real point, today, in building an infrastructure on a single anti-spam product. Integrating greylisting with Postfix is trivial, and it is even possible to chain SpamAssassin and DSPAM one behind the other (just call a SpamAssassin content filter after the message returns to Postfix from DSPAM).
 +
 +So in the end, the best way to fight spam is to use multiple techniques, and what we have seen in these pages is that DSPAM is a great tool for this work. It can be a bit difficult to pick up initially, but the results and the flexibility of the product are well worth the initial investment.
 +
 +**Julien Vehent, and the DSPAM team - 2011**
 +
 +
 +~~DISCUSSION:​off~~
en/ressources/dossiers/dspam.txt · Last modified: 2011/10/19 12:46 (external edit)
CC Attribution-Share Alike 4.0 International