I recently decided to compare the Bayes classifier in Rspamd with its closest analogues. I tested three filters:
- Rspamd (version 1.4, git master)
- Bogofilter - a classical Bayesian filter
- Dspam - the most advanced Bayesian filter, used by many projects and people
For Dspam, I tested two tokenization modes, osb among them. I also tried to test the chi-square probabilities combiner (since the same algorithm is used in Rspamd), but I could not get it working.
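For readers unfamiliar with the chi-square combiner mentioned above, here is a minimal sketch of the generic Fisher/Robinson chi-square combining scheme. It is not the actual code of Rspamd or Dspam; the function names `chi2q` and `combine` and the clamping constant `eps` are illustrative assumptions.

```python
import math


def chi2q(x2, dof):
    """Upper tail probability of the chi-square distribution.

    Only valid for an even number of degrees of freedom, which is
    always the case here (dof = 2 * number of tokens).
    """
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(token_probs, eps=1e-9):
    """Combine per-token spam probabilities into one message score.

    Returns a value in [0, 1]: near 1 means spam, near 0 means ham,
    and values around 0.5 mean the evidence is weak or contradictory.
    """
    # Clamp to avoid log(0); eps is an assumption of this sketch.
    probs = [min(max(p, eps), 1.0 - eps) for p in token_probs]
    n = len(probs)
    if n == 0:
        return 0.5
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    spamminess = 1.0 - chi2q(spam_stat, 2 * n)
    hamminess = 1.0 - chi2q(ham_stat, 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0
```

A useful property of this scheme is that strongly conflicting tokens cancel out and push the score towards 0.5 instead of producing a confident verdict.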
First of all, I collected a corpus of about 1k spam messages and 1k ham messages. All messages were carefully selected and manually checked. Then I wrote a small script that performs the following steps:
- Split the corpus randomly into two equal parts, each with about 500 ham messages and 500 spam messages.
- Learn bayes classifier using the desired spam filtering engine (
- Use the rest of the messages to test the classifier after the learning procedure.
- Use a 95% confidence factor for Dspam (i.e. when the probability of spam is less than 95%, consider the classifier to be in an undefined state). Bogofilter, in turn, automatically provides one of 3 results.
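The split and the confidence threshold from the steps above can be sketched roughly as follows. This is not the original script: the helper names `split_corpus` and `verdict` are hypothetical, and treating a spam probability of 5% or less as ham is an assumption that mirrors the 95% rule for spam.

```python
import random


def split_corpus(messages, seed=0):
    """Shuffle the messages and split them into two equal halves.

    Returns (train, test). The seed only makes the run repeatable.
    """
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def verdict(spam_probability, confidence=0.95):
    """Map a spam probability to 'spam', 'ham' or 'undefined'."""
    if spam_probability >= confidence:
        return "spam"
    # Assumption: the same confidence factor is applied to ham.
    if spam_probability <= 1.0 - confidence:
        return "ham"
    return "undefined"
```

Calling `split_corpus` once for the spam messages and once for the ham messages keeps both halves balanced at about 500 messages of each class.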
This script collects 6 main values for each classifier:
- Spam/Ham detection rate - the number of messages that are correctly recognized as spam and as ham
- Spam FP rate - the number of false positives for spam: ham messages that are recognized as spam
- Ham FP rate - the number of false positives for ham: spam messages that are recognized as ham
- Ham and Spam FN rate - the number of messages that are not recognized as ham or spam (but are not classified as the opposite class either, meaning the classifier is uncertain)
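As an illustration of how these six values relate to each other, here is a minimal sketch that counts them from pairs of (actual class, classifier verdict). The function name `score`, the dictionary keys and the `"undefined"` label are assumptions made for this example, not part of the original script.

```python
from collections import Counter


def score(results):
    """Count the six values from (actual, verdict) pairs.

    actual is 'spam' or 'ham'; verdict is 'spam', 'ham' or 'undefined'.
    """
    counts = Counter(results)
    return {
        "spam_detected": counts[("spam", "spam")],
        "ham_detected": counts[("ham", "ham")],
        # Spam FP: an innocent (ham) message marked as spam.
        "spam_fp": counts[("ham", "spam")],
        # Ham FP: a spam message marked as ham.
        "ham_fp": counts[("spam", "ham")],
        # FN: the classifier could not decide either way.
        "spam_fn": counts[("spam", "undefined")],
        "ham_fn": counts[("ham", "undefined")],
    }
```

Dividing each count by the number of test messages in the corresponding class turns it into a rate.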
The worst error for a classifier is a spam false positive, since it marks an innocent message as spam. Ham false positives and false negatives are more tolerable: they just mean that you receive more spam than you want.
The raw results are pasted at the following gist.
Here are the corresponding graphs for detection rate and errors for the competitors.
Rspamd Bayes performs very well compared to the competitors. It provides a higher spam detection rate than both Dspam and Bogofilter. All competitors showed a similar spam false positive rate. However,
Dspam is more aggressive in marking messages as Ham (which is not bad because Bayes is the only check
Rspamd is also much faster in learning and testing. With the Redis backend, it learns 1k messages in less than 5 seconds. Dspam and Bogofilter both require about 30 seconds to learn.
I have not included SpamAssassin in the comparison since it uses a naive Bayes classifier similar to Bogofilter's. Hence, its quality is very close to Bogofilter's.
Furthermore, unlike the competitors, Rspamd provides many other checks and features. The goal of this particular benchmark was to compare only the Bayesian engines of the different spam filters. To summarise, I can conclude that the quality of the Bayes classifier in Rspamd is high enough to recommend it for production use or as a replacement for Bogofilter in your email system.