Bayes auto-learn - not happening


Scott Techlist
Centos7
Postfix 3.2.2
Amavisd-new 2.11.0
Spamassassin 3.4.0
Site-wide configuration

This is a new box and I've configured some conservative values for auto-learn.  I've enabled it properly AFAIK, but I can't see any sign of it working.  

I have these set in local.cf
use_bayes               1
bayes_auto_learn        1
bayes_auto_learn_threshold_nonspam -1.7
bayes_auto_learn_threshold_spam 10.0
# this is a filename prefix, not a directory per se
bayes_path              /etc/mail/bayes/bayes
bayes_file_mode         0666

-------------bayes prep ----------------
Start fresh for troubleshooting:
su amavis -c 'sa-learn --clear'

Add one spam manually and check tokens:

[root@tn2 mail]# su amavis -c 'sa-learn --dump magic'
0.000          0          3          0  non-token data: bayes db version
0.000          0          1          0  non-token data: nspam
0.000          0          0          0  non-token data: nham
0.000          0       2157          0  non-token data: ntokens

---------amavisd prep----------------

Restart amavisd/spamassassin just to be sure all configs are read.

------- ready to process -------------

The next high-scoring spam arrived and was sent to my spam mailbox.  It did NOT autolearn.  Nor did several others.

To troubleshoot, I took one that did not autolearn, and learned it manually by:
su amavis -c 'sa-learn -D --spam --showdots --mbox /home/mail/onespam'

Even though this message was slightly over the size limit, the log says it learned anyway:
-D log snippet:
---------------------
Aug  8 12:37:27.216 [13198] info: archive-iterator: skipping large message: 858 lines, 262203 bytes, limit 262144 bytes

Learned tokens from 1 message(s) (1 message(s) examined)
---------------------

Verified it learned:

[root@tn2 mail]# su amavis -c 'sa-learn --dump magic'
0.000          0          3          0  non-token data: bayes db version
0.000          0          2          0  non-token data: nspam


Partial header from that message:

X-Spam-Flag: YES
X-Spam-Score: 17.374
X-Spam-Level: *****************
X-Spam-Status: Yes, score=17.374 tag=-9999 tag2=5 kill=6.31
        tests=[RCVD_IN_BRBL_LASTEXT=1.644, RCVD_IN_DNSWL_NONE=-0.0001,
        RCVD_IN_RP_RNBL=1.284, RCVD_IN_SBL_CSS=3.558, RCVD_IN_SORBS_WEB=1.5,
        RP_MATCHES_RCVD=-0.001, SUSPICIOUS_RECIPS=2.497,
        URIBL_ABUSE_SURBL=1.948, URIBL_BLACK=1.7, URIBL_DBL_SPAM=2.5,
        URIBL_SBL=0.644, URIBL_SBL_A=0.1] autolearn=no autolearn_force=no

Why aren't my spams getting auto-learned?  If sa-learn "ate" it, shouldn't auto-learn too?

I know that by default Bayes does not start scoring anything until 200 messages have been learned, but I understand it should still learn without issue.

Can't figure out what's wrong...

Re: Bayes auto-learn - not happening

Benny Pedersen-2
Scott Techlist wrote on 2017-08-08 20:06:

> X-Spam-Flag: YES
> X-Spam-Score: 17.374
> X-Spam-Level: *****************
> X-Spam-Status: Yes, score=17.374 tag=-9999 tag2=5 kill=6.31
>         tests=[RCVD_IN_BRBL_LASTEXT=1.644, RCVD_IN_DNSWL_NONE=-0.0001,
>         RCVD_IN_RP_RNBL=1.284, RCVD_IN_SBL_CSS=3.558,
> RCVD_IN_SORBS_WEB=1.5,
>         RP_MATCHES_RCVD=-0.001, SUSPICIOUS_RECIPS=2.497,
>         URIBL_ABUSE_SURBL=1.948, URIBL_BLACK=1.7, URIBL_DBL_SPAM=2.5,
>         URIBL_SBL=0.644, URIBL_SBL_A=0.1] autolearn=no
> autolearn_force=no

> Can't figure out what's wrong...

some of the listed tests have tflags that disable autolearn

there is nothing to fix here

Re: Bayes auto-learn - not happening

RW-15
In reply to this post by Scott Techlist
On Tue, 8 Aug 2017 13:06:26 -0500
Scott Techlist wrote:

> Centos7
> Postfix 3.2.2
> Amavisd-new 2.11.0
> Spamassassin 3.4.0
> Site-wide configuration
>
> This is a new box and I've configured some conservative values for
> auto-learn.  I've enabled it properly AFAIK, but I can't see any sign
> of it working.  
>
> I have these set in local.cf
> use_bayes               1
> bayes_auto_learn        1
> bayes_auto_learn_threshold_nonspam -1.7
> bayes_auto_learn_threshold_spam 10.0
> # this is a filename prefix, not a directory per se
> bayes_path              /etc/mail/bayes/bayes
> bayes_file_mode         0666
>
> -------------bayes prep ----------------
> Start fresh for troubleshooting:
> su amavis -c 'sa-learn --clear'
>
> Add one spam manually and check tokens:
>
> [root@tn2 mail]# su amavis -c 'sa-learn --dump magic'
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0          1          0  non-token data: nspam
> 0.000          0          0          0  non-token data: nham
> 0.000          0       2157          0  non-token data: ntokens
>
> ---------amavisd prep----------------
>
> Restart amavisd/spamassassin just to be sure all configs read..
>
> ------- ready to process -------------
>
> The next high scoring spam arrives, it was sent to my spam mailbox.
> It did NOT autolearn.  Nor did several others.  
>
> To troubleshoot, I took one that did not autolearn, and learned it
> manually by:
> su amavis -c 'sa-learn -D --spam --showdots --mbox /home/mail/onespam'
>
> even though this message was slightly over the size limit, the log
> says it learned anyway:
> -D log snippet:
> ---------------------
> Aug  8 12:37:27.216 [13198] info: archive-iterator: skipping large message: 858 lines, 262203 bytes, limit 262144 bytes
>
> Learned tokens from 1 message(s) (1 message(s) examined)
> ---------------------
>
> Verified it learned:
>
> [root@tn2 mail]# su amavis -c 'sa-learn --dump magic'
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0          2          0  non-token data: nspam
>
>
> Partial header from that message:
>
> X-Spam-Flag: YES
> X-Spam-Score: 17.374
> X-Spam-Level: *****************
> X-Spam-Status: Yes, score=17.374 tag=-9999 tag2=5 kill=6.31
>         tests=[RCVD_IN_BRBL_LASTEXT=1.644, RCVD_IN_DNSWL_NONE=-0.0001,
>         RCVD_IN_RP_RNBL=1.284, RCVD_IN_SBL_CSS=3.558,
> RCVD_IN_SORBS_WEB=1.5, RP_MATCHES_RCVD=-0.001,
> SUSPICIOUS_RECIPS=2.497, URIBL_ABUSE_SURBL=1.948, URIBL_BLACK=1.7,
> URIBL_DBL_SPAM=2.5, URIBL_SBL=0.644, URIBL_SBL_A=0.1] autolearn=no
> autolearn_force=no
>
> Why aren't my spams getting auto-learned?  If sa-learn "ate" it,
> shouldn't auto-learn too?

To autolearn spam you need 3 points from the body and 3 from headers.
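
For readers following along, the gate described above can be sketched roughly as below. This is a simplified illustration and not the plugin's actual code: the function name autolearn_decision is made up, the thresholds default to the values from Scott's local.cf, and the inputs are assumed to be already split into header and body points. The real AutoLearnThreshold plugin applies further conditions that are left out here (for example, it skips rules carrying certain tflags).

```shell
#!/bin/sh
# Simplified sketch of the spam/ham autolearn gate (NOT the real plugin code).
# Usage: autolearn_decision SCORE HEAD_POINTS BODY_POINTS [HAM_THRESHOLD] [SPAM_THRESHOLD]
# SCORE is assumed to be the score used for autolearning; HEAD_POINTS and
# BODY_POINTS are the points contributed by header and body rules.
autolearn_decision() {
    awk -v score="$1" -v head="$2" -v body="$3" \
        -v ham_max="${4:--1.7}" -v spam_min="${5:-10.0}" 'BEGIN {
        if (score + 0 < ham_max + 0) {
            print "ham"
        } else if (score + 0 >= spam_min + 0) {
            # A high total is not enough: header and body must both contribute.
            if (head + 0 >= 3 && body + 0 >= 3) print "spam"
            else print "no"
        } else {
            print "no"
        }
    }'
}
```

With made-up splits, `autolearn_decision 17.4 2.0 15.4` prints `no` (the total is over the spam threshold, but the header side has fewer than 3 points), while `autolearn_decision 20.7 8.5 4.0` prints `spam`.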

Re: Bayes auto-learn - not happening

Scott
In reply to this post by Scott Techlist
The "3 points" criteria does not apply to manually learning via sa-update then?


Re: Bayes auto-learn - not happening

Scott
In reply to this post by Benny Pedersen-2
> some of the listed tags have tflags that disable autolearn
>
> there is nothing to fix here

Benny:  Will you elaborate for me please?  So I can understand and self-help.

Better, what test flags in general disable auto-learn?

Re: Bayes auto-learn - not happening

Benny Pedersen-2
In reply to this post by Scott
Scott wrote on 2017-08-08 22:04:
> The "3 points" criteria does not apply to manually learning via
> sa-update then?

Typo? sa-update does not learn, it just updates rules. Did you mean
sa-learn?

When sa-learn is used, it's not autolearn, so the limits are not applied.

Re: Bayes auto-learn - not happening

Benny Pedersen-2
In reply to this post by Scott
Scott wrote on 2017-08-08 22:06:

> Better, what test flags in general disable auto-learn?

tflags foo-rule-name noautolearn

and you can force autolearn based on the rule name

https://lists.gt.net/spamassassin/users/184996

there is a long thread there that explains it more

and all conditions must be met for learning
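
To see which rules carry that flag on a given box, something like the following can help. It is only a sketch: the function name list_noautolearn_rules is made up, it assumes a grep that supports -r and --include (GNU grep does), and you have to pass the directories that hold your rule files yourself.

```shell
#!/bin/sh
# List the names of rules whose tflags line contains "noautolearn".
# Usage: list_noautolearn_rules DIR...
# Commented-out tflags lines are ignored because the pattern is anchored
# at the start of the line.
list_noautolearn_rules() {
    grep -rhE --include='*.cf' \
        '^[[:space:]]*tflags[[:space:]]+[^[:space:]]+[[:space:]].*noautolearn' "$@" |
        awk '{ print $2 }' | sort -u
}
```

Typical arguments would be /etc/mail/spamassassin plus wherever the stock and sa-update rules live on your system (often /usr/share/spamassassin and /var/lib/spamassassin, but that depends on the distribution). The resulting names can then be compared against the tests=[...] list of a message that did not autolearn.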

Re: Bayes auto-learn - not happening

Scott
In reply to this post by Benny Pedersen-2
Apologies, I meant sa-learn.  Brain fart.

Thanks for the clarification on the 3-point rule.

I've had a bunch of them come through.  They all get autolearn=no, or a few say "unavailable" like the sample below.  From my own digging, I gather that "unavailable" may mean the message was already learned, or something else that the wiki does not spell out.  But since the database is empty, "already learned" cannot be the reason for "unavailable" in this case.

X-Spam-Status: Yes, score=20.678 tag=-9999 tag2=5 kill=6.4
        tests=[DATE_IN_PAST_03_06=1.076, DCC_CHECK=3.2, DIGEST_MULTIPLE=0.001,
        HTML_EXTRA_CLOSE=0.001, HTML_MESSAGE=0.001,
        HTML_MIME_NO_HTML_TAG=0.635, MIME_HTML_ONLY=1.105, MISSING_MID=0.14,
        NORMAL_HTTP_TO_IP=0.001, RAZOR2_CF_RANGE_51_100=0.365,
        RAZOR2_CF_RANGE_E8_51_100=2.43, RAZOR2_CHECK=2.5, RDNS_NONE=1.274,
        SPF_HELO_SOFTFAIL=3, SPF_SOFTFAIL=3, SUBJ_DOLLARS=0.001,
        URIBL_ABUSE_SURBL=1.948] autolearn=unavailable autolearn_force=no

Does this one have the requisite 3-point match?  I don't understand how to tell yet.
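
As a first step toward answering that by hand, the tests=[...] list can be pulled apart into one rule per line. The sketch below (list_hits is a made-up name) only parses the header; it cannot tell header rules from body rules, since that information lives in the rule definitions and not in X-Spam-Status.

```shell
#!/bin/sh
# Read an X-Spam-Status header (possibly folded over several lines) on stdin
# and print "score<TAB>rule" for each hit, highest score first.
list_hits() {
    tr -d '\n' |
        sed -e 's/^.*tests=\[//' -e 's/].*$//' |
        tr ',' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
        awk -F= 'NF == 2 { printf "%s\t%s\n", $2, $1 }' |
        sort -rn
}
```

Paste the header into a file and run `list_hits < header.txt` (the file name is just an example). The rules at the top of the output are the ones worth looking up first to see what kind of rule they are and which tflags they carry.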

I've cleared the db again.  Will let it run to see if it learns *anything*.  So far I have not seen that happen.  Surely something will get a 3 way match.


Re: Bayes auto-learn - not happening

RW-15
In reply to this post by Scott
On Tue, 8 Aug 2017 13:04:16 -0700 (MST)
Scott wrote:

> The "3 points" criteria does not apply to manually learning

No it's just a sanity check to reduce mistraining. If you can, don't
use autotraining at all.  

Re: Bayes auto-learn - not happening

Benny Pedersen-2
In reply to this post by Scott
Scott wrote on 2017-08-08 22:19:

> Does this one have the requisite 3-point match?  I don't understand how
> to tell yet.

spamassassin -D 2>&1 -t mail.msg | less

should show why
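
The full debug output is long, so it can help to filter it down to the lines about learning. A small sketch, assuming the relevant messages are logged on debug channels named "bayes" (seen elsewhere in this thread) and "learn" (an assumption on my part); learn_debug_lines is a made-up helper name.

```shell
#!/bin/sh
# Keep only the debug lines from the "learn" and "bayes" channels.
# Reads spamassassin -D output on stdin.
learn_debug_lines() {
    grep -E 'dbg: (learn|bayes):'
}

# Hypothetical usage, run as the user that owns the Bayes files:
#   spamassassin -D -t < mail.msg 2>&1 | learn_debug_lines
```

If nothing comes out, fall back to reading the unfiltered output, since the channel names may differ between versions.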

Re: Bayes auto-learn - not happening

Scott
Cleared the database, ran below on the same message:

su amavis -c 'spamassassin -D 2>&1 -t onespam' | less

I didn't see any errors obvious to me.  

It recreated the databases and added this message as expected.


I don't know how to tell why it would not have auto-learned.  

Can you tell/ teach me from this?


Content analysis details:   (17.7 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.9 URIBL_ABUSE_SURBL      Contains an URL listed in the ABUSE SURBL blocklist
                            [URIs: 145.239.41.28]
 0.0 SUBJ_DOLLARS           Subject starts with dollar amount
 3.0 SPF_HELO_SOFTFAIL      SPF: HELO does not match SPF record (softfail)
 1.1 DATE_IN_PAST_03_06     Date: is 3 to 6 hours before Received: date
 0.0 NORMAL_HTTP_TO_IP      URI: URI host has a public dotted-decimal IPv4
                            address
 0.0 HTML_EXTRA_CLOSE       BODY: HTML contains far too many close tags
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.1 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 3.2 DCC_CHECK              Detected as bulk mail by DCC (dcc-servers.net)
 2.5 RAZOR2_CHECK           Listed in Razor2 (http://razor.sf.net/)
 2.4 RAZOR2_CF_RANGE_E8_51_100 Razor2 gives engine 8 confidence level
                            above 50%
                            [cf: 100]
 0.4 RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50%
                            [cf: 100]
 0.0 DIGEST_MULTIPLE        Message hits more than one network digest check
 0.6 HTML_MIME_NO_HTML_TAG  HTML-only message, but there is no HTML tag
 0.1 MISSING_MID            Missing Message-Id: header
 1.3 RDNS_NONE              Delivered to internal network by a host with no rDNS

Aug  8 15:47:11.098 [17077] dbg: check: tagrun - tag DKIMDOMAIN is still blocking action 0
Aug  8 15:47:11.105 [17077] dbg: plugin: Mail::SpamAssassin::Plugin::MIMEHeader=HASH(0x2ccc328) implements 'finish_tests', priority 0
Aug  8 15:47:11.105 [17077] dbg: plugin: Mail::SpamAssassin::Plugin::Check=HASH(0x2e04e38) implements 'finish_tests', priority 0
Aug  8 15:47:11.116 [17077] dbg: netset: cache trusted_networks hits/attempts: 15/17, 88.2 %




RE: Bayes auto-learn - not happening

Scott Techlist
>you need to train your bayes *by hand* to start with - how do you expect
>bayes classification with no hints after purging the database - train 200
>ham and spam mails and *after that* look further

Reindl:

Thanks.  I want to use some auto-training with very conservative thresholds set.  All of the messages I've checked would have classified correctly via autolearn comfortably in those ranges.

The 200 threshold is for USING bayes, but it is not an auto-learning requirement.  Or that was my clear understanding from many posts.  I saw several old threads where others suggested something similar but were corrected.  Maybe that has changed, I don't know.

My concern is that auto-learn is not functioning properly.  I use Amavisd, which calls spamassassin and has its own issues.  I'm trying to make sure my system is operating properly.  To me it appears it is not.

No hint should be necessary for it to learn a spam.  Only to use bayes to score anything.  I get that.  No?





Re: Bayes auto-learn - not happening

Scott
In reply to this post by Benny Pedersen-2
Benny:
re tflags
> tflags foo-rule-name noautolearn
> and you can force autolearn based on rulename
> https://lists.gt.net/spamassassin/users/184996
> there is a long thread there that explain it more
> and all condition must be met for learning

I read the thread.  There is nothing there concrete enough for me to latch onto.  I mean I get the gist of it, but there are no details on how to look at my tests and see if I have the requisite 3 points from each part.



Re: Bayes auto-learn - not happening

Scott
I was getting my commands mixed up, been looking at this too long.  When I ran

su amavis -c 'spamassassin -D 2>&1 -t onespam'

That caused it to LEARN the spam.  Database went from not there to one learned.  Auto-learn apparently.  That's what it should have done when it arrived.

Brand new spam arrives.  It gets
autolearn=unavailable.

X-Spam-Status: Yes, score=20.704 tag=-9999 tag2=5 kill=6.4
        tests=[DATE_IN_PAST_06_12=1.103, DCC_CHECK=3.2, DIGEST_MULTIPLE=0.001,
        HTML_EXTRA_CLOSE=0.001, HTML_MESSAGE=0.001,
        HTML_MIME_NO_HTML_TAG=0.635, MIME_HTML_ONLY=1.105, MISSING_MID=0.14,
        NORMAL_HTTP_TO_IP=0.001, RAZOR2_CF_RANGE_51_100=0.365,
        RAZOR2_CF_RANGE_E8_51_100=2.43, RAZOR2_CHECK=2.5, RDNS_NONE=1.274,
        SPF_HELO_SOFTFAIL=3, SPF_SOFTFAIL=3, URIBL_ABUSE_SURBL=1.948]
        autolearn=unavailable autolearn_force=no

That implies no auto-learn because the token exists (or there was something else) as I understand it.  So I try to learn that one spam again...

I had to increase the size limit via:

su amavis -c 'sa-learn -D --spam --showdots  --max-size=6000000 --mbox /home/mail/twospam'

Aug  8 16:35:23.567 [18045] dbg: bayes: learned '419769464db0fabb0f1220f9ae0cf12931ad7076@sa_generated', atime: 1502226537
Learned tokens from 1 message(s) (1 message(s) examined)

And it learned it.  So autolearn=unavailable was NOT due to the token already being there.

Is there a size limit built into autolearn?  





Re: Bayes auto-learn - not happening

Scott
Another new one  big score, auto-learn disabled.  This one is fairly small.  

X-Spam-Status: Yes, score=29.428 tag=-9999 tag2=5 kill=6.4
        tests=[DATE_IN_PAST_03_06=1.076, DCC_CHECK=3.2, DIGEST_MULTIPLE=0.001,
        FILL_THIS_FORM=0.001, FROM_MISSPACED=0.001, FROM_MISSP_SPF_FAIL=1,
        HEADER_FROM_DIFFERENT_DOMAINS=0.001, HEXHASH_WORD=1,
        HTML_EXTRA_CLOSE=0.001, HTML_MESSAGE=0.001,
        HTML_MIME_NO_HTML_TAG=0.635, MIME_HTML_ONLY=1.105, MISSING_MID=0.14,
        NORMAL_HTTP_TO_IP=0.001, RAZOR2_CF_RANGE_51_100=0.365,
        RAZOR2_CF_RANGE_E8_51_100=2.43, RAZOR2_CHECK=2.5,
        RCVD_IN_BRBL_LASTEXT=1.644, RDNS_NONE=1.274, SPF_FAIL=4,
        SPF_HELO_FAIL=4, STYLE_GIBBERISH=3.093,
        T_HTML_TAG_BALANCE_CENTER=0.01, URIBL_ABUSE_SURBL=1.948,
        WEIRD_QUOTING=0.001] autolearn=unavailable autolearn_force=no

Can you tell if this one has the 3 point match?


Re: Bayes auto-learn - not happening

Ian Zimmerman-3
On 2017-08-08 15:20, Scott wrote:

> Another new one  big score, auto-learn disabled.  This one is fairly small.  
>
> X-Spam-Status: Yes, score=29.428 tag=-9999 tag2=5 kill=6.4
>         tests=[DATE_IN_PAST_03_06=1.076, DCC_CHECK=3.2,
> DIGEST_MULTIPLE=0.001,
>         FILL_THIS_FORM=0.001, FROM_MISSPACED=0.001, FROM_MISSP_SPF_FAIL=1,
>         HEADER_FROM_DIFFERENT_DOMAINS=0.001, HEXHASH_WORD=1,
>         HTML_EXTRA_CLOSE=0.001, HTML_MESSAGE=0.001,
>         HTML_MIME_NO_HTML_TAG=0.635, MIME_HTML_ONLY=1.105, MISSING_MID=0.14,
>         NORMAL_HTTP_TO_IP=0.001, RAZOR2_CF_RANGE_51_100=0.365,
>         RAZOR2_CF_RANGE_E8_51_100=2.43, RAZOR2_CHECK=2.5,
>         RCVD_IN_BRBL_LASTEXT=1.644, RDNS_NONE=1.274, SPF_FAIL=4,
>         SPF_HELO_FAIL=4, STYLE_GIBBERISH=3.093,
>         T_HTML_TAG_BALANCE_CENTER=0.01, URIBL_ABUSE_SURBL=1.948,
>         WEIRD_QUOTING=0.001] autolearn=unavailable autolearn_force=no
>
> Can you tell if this one has the 3 point match?

Scott,

when I tried to use the autolearn feature I was as confused as you are.
As far as I remember, the 3 point each from header and body is not the
only requirement; the full truth is that some rules are "privileged" and
can contribute to autolearning while others cannot.  I found it opaque
in the extreme and essentially unpredictable, and so I stopped
autolearning and hacked up some scripts that put duplicate of each ham
message into a folder which is then processed by sa-learn from a
cronjob, with sufficient delay that I can review the contents and remove
any false negatives; and similarly with spam, excluding the utterly
horrible category which just goes to /dev/null.

It may not be possible for you to adopt such a process if your volume is
high, but OTOH in that case you probably have users to help you :)

I think this is what RW is telling you, too.

FWIW, this is documented (sort of) by:

perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold

--
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
Do obvious transformation on domain to reply privately _only_ on Usenet.

Re: Bayes auto-learn - not happening

John Hardin
On Tue, 8 Aug 2017, Ian Zimmerman wrote:

> I stopped
> autolearning and hacked up some scripts that put duplicate of each ham
> message into a folder which is then processed by sa-learn from a
> cronjob, with sufficient delay that I can review the contents and remove
> any false negatives; and similarly with spam, excluding the utterly
> horrible category which just goes to /dev/null.

This is generally a good idea, unless you have a really high-volume
environment - are you an ISP?

Keeping your training corpora around lets you review it for
misclassifications and retrain very easily if things go off the rails.

Autolearn may be useful once you are initially manually trained. Then you
can focus on manually training the FPs and FNs.

It's also important to be careful what you train with. If you allow users
to submit messages for training (particularly a global bayes) then you
either need to have strong trust in those users' judgement, or review what
they submit before training with it.

--
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  [hidden email]    FALaholic #11174     pgpk -a [hidden email]
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Joan Peterson is like that: you expect at least a pseudological
   argument, but instead you get the weird ramblings of a woman with
   the critical thinking abilities of an 18th century peasant.  -- Ken
-----------------------------------------------------------------------
  7 days until the 72nd anniversary of the end of World War II

Re: Bayes auto-learn - not happening

Matus UHLAR - fantomas
In reply to this post by Scott
On 08.08.17 14:38, Scott wrote:
>Brand new spam arrives.  It gets
>autolearn=unavailable.
[...]
>su amavis -c 'sa-learn -D --spam --showdots  --max-size=6000000 --mbox
>/home/mail/twospam'
>
>Aug  8 16:35:23.567 [18045] dbg: bayes: learned
>'419769464db0fabb0f1220f9ae0cf12931ad7076@sa_generated', atime: 1502226537
>Learned tokens from 1 message(s) (1 message(s) examined)
>
>At it learned it.  So autolearn=unavailable was NOT due to the token already
>there.

autolearn=unavailable is apparently due to an inaccessible bayes database.

Try running "ls -la ~amavis/.spamassassin/" - apparently permissions make
the directory or the files in it unwritable for the amavis user.
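
A listing shows the mode bits, but a direct write test run as the right user is less error-prone. A minimal sketch (check_writable is a made-up name); note that it tests the CURRENT user, so it has to be run from a shell started as amavis to say anything about what amavisd sees.

```shell
#!/bin/sh
# Report, for each path given, whether the current user can write to it.
check_writable() {
    for p in "$@"; do
        if [ -w "$p" ]; then
            printf 'writable      %s\n' "$p"
        else
            printf 'NOT writable  %s\n' "$p"
        fi
    done
}
```

Useful arguments here would be ~amavis/.spamassassin and the directory and files behind the bayes_path setting in local.cf.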
--
Matus UHLAR - fantomas, [hidden email] ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Atheism is a non-prophet organization.

Re: Bayes auto-learn - not happening

David Jones
In reply to this post by Ian Zimmerman-3
On 08/08/2017 08:02 PM, Ian Zimmerman wrote:

> On 2017-08-08 15:20, Scott wrote:
>
>> Another new one  big score, auto-learn disabled.  This one is fairly small.
>>
>> X-Spam-Status: Yes, score=29.428 tag=-9999 tag2=5 kill=6.4
>>          tests=[DATE_IN_PAST_03_06=1.076, DCC_CHECK=3.2,
>> DIGEST_MULTIPLE=0.001,
>>          FILL_THIS_FORM=0.001, FROM_MISSPACED=0.001, FROM_MISSP_SPF_FAIL=1,
>>          HEADER_FROM_DIFFERENT_DOMAINS=0.001, HEXHASH_WORD=1,
>>          HTML_EXTRA_CLOSE=0.001, HTML_MESSAGE=0.001,
>>          HTML_MIME_NO_HTML_TAG=0.635, MIME_HTML_ONLY=1.105, MISSING_MID=0.14,
>>          NORMAL_HTTP_TO_IP=0.001, RAZOR2_CF_RANGE_51_100=0.365,
>>          RAZOR2_CF_RANGE_E8_51_100=2.43, RAZOR2_CHECK=2.5,
>>          RCVD_IN_BRBL_LASTEXT=1.644, RDNS_NONE=1.274, SPF_FAIL=4,
>>          SPF_HELO_FAIL=4, STYLE_GIBBERISH=3.093,
>>          T_HTML_TAG_BALANCE_CENTER=0.01, URIBL_ABUSE_SURBL=1.948,
>>          WEIRD_QUOTING=0.001] autolearn=unavailable autolearn_force=no
>>
>> Can you tell if this one has the 3 point match?
>
> Scott,
>
> when I tried to use the autolearn feature I was as confused as you are.
> As far as I remember, the 3 point each from header and body is not the
> only requirement; the full truth is that some rules are "privileged" and
> can contribute to autolearning while others cannot.  I found it opaque
> in the extreme and essentially unpredictable, and so I stopped
> autolearning and hacked up some scripts that put duplicate of each ham
> message into a folder which is then processed by sa-learn from a
> cronjob, with sufficient delay that I can review the contents and remove
> any false negatives; and similarly with spam, excluding the utterly
> horrible category which just goes to /dev/null.
>
> It may not be possible for you to adopt such a process if your volume is
> high, but OTOH in that case you probably have users to help you :)
>
> I think this is what RW is telling you, too.
>
> FWIW, this is documented (sort of) by:
>
> perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold
>

Same here.  I had a little success with autolearn.  When I started
splitting out messages into a spam and ham folder and using a cron
script to train explicitly, the BAYES hits became very accurate and
helped with zero-hour spam which is the hardest to block.

I set up an iRedMail server on a local-only subdomain and send/BCC copies
of messages over to it.  Then I can use simple Inbox rules to sort or
discard them.  Then I cron'd spam and ham training based on the Maildir
"cur" folders.  This requires me to do a quick scan of the unread
messages.  When I mark them as read, then they get sa-learn'd.  Takes a
few minutes a day and drastically improved the mail filtering.
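
A cron job along these lines could look roughly like the sketch below. It is not my actual script: the function name and the SA_LEARN override are made up for illustration, and the two directories are whatever Maildir "cur" folders hold your reviewed spam and ham.

```shell
#!/bin/sh
# Train Bayes from two reviewed Maildir folders.
# SA_LEARN can be overridden (e.g. SA_LEARN="echo sa-learn") for a dry run.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Usage: train_from_maildirs SPAM_CUR_DIR HAM_CUR_DIR
train_from_maildirs() {
    # $SA_LEARN is unquoted on purpose so that an override such as
    # "echo sa-learn" is split into a command and its first argument.
    $SA_LEARN --spam "$1" &&
    $SA_LEARN --ham "$2"
}
```

Run it as the same user that SpamAssassin runs as during scanning, so the training lands in the same Bayes database and the file ownership stays consistent.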

A side effect of this has allowed me to easily spot some new spam
campaigns and messages that are scoring just below the block threshold
so I can add them to local custom rules.  Sometimes these are legit
senders with good opt-out so I add them to a whitelist_auth entry.

--
David Jones

RE: Bayes auto-learn - not happening

Scott
In reply to this post by Scott Techlist
Update:  Still NOT working, but I'm giving it hell trying to figure out why :)

First a couple of answers to others' questions:
- John, others: not an ISP.  "High" is relative, I'm sure, but the volume is much higher than would let me duplicate and review every flagged message.  Right now running at about 10% before I migrate one of my larger domains.  Mail is relayed to Exchange servers.  Users do not have IMAP accounts on the box.  There are a few local users with POP only.  I don't configure or allow anyone to submit messages for training directly.

- re no, or careful auto-training.  I get it.  I'm migrating from a server that's run for years with auto-learn on set at conservative learn values.  Never had any trouble with it thank goodness.  As I look at the messages that would be autolearned, I've never found one that would have learned that should not have in my corpus.  The volume would just be too high to personally go through each one of them myself.  I have had "problem" users that get a lot of spam misses and I plan to set up a way for them to submit their spam to me (not autolearn) for review and manual training as needed.  

- Matus:  re:" autolearn=unavailable apparently due to not accessible bayes database [due to permissions]".  I hope you are right.  That would make sense to me.  See below please.  I think I listed them all.  Config and permissions look good to me, I'm grateful to have anything I missed pointed out by an experienced eye.

My old server, running embarrassingly old versions of everything, works great.  So auto-learn in general has been a good fit for my environment.  I get it that it's not for everyone.  But at least it SHOULD work, and let me choose to tweak it or turn it off.  As far as I can tell it is not working, at all.

So here's where I am:

1.  I stepped back and went through all my configurations carefully.  spamassassin is being run via amavisd, as the amavis user.  Site wide config, no other users have direct access.  POP accounts and relay accounts only.

2.  From prior research before asking for help, I understood no spam was necessary for auto-learn to work but one person here said I had to be at the minimum (200 default) before it would.  So, to rule that out as the issue, I manually fed it plenty of spam and ham.  For others who might read this thread archived, I was having trouble getting enough learned due to the default size limit my version of SA/sa-learn had.  With some digging I found out how to raise that limit and then I had plenty of spam to feed:
su amavis -c 'sa-learn -D --spam --showdots --max-size=1000000 --mbox /home/mail/spam'

[root@mail2 amavisd]# su amavis -c 'sa-learn --dump magic'
0.000          0          3          0  non-token data: bayes db version
0.000          0        349          0  non-token data: nspam
0.000          0        478          0  non-token data: nham
0.000          0     166030          0  non-token data: ntokens
0.000          0 1501594564          0  non-token data: oldest atime
0.000          0 1502289189          0  non-token data: newest atime

3.  Next up were questions about the config and permissions.  I checked my setup and it looked OK, but I even opened some directories up to 777 for testing.
This is my config.  I'd be grateful if anyone who sees anything wrong would point it out.
I include the amavis stuff just to show it is running as, and invoked by, the amavis user.

3a. amavis
in /usr/lib/systemd/system/amavisd.service
User=amavis
Group=amavis
ExecStart=/usr/sbin/amavisd -c /etc/amavisd/amavisd.conf

> amavis user's home dir per /etc/passwd is:
/var/spool/amavisd
verified with cd ~amavis

3b. local.cf
> My spamassassin local.cf is at:
/etc/mail/spamassassin/local.cf

> verified this is the one being used by putting an error
> line in it and restarting amavisd.  It complains about the error.
> Fixed that, of course, and continued...

> in local.cf I have these related settings:
use_bayes               1
bayes_auto_learn        1
bayes_auto_learn_threshold_nonspam -1.7
bayes_auto_learn_threshold_spam 10.0
bayes_path              /etc/mail/bayes/bayes
bayes_file_mode         0777

3c. bayes
> for troubleshooting I set the permissions to 777 on /etc/mail/bayes and its files
> This is the only occurrence of the "bayes" files on the server
[root@mail2 amavisd]# ls -la /etc/mail/bayes
total 4196
drwxrwxrwx 2 amavis amavis    4096 Aug  9 13:49 .
drwxr-xr-x 4 amavis amavis    4096 Aug  3 13:02 ..
-rwxrwxrwx 1 amavis amavis   86016 Aug  9 09:51 bayes_seen
-rwxrwxrwx 1 amavis amavis 5246976 Aug  9 13:49 bayes_toks

3d. amavis spamassassin folder settings
> For amavis, which is calling spamassassin via its
> perl libraries (I am not running spamd),
> I have its related configuration parts as:
$MYHOME = '/var/spool/amavisd';   # a convenient default for other settings, -H
$TEMPBASE = "$MYHOME/tmp";   # working directory, needs to exist, -T
$ENV{TMPDIR} = $TEMPBASE;    # environment variable TMPDIR, used by SA, etc.
$db_home   = "$MYHOME/db";        # dir for bdb nanny/cache/snmp databases, -D
#$helpers_home = "$MYHOME/var";  # working directory for SpamAssassin, -S
$helpers_home = "$MYHOME";  # working directory for SpamAssassin, -S

3e. spamassassin directory
> And for spamassassin, its files are being placed in the amavisd home directory as configured in amavisd.conf.
> I am careful to only run sa-update, or SA debug commands, as the amavis user so as not to create any other
> .spamassassin folders under root, etc.
> this is the only occurrence of .spamassassin on the server:
[root@mail2 amavisd]# locate .spamassassin
/var/spool/amavisd/.spamassassin
/var/spool/amavisd/.spamassassin/user_prefs

3f. amavis (spamassassin's user) home directory
[root@mail2 amavisd]# ls -la /var/spool/amavisd
total 32
drwxr-x--- 6 amavis amavis 4096 Aug  9 20:49 .
drwxr-xr-x 8 root   root   4096 Nov  5  2016 ..
-rw------- 1 amavis amavis  101 Aug  9 11:17 .bash_history
-rw-r--r-- 1 amavis amavis    0 Aug  9 20:49 black.lst
drwxr-x--- 2 amavis amavis 4096 Aug  9 20:30 db
drwxr-x--- 2 amavis amavis 4096 Apr 19 07:28 quarantine
drwx------ 2 amavis amavis 4096 Aug  8 15:32 .spamassassin
drwxr-x--- 5 amavis amavis 4096 Aug 10 08:26 tmp
-rw-r--r-- 1 amavis amavis   37 Aug  7 19:28 white.lst

3g.  .spamassassin folder
[root@mail2 amavisd]# ls -la /var/spool/amavisd/.spamassassin
total 12
drwx------ 2 amavis amavis 4096 Aug  8 15:32 .
drwxr-x--- 6 amavis amavis 4096 Aug  9 20:49 ..
-rw-r--r-- 1 amavis amavis 1869 Aug  8 15:32 user_prefs


4. Logging
I managed to get Amavisd configured to let the more verbose rule listing for the header, and score details in the log come through for my troubleshooting as well.

5. Results:

After running this config now, with a loaded bayes database, it has yet to auto-learn a single spam (or ham).  Just through yesterday my spam quarantine has over 50 pretty high scoring spams in it.  I've studied tflags and now understand what they are (for others here's a good link):
http://commons.oreilly.com/wiki/index.php/SpamAssassin/SpamAssassin_Rules

I understand SA requires at least 3 points from the header and 3 points from the body, to auto-learn as spam.  I understand some tflags preclude the use of the test in the autolearn score.  I understand bayes points don't count.  But surely one of the 50 high scores I caught yesterday qualified.  Yet, no autolearn.  Always autolearn=unavailable or no.  I've turned on verbose debugging for bayes but I don't see any errors or feedback on reasons for the no-learn.

Looked at yesterday's log:

grep -c 'autolearn=unavailable' /var/log/maillog.1
60
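
To see all outcomes at once and not just "unavailable", the same log can be tallied by autolearn value. A small sketch (autolearn_tally is a made-up name); it assumes each scanned message produces a log line containing autolearn=<outcome>, as in the amavisd sample further down.

```shell
#!/bin/sh
# Count each autolearn outcome found in mail log lines read from stdin.
# Prints "count outcome" pairs, most frequent first.
autolearn_tally() {
    grep -oE 'autolearn=[a-z]+' | sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

Usage would be `autolearn_tally < /var/log/maillog.1`. A log showing only `no` and `unavailable`, with never a `spam` or `ham`, matches what I am describing here.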

amavisd has the option of giving a verbose log line with all the score details.  It also adds an "autolearnscore" to the log.  I'm not sure how that is calculated, but it's interesting anyway.  It would be great if it were h/b/t (header/body/total).  Anyway, a sample:

Aug 10 00:38:39 mail2 amavis[15959]: (15959-08) Blocked SPAM {DiscardedInbound,Quarantined}, [89.43.62.101]:47955 [89.43.62.101] ESMTP/LMTP <[hidden email]> -> <[hidden email]>, (ESMTP://[89.43.62.101]:47955), quarantine: [hidden email], Queue-ID: 7F64A70, mail_id: yxtV5c7b1N8r, b: tDtWV84sR, Hits: 23.553, size: 365419, Subject: "Joanna Gaines Drops Bombshell.", From: <[hidden email]>, helo=hewis.versateye.com, Tests: [BAYES_999=0.2,BAYES_99=3.5,DATE_IN_PAST_03_06=1.592,DCC_CHECK=3.2,DIGEST_MULTIPLE=0.293,HTML_MESSAGE=0.001,HTML_MIME_NO_HTML_TAG=0.377,MIME_HTML_ONLY=0.723,MISSING_MID=0.497,NORMAL_HTTP_TO_IP=0.001,RAZOR2_CF_RANGE_51_100=0.5,RAZOR2_CF_RANGE_E8_51_100=1.886,RAZOR2_CHECK=2.5,RCVD_IN_BRBL_LASTEXT=1.449,RDNS_NONE=0.793,SPF_HELO_PASS=-0.001,SPF_PASS=-0.001,STYLE_GIBBERISH=3.093,URIBL_ABUSE_SURBL=1.25,URIBL_BLACK=1.7], autolearn=unavailable autolearn_force=no, autolearnscore=21.113, 5061 ms

As usual, autolearn=unavailable.  

My suspicion is many of those "unavailable" should have been a learn.  Surely out of 60, one was valid to autolearn.

I don't know what to look for next to troubleshoot.  Sure hoping it's just a permissions issue.

I'm back to a brick wall.  How can I help you help me?  

