[Bug 6389] New: FPs on DOS_HIGHBIT_HDRS_BODY

bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6389

           Summary: FPs on DOS_HIGHBIT_HDRS_BODY
           Product: Spamassassin
           Version: 3.3.1
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Rules
        AssignedTo: [hidden email]
        ReportedBy: [hidden email]


Created an attachment (id=4721)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4721)
Sample FP message

I've seen a few FPs on this rule from genuine ham sent by one of my colleagues
using Thunderbird 3.0.4 - not all her mail, but specifically replies to certain
messages with UTF-8 encoding.

I'm seeing very few occurrences, but given the high default score of this rule
(3.8) I've set the severity to normal rather than minor.

--
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

--- Comment #1 from [hidden email] 2010-04-01 15:15:13 UTC ---
Created an attachment (id=4730)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4730)
a very simple test email written in Chinese that triggers DOS_HIGHBIT_HDRS_BODY

[hidden email] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--- Comment #2 from [hidden email] 2010-04-01 15:21:32 UTC ---
I would say this rule is unfriendly to non-English email. I have attached an
email written in Chinese whose From header contains "李耀宗" (my Chinese name)
and whose Subject and body are both "這是測試" (meaning "This is a test").

I may have to disable this rule completely on my mail server. It is totally
unfriendly to Chinese. There are similar rules that are unfriendly to
non-English mail, as in bug 5859 (which I also reported, and which has still
not been fixed after 2 years).

--- Comment #3 from [hidden email] 2010-04-01 15:59:26 UTC ---
(In reply to comment #2)
> I would say this rule is unfriendly to non-English email. I have attached an
> email written in Chinese whose From header contains "李耀宗" (my Chinese name)
> and whose Subject and body are both "這是測試" (meaning "This is a test").
>
> I may have to disable this rule completely on my mail server. It is totally
> unfriendly to Chinese. There are similar rules that are unfriendly to
> non-English mail, as in bug 5859 (which I also reported, and which has still
> not been fixed after 2 years).

Further comment: this rule would be triggered on practically all emails written
in Chinese/Japanese/Korean (or any other multibyte charset), which will
probably contain such characters in the From/Subject headers. I would urge
that this rule be removed.

Adam Katz <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--- Comment #4 from Adam Katz <[hidden email]> 2010-04-06 23:35:26 UTC ---
Regarding comment 0 and its sample FP attachment 4721, it looks like that
should have been ALL_TRUSTED (see the documentation for internal_networks).
While this doesn't solve the bug, it would help alleviate the
spammy-messages-from-colleagues problem.

Hm.  This header from attachment 4730 is quite interesting:

X-MIME-Autoconverted: from quoted-printable to 8bit by popo.ctimail.com id
    o31FCcI16161

I believe this is reporting that ctimail's mail system converted the
quoted-printable headers to 8bit, which triggered the rule.  Plugging that
header into google shows 19k hits, which is small but not negligible.  Even my
own sendmail server has added it in the past.  Comparative data: X-Spam-Status
(236k), X--MailScanner (10k), X-Spam-Flag (27k), X-Greylist (17k), X-X-Sender
(9k), X-Sieve (7k), X-Received (16k) ... (searches performed in quotes with a
second query being "Message-ID" to ensure we're looking at email headers).

I've placed a possible fix into our QA system (20_bug_6389.cf in my sandbox) to
sanity-check it, containing the following code (the first rule is just a
popularity test for that header):

header __HAS_XMIME_AUTOCONV     exists:X-MIME-Autoconverted
header __MIME_QP_TO_8BIT X-MIME-Autoconverted =~ /from quoted-printable to 8bit/
meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJECT_ENCODED_B64 && __FROM_ENCODED_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT
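
The __MIME_QP_TO_8BIT test above can be sanity-checked outside SpamAssassin.
The following is only a minimal Python sketch, not part of the proposed fix:
the helper name has_qp_to_8bit and the abbreviated sample message are
illustrative assumptions, and the continuation line is given the leading
whitespace that a folded header would normally carry.

```python
import re
from email import message_from_string

# Same pattern as the proposed __MIME_QP_TO_8BIT header rule.
QP_TO_8BIT = re.compile(r"from quoted-printable to 8bit")


def has_qp_to_8bit(raw_message: str) -> bool:
    """Return True if X-MIME-Autoconverted reports a QP-to-8bit conversion."""
    msg = message_from_string(raw_message)
    value = msg.get("X-MIME-Autoconverted")
    if value is None:
        return False
    # Unfold the header so a match cannot be broken by a folded line.
    unfolded = re.sub(r"\r?\n[ \t]+", " ", str(value))
    return QP_TO_8BIT.search(unfolded) is not None


# Hypothetical, shortened message carrying the header from attachment 4730.
SAMPLE = (
    "X-MIME-Autoconverted: from quoted-printable to 8bit by popo.ctimail.com id\n"
    " o31FCcI16161\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(has_qp_to_8bit(SAMPLE))
    print(has_qp_to_8bit("Subject: test\n\nbody\n"))
```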

Sadly, this doesn't help the first sample. Appending "&& !__RCVD_VIA_APNIC_LE"
would also fail to solve it since it is from France and not Asia. According to
yesterday's numbers, that extra requirement would also reduce the spam hit by
43% and the ham by under 20%, reducing 1.1268% spam to 0.7423% and the ham to
somewhere between 0.0261% and the current 0.0326%.
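
As a quick arithmetic check on the figures above, here is a small Python
sketch (the function name relative_reduction is just an illustrative helper).
Taking the quoted percentages at face value, the ham figure does come out just
under 20%, while the spam figures work out to a relative reduction of roughly
34% rather than 43%, so one of the quoted spam numbers may have been mistyped.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a hit rate drops when going from `before` to `after`."""
    return (before - after) / before


# Hit rates (in percent of corpus) quoted in the paragraph above.
spam_cut = relative_reduction(1.1268, 0.7423)
ham_cut = relative_reduction(0.0326, 0.0261)

if __name__ == "__main__":
    print(f"spam hits reduced by {spam_cut:.1%}")
    print(f"ham hits reduced by at most {ham_cut:.1%}")
```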

I'm disheartened by the French FP as it was composed with the latest version of
Thunderbird (3.0.4, WinXP, French build), but at least configuring
internal_networks would solve it for that particular user's internal company
mail.  For a full fix, I can think of nothing but removing this rule.  The
question becomes:  how many FPs does this rule really create, i.e. is this an
isolated incident?

--- Comment #5 from [hidden email] 2010-04-07 06:20:44 UTC ---
(In reply to comment #4)

According to my email sample (attachment 4730), the email is scanned by
SpamAssassin before the QP-to-8bit conversion (note the mail id o31FCcI16161):

Received: from smtp1o.ctimail.com (smtp1 [203.186.94.57])
    by popo.ctimail.com (8.11.1/8.11.1) with ESMTP id o31FCcI16161
    for <[hidden email]>; Thu, 1 Apr 2010 23:12:38 +0800 (CST)
Received: from iguard1-206.hkbn.net (iguard1-206.hkbn.net [203.186.220.206])
    by smtp1o.ctimail.com (8.12.11/8.12.11) with ESMTP id o31FCalG014728
    for <[hidden email]>; Thu, 1 Apr 2010 23:12:38 +0800 (HKT)
Received: from violet.alumni.cuhk.net ([202.45.188.23])
  by iguard1.hkbn.net with ESMTP; 01 Apr 2010 23:12:37 +0800
Received: from asavgw1.alumni.cuhk.net (asavgw1.alumni.cuhk.net [202.45.188.44])
    by violet.alumni.cuhk.net (8.14.3/8.14.3) with ESMTP id o31FCUvr000701
    for <[hidden email]>; Thu, 1 Apr 2010 23:12:31 +0800
Received: from ieaa.ie.cuhk.edu.hk ([137.189.97.6])
  by asavgw1.alumni.cuhk.net with ESMTP; 01 Apr 2010 23:12:36 +0800
Received: from smtp.ctimail.com ([203.186.94.58] helo=smtpo.ctimail.com)
    by ieaa.ie.cuhk.edu.hk with esmtp (Exim 4.63)
    (envelope-from <[hidden email]>)
    id 1NxM4R-0006GD-8l
    for [hidden email]; Thu, 01 Apr 2010 23:12:36 +0800
Received: from [127.0.0.1] (119247234247.ctinets.com [119.247.234.247])
    by smtpo.ctimail.com (8.12.11/8.12.11) with ESMTP id o31FCROw020860
    for <[hidden email]>; Thu, 1 Apr 2010 23:12:27 +0800 (HKT)
X-MIME-Autoconverted: from quoted-printable to 8bit by popo.ctimail.com id
    o31FCcI16161

I would say, the real bug should be in 20_html_tests.cf, which says

body __HIGHBITS                     /(?:[\x80-\xff].?){4}/

I think it should be
rawbody __HIGHBITS                     /(?:[\x80-\xff].?){4}/

--- Comment #6 from [hidden email] 2010-04-07 06:25:04 UTC ---
> I would say, the real bug should be in 20_html_tests.cf, which says
>
> body __HIGHBITS                     /(?:[\x80-\xff].?){4}/
>
> I think it should be
> rawbody __HIGHBITS                     /(?:[\x80-\xff].?){4}/

Sorry, rawbody doesn't fix it either. According to

http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Conf.html

"The 'raw body' of a message is the raw data inside all textual parts. The text
will be decoded from base64 or quoted-printable encoding..."

That is, almost all non-English mail would trigger this rule...
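
To illustrate the point, here is a minimal Python sketch (not SpamAssassin
code) showing that the same text decodes to identical high-bit octets
regardless of the transfer encoding. It assumes the test string is sent as
UTF-8, which may differ from the charset actually used in attachment 4730.

```python
import base64
import quopri

# The test string ("This is a test"), assumed here to be UTF-8 encoded.
utf8_body = "這是測試".encode("utf-8")

# Two transfer encodings of the same octets; both are pure ASCII on the wire.
as_base64 = base64.b64encode(utf8_body)
as_qp = quopri.encodestring(utf8_body)

# Decoding either form, as the quoted documentation describes, yields the
# original octets again, and every one of them has the high bit set.
from_base64 = base64.b64decode(as_base64)
from_qp = quopri.decodestring(as_qp)

if __name__ == "__main__":
    print(as_base64)
    print(as_qp)
    print(from_base64 == from_qp == utf8_body)
    print(all(octet >= 0x80 for octet in from_base64))
```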

--- Comment #7 from John Wilcock <[hidden email]> 2010-04-07 07:39:55 UTC ---
Created an attachment (id=4735)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4735)
Another FP

My first FP sample was indeed "saved" by ALL_TRUSTED (and BAYES_00).

Here's another one, an opt-in newsletter that was only saved by
RCVD_IN_RP_CERTIFIED and RCVD_IN_RP_SAFE (it also had a valid DKIM signature,
which I've no doubt invalidated by obfuscating the recipient's address).

Any messages from people with highbit characters in their names, and with
highbit characters in the subject and body will potentially hit the rule - and
that is inevitably a fairly common scenario for non-English mail.

What appears to be saving this rule from more FPs is the check for base64
encoding of the headers. Thunderbird, for instance, appears to use
quoted-printable encoding for ISO-8859-1, its default charset, and only
switches to base64 for UTF-8 and other multibyte charsets.

Does the rule actually hit much spam that wouldn't be caught otherwise? On my
two low-volume servers I have only one spam hit that would have scored under 10
points without this rule, and none that would have been FNs.

John Wilcock <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--- Comment #8 from John Wilcock <[hidden email]> 2010-04-07 08:18:50 UTC ---
(In reply to comment #7)
> What appears to be saving this rule from more FPs is the check for base64
> encoding of the headers. Thunderbird, for instance, appears to use
> quoted-printable encoding for ISO-8859-1, its default charset, and only
> switches to base64 for UTF-8 and other multibyte charsets.

Thinking a bit more about this, could this be the basis for reducing the FP
rate of the rule?

Something like the following:

header __FROM_1BYTE_B64 From:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
header __SUBJ_1BYTE_B64 Subject:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i

meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT
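
For anyone wanting to try the header pattern above outside SpamAssassin, here
is a minimal Python sketch. The helper names and sample headers are
illustrative assumptions; only the regular expression is taken from the
proposal.

```python
import base64
import re

# Pattern copied from the proposed __FROM_1BYTE_B64 / __SUBJ_1BYTE_B64 rules.
ONE_BYTE_B64 = re.compile(
    r"=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?", re.IGNORECASE
)


def is_one_byte_b64(raw_header: str) -> bool:
    """True if the raw header has a base64 encoded-word in a listed charset."""
    return ONE_BYTE_B64.search(raw_header) is not None


def b64_word(charset: str, text: str, codec: str) -> str:
    """Build a base64 encoded-word (hypothetical helper for the samples)."""
    payload = base64.b64encode(text.encode(codec)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


SAMPLES = {
    "latin1_b64": b64_word("ISO-8859-1", "Jérôme", "latin-1"),
    "latin9_b64": b64_word("iso-8859-15", "Jérôme", "iso8859-15"),
    "cp1252_b64": b64_word("windows-1252", "Jérôme", "cp1252"),
    "utf8_b64": b64_word("UTF-8", "李耀宗", "utf-8"),
    "latin1_qp": "=?ISO-8859-1?Q?J=E9r=F4me?=",
    "koi8r_b64": b64_word("KOI8-R", "тест", "koi8_r"),
}

if __name__ == "__main__":
    for name, header in SAMPLES.items():
        print(name, header, is_one_byte_b64(header))
```

Note that, as written, the koi-8r? alternative does not match the registered
charset name KOI8-R, so the last sample is not flagged; whether that matters
depends on how senders actually label that charset.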

--- Comment #9 from [hidden email] 2010-04-07 11:20:22 UTC ---
(In reply to comment #8)

> Thinking a bit more about this, could this be the basis for reducing the FP
> rate of the rule?
>
> Something like the following:
>
> header __FROM_1BYTE_B64 From:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
> header __SUBJ_1BYTE_B64 Subject:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
>
> meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

In my opinion, __HIGHBITS is fundamentally flawed and should never have existed.
Whether a "body" or a "rawbody" rule is used, the message body is always
decoded from base64 or quoted-printable. And __HIGHBITS matches runs of
high-bit octets (to be exact, [one high-bit octet + any character] repeated 4
times), which is always the case for East Asian languages.
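
The behaviour described above can be reproduced with a minimal Python sketch.
It assumes the rule sees the decoded body as raw octets; the function name
hits_highbits and the sample strings are illustrative only.

```python
import re

# Same pattern as __HIGHBITS, applied to bytes: one high-bit octet followed
# by an optional arbitrary octet, repeated four times in a row.
HIGHBITS = re.compile(rb"(?:[\x80-\xff].?){4}")


def hits_highbits(decoded_body: bytes) -> bool:
    """True if the decoded body contains a run that __HIGHBITS would match."""
    return HIGHBITS.search(decoded_body) is not None


SAMPLES = {
    "chinese_utf8": "這是測試".encode("utf-8"),
    "chinese_big5": "這是測試".encode("big5"),
    "english_ascii": b"This is a test",
    "french_latin1": "café au lait".encode("latin-1"),
}

if __name__ == "__main__":
    for name, body in SAMPLES.items():
        print(name, hits_highbits(body))
```

Under these assumptions, four Chinese characters are enough to match in either
multibyte encoding, while the ASCII and Latin-1 samples do not match.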

Is there any way to check the message body without first decoding it from
base64 or quoted-printable?

Justin Mason <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--- Comment #10 from Justin Mason <[hidden email]> 2010-04-07 11:53:33 UTC ---
(In reply to comment #9)
> In my opinion, __HIGHBITS is fundamentally flawed and should never have existed.
> Whether a "body" or a "rawbody" rule is used, the message body is always
> decoded from base64 or quoted-printable. And __HIGHBITS matches runs of
> high-bit octets (to be exact, [one high-bit octet + any character] repeated 4
> times), which is always the case for East Asian languages.

hi -- as far as I know, the intention of __HIGHBITS is to *detect* such
charsets as the East Asian ones, so that it can be used in meta rules to avoid
false positives.

--- Comment #11 from [hidden email] 2010-04-07 12:37:36 UTC ---

> hi -- as far as I know, the intention of __HIGHBITS is to *detect* such
> charsets as the East Asian ones, so that it can be used in meta rules to avoid
> false positives.

Understood. Then it is just DOS_HIGHBIT_HDRS_BODY that is flawed, not __HIGHBITS.

--- Comment #12 from Daryl C. W. O'Shea <[hidden email]> 2010-04-09 03:43:59 UTC ---
I agree, the rule looks bad.  I've commented it out.  It should disappear from
updates this weekend.

[dos@cyan dos]$ svn ci -m "bug 6389: comment out DOS_HIGHBIT_HDRS_BODY due to FPs"
Authentication realm: <https://svn.apache.org:443> ASF Committers
Password for 'dos':
Sending        dos/70_other.cf
Transmitting file data .
Committed revision 932235.

Daryl C. W. O'Shea <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #13 from Daryl C. W. O'Shea <[hidden email]> 2010-04-09 03:44:29 UTC ---
Closing as fixed.

--- Comment #15 from Adam Katz <[hidden email]> 2010-04-12 18:50:51 EDT ---
Just a follow-up because I had some investigations running when this was
closed...

Rules
------------------
# From rulesrc/sandbox/khopesh/20_bug_6389.cf on trunk at r932438
# http://svn.apache.org/viewvc/spamassassin/trunk/rulesrc/sandbox/khopesh/20_bug_6389.cf?revision=932438&view=markup

# just a raw numbers check:
header __HAS_XMIME_AUTOCONV exists:X-MIME-Autoconverted
tflags __HAS_XMIME_AUTOCONV nice

# possible fix to bug 6389
header __MIME_QP_TO_8BIT X-MIME-Autoconverted =~ /from quoted-printable to 8bit/
tflags __MIME_QP_TO_8BIT nice

# John Wilcock's proposed substitutions for __..._ENCODED_B64 (comment 8)
header __FROM_1BYTE_B64 From:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
header __SUBJ_1BYTE_B64 Subject:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i

meta DOS_HIGHBIT_HDRS_BODY_BUG6389 __FROM_NEEDS_MIME && __SUBJECT_ENCODED_B64 && __FROM_ENCODED_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

# Daryl O'Shea (DOS) + Adam Katz (KHOP) + John Wilcock version
meta FROM_SUBJ_BODY_8BIT __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT

# assuming recipients won't also be highbit'd ("highbitten?")
header __TO_1BYTE_B64 To:raw =~ /=\?(?:iso-8859-1?\d|windows-125\d|koi-8r?)\?B\?/i
meta FROM_SUBJ_NOTO_BODY_8BIT __FROM_NEEDS_MIME && __SUBJ_1BYTE_B64 && __FROM_1BYTE_B64 && __SUBJECT_NEEDS_MIME && __HIGHBITS && !__MIME_QP_TO_8BIT && !__TO_1BYTE_B64


Results from 2010-04-11 (non-net run)
------------------

http://ruleqa.spamassassin.org/20100411-r932853-n/%2FDOS_HIGHB|MIME_QP_TO_|HAS_XMIME_|_1BYTE_B64|_ENCODED_B64|FROM_SUBJ_

  SPAM%     HAM%     S/O    RANK   SCORE  NAME
 1.1775   0.0359   0.970    0.82    0.01  T_DOS_HIGHBIT_HDRS_BODY_BUG6389
 0.0718   0.0021   0.972    0.66    0.01  T_FROM_SUBJ_BODY_8BIT
 0.0714   0.0021   0.972    0.66    0.01  T_FROM_SUBJ_NOTO_BODY_8BIT
 0.5069   0.2155   0.702    0.62   (n/a)  __SUBJ_1BYTE_B64
 0.0928   0.1333   0.410    0.53   (n/a)  __FROM_1BYTE_B64
 2.3337   2.3339   0.500    0.51   (n/a)  __SUBJECT_ENCODED_B64
 1.3552   1.7032   0.443    0.50   (n/a)  __FROM_ENCODED_B64
 0.0004   0.1519   0.003    0.31   (n/a)  __TO_1BYTE_B64
 6.2081   1.0613   0.854    0.24   (n/a)  __HAS_XMIME_AUTOCONV
 6.1458   0.9837   0.862    0.24   (n/a)  __MIME_QP_TO_8BIT

That rules out the suggestions from comment 8.  Because Daryl removed the
original rule, it's not listed here, but my modification did little to nothing.
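
For readers unfamiliar with the table, the S/O column appears to be the share
of a rule's hits that are spam, i.e. SPAM% / (SPAM% + HAM%). That reading is
an inference from the figures, which it reproduces to three decimals, rather
than something stated in this thread. A minimal Python sketch (the function
name spam_over_overall is illustrative):

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's hits that are spam, from its SPAM% and HAM% rates."""
    return spam_pct / (spam_pct + ham_pct)


# (SPAM%, HAM%) pairs copied from the 2010-04-11 non-net table above.
ROWS = {
    "T_DOS_HIGHBIT_HDRS_BODY_BUG6389": (1.1775, 0.0359),
    "__SUBJ_1BYTE_B64": (0.5069, 0.2155),
    "__FROM_1BYTE_B64": (0.0928, 0.1333),
    "__SUBJECT_ENCODED_B64": (2.3337, 2.3339),
}

if __name__ == "__main__":
    for name, (spam_pct, ham_pct) in ROWS.items():
        print(f"{spam_over_overall(spam_pct, ham_pct):.3f}  {name}")
```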

A breakdown of T_DOS_HIGHBIT_HDRS_BODY_BUG6389 scores:

  scoremap  ham:  0  79.31%   69 *******************************
  scoremap  ham:  1   3.45%    3 *
  scoremap  ham:  2  16.09%   14 ******
  scoremap  ham:  3   1.15%    1
  scoremap spam:  0   2.85%  413 *
  scoremap spam:  1   0.15%   22
  scoremap spam:  2  18.89% 2734 *******
  scoremap spam:  3   3.70%  536 *
  scoremap spam:  4   4.40%  637 *
  scoremap spam:  5  12.40% 1794 ****
  scoremap spam:  6   5.51%  797 **
  scoremap spam:  7   7.81% 1130 ***
  scoremap spam:  8  10.22% 1479 ****
  scoremap spam:  9   5.66%  819 **
  scoremap spam: 10   7.17% 1037 **
  scoremap spam: 11   5.80%  839 **
  scoremap spam: 12   4.35%  629 *
  scoremap spam: 13   2.74%  396 *
  scoremap spam: 14   2.64%  382 *
  scoremap spam: 15   1.53%  221
  scoremap spam: 16   1.29%  187
  scoremap spam: 17   0.98%  142
  scoremap spam: 18   0.53%   76
  scoremap spam: 19   0.53%   76
  scoremap spam: 20   0.27%   39
  scoremap spam: 21   0.20%   29
  scoremap spam: 22   0.12%   17
  scoremap spam: 23   0.08%   12
  scoremap spam: 24   0.10%   15
  scoremap spam: 25   0.01%    2
  scoremap spam: 26   0.01%    2
  scoremap spam: 28   0.02%    3
  scoremap spam: 29   0.01%    2
  scoremap spam: 30   0.01%    1
  scoremap spam: 32   0.02%    3
  scoremap spam: 33   0.01%    1

Overlap Spam (50% and up)
  x%  of this rule x             also hit this rule y,     y% of y also hit x
 76%  T_DOS_HIGHBIT_HDRS...6389  T_FSL_HELO_NON_FQDN_2     1%
 72%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_PBL               1%
 68%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_XBL               1%
 55%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CHECK              0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_51_100    0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RDNS_NONE                 1%
 51%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BL_SPAMCOP_NET    1%

Note that despite this being a non-net run, the overlap still has RDNS_NONE as
the only matching (published) non-net rule that overlapped over 50%.  In a scan
completely lacking network tests, the score-map would be even lower and the
rule would appear more valuable.


Results from 2010-04-10 (net run)
------------------

http://ruleqa.spamassassin.org/20100410-r932679-n/%2FDOS_HIGHB%7CMIME_QP_TO_%7CHAS_XMIME_%7C_1BYTE_B64%7C_ENCODED_B64%7CFROM_SUBJ_

  SPAM%     HAM%     S/O    RANK   SCORE  NAME
 1.1755   0.0116   0.990    0.86    0.01  T_DOS_HIGHBIT_HDRS_BODY_BUG6389
 0.5164   0.0390   0.930    0.76   (n/a)  __SUBJ_1BYTE_B64
 0.0685        0   1.000    0.66    0.01  T_FROM_SUBJ_BODY_8BIT
 0.0682        0   1.000    0.66    0.01  T_FROM_SUBJ_NOTO_BODY_8BIT
 0.0854   0.0435   0.663    0.61   (n/a)  __FROM_1BYTE_B64
 2.3165   2.0477   0.531    0.52   (n/a)  __SUBJECT_ENCODED_B64
 1.3498   1.6534   0.449    0.51   (n/a)  __FROM_ENCODED_B64
 0.0004   0.0099   0.039    0.47   (n/a)  __TO_1BYTE_B64
 6.2616   1.1081   0.850    0.23   (n/a)  __HAS_XMIME_AUTOCONV
 6.1999   1.0350   0.857    0.23   (n/a)  __MIME_QP_TO_8BIT

A breakdown of T_DOS_HIGHBIT_HDRS_BODY_BUG6389 scores:

  scoremap  ham: -2  65.38%   17 **************************
  scoremap  ham:  0  26.92%    7 **********
  scoremap  ham:  1   3.85%    1 *
  scoremap  ham:  4   3.85%    1 *
  scoremap spam:  0   0.05%    7
  scoremap spam:  1   0.20%   29
  scoremap spam:  2   0.78%  113
  scoremap spam:  3   0.55%   80
  scoremap spam:  4   1.11%  161
  scoremap spam:  5   1.56%  226
  scoremap spam:  6   2.57%  373 *
  scoremap spam:  7   3.76%  546 *
  scoremap spam:  8   4.86%  705 *
  scoremap spam:  9   6.58%  955 **
  scoremap spam: 10   7.68% 1114 ***
  scoremap spam: 11   8.85% 1284 ***
  scoremap spam: 12   8.48% 1230 ***
  scoremap spam: 13   8.19% 1188 ***
  scoremap spam: 14   8.07% 1171 ***
  scoremap spam: 15   6.81%  989 **
  scoremap spam: 16   6.02%  873 **
  scoremap spam: 17   5.29%  767 **
  scoremap spam: 18   4.36%  632 *
  scoremap spam: 19   3.41%  495 *
  scoremap spam: 20   2.56%  371 *
  scoremap spam: 21   2.06%  299
  scoremap spam: 22   1.45%  211
  scoremap spam: 23   1.13%  164
  scoremap spam: 24   0.87%  126
  scoremap spam: 25   0.74%  108
  scoremap spam: 26   0.59%   85
  scoremap spam: 27   0.28%   40
  scoremap spam: 28   0.19%   27
  scoremap spam: 29   0.10%   14
  scoremap spam: 30   0.24%   35
  scoremap spam: 31   0.10%   14
  scoremap spam: 32   0.11%   16
  scoremap spam: 33   0.10%   14
  scoremap spam: 34   0.05%    7
  scoremap spam: 35   0.08%   11
  scoremap spam: 36   0.08%   12
  scoremap spam: 37   0.03%    4
  scoremap spam: 38   0.03%    4
  scoremap spam: 39   0.01%    2
  scoremap spam: 40   0.03%    4
  scoremap spam: 41   0.01%    2
  scoremap spam: 42   0.01%    2
  scoremap spam: 43   0.01%    1
  scoremap spam: 47   0.01%    1

Overlap Spam (50% and up)
  x%  of this rule x             also hit this rule y,     y% of y also hit x
 95%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BRBL_LASTEXT      1%
 76%  T_DOS_HIGHBIT_HDRS...6389  T_FSL_HELO_NON_FQDN_2     1%
 73%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_PBL               1%
 68%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_XBL               1%
 61%  T_DOS_HIGHBIT_HDRS...6389  T_RCVD_IN_ANBREP_BL       1%
 56%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CHECK              0%
 54%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_51_100    0%
 53%  T_DOS_HIGHBIT_HDRS...6389  RDNS_NONE                 1%
 51%  T_DOS_HIGHBIT_HDRS...6389  RCVD_IN_BL_SPAMCOP_NET    1%
 50%  T_DOS_HIGHBIT_HDRS...6389  RAZOR2_CF_RANGE_E4_51_100 3%


Conclusion
------------------

This rule is not worthwhile in network-enabled checks.  Without network tests,
this rule may be extremely valuable.  Assuming we're interested in developing
offline-only tests, this is worth revisiting once we have more corpora from
areas that use non-Latin character sets (specifically China), especially if we
can pin it to not fire on network tests.


I have removed the tests from SVN (satisfying comment #14).  They will
disappear from the ruleqa system in the next day or two.

$ svn delete --force 20_bug_6389.cf
D         20_bug_6389.cf
$ svn commit -m "Bug closed.  I posted my observations, including this file's contents and stats for ent and non-net runs, on bug 6389, comment 14" 20_bug_6389.cf
Deleting       20_bug_6389.cf

Committed revision 933340.
$

Re: [Bug 6389] FPs on DOS_HIGHBIT_HDRS_BODY

Daryl C. W. O'Shea
On 12/04/2010 6:50 PM, [hidden email] wrote:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6389
>
> --- Comment #15 from Adam Katz <[hidden email]> 2010-04-12 18:50:51 EDT ---
> Just a follow-up because I had some investigations running when this was
> closed...

Note that stuff that's under testing, especially when investigating
something we think may be causing harm, should be committed with "tflags
nopublish" so that they don't get auto-promo'd into a rule update.

Daryl